step 15 · ship · production
Deploy it for real
Modal vs Replicate vs a $20/mo VPS vs a serverless GPU. Trade-offs, configs, and the deploy command.
You’ve got a service that works on your laptop. You’ve got observability, evals, and tuned cost/latency. The only thing between you and a real deployment is picking a place to put it. The choice depends on your traffic shape and your tolerance for ops.
The four-way choice
There’s no “best” — only “best for your shape.” Here’s what actually matters:
| Target | Cold start | Always-on cost | Scale-to-zero | GPU options | Ops effort | Best when |
|---|---|---|---|---|---|---|
| Modal | ~30 s | $0 idle | yes | T4, A10, A100, H100 | minimal | bursty traffic; weekend/personal projects; rapid iteration |
| Replicate | ~30 s | $0 idle | yes | T4, A40, A100 | minimal | shipping a public demo; you want a hosted UI in 5 minutes |
| RunPod (serverless) | ~30 s | $0 idle | yes | wide range, cheap | low | cost-sensitive; willing to tinker; need cheap A100 hours |
| VPS + GPU | none | $200–800/mo | no | one box, your choice | medium | steady traffic; predictable spend; control freaks (in a good way) |
| (reference) AWS/GCP | none | $1500+/mo | possible | full menu | high | enterprise scale; existing AWS/GCP shop; have an SRE team |
A few things this table makes clear that aren’t in most “deploy your LLM” posts:
- Cold start is real. 30 seconds from a sleep state to a warm GPU. Tolerable for a webhook or a batch job, brutal for a chatbot. Mitigations come up below.
- Always-on GPU is expensive. A single A10 ($0.30–0.60/hr) is $200–400/mo if it’s always running. An A100 is $1500+/mo. Sticker shock catches people late.
- AWS / GCP only makes sense at scale. If you’re solo or a small team, the ops effort dwarfs the platform’s value. We won’t cover them in this article — pick Modal or RunPod, ship faster.
We’ll work through Modal in detail (it’s what most readers should pick) and sketch the others.
Path A — Modal (recommended)
Modal lets you deploy Python code to GPUs with a decorator. The mental model: “I have a function. Run it on an A10. Stay warm for 5 minutes after the last call. Scale to zero.” A handful of decorator arguments is the whole config; the deploy command is `modal deploy`.
The deploy file
```python
# deploy/modal_app.py
import modal

# Mount your repo into the container.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install_from_pyproject("pyproject.toml")
    .add_local_dir("stack", remote_path="/app/stack")
    .add_local_dir("evals", remote_path="/app/evals")
)

# A persistent volume for the model weights — downloaded once, reused.
volume = modal.Volume.from_name("models", create_if_missing=True)

app = modal.App("stack")

@app.function(
    image=image,
    gpu="A10G",
    volumes={"/cache": volume},
    secrets=[modal.Secret.from_name("stack-secrets")],
    timeout=600,
    # Stay warm 5 minutes after the last request, then scale down.
    container_idle_timeout=300,
    keep_warm=1,  # one container always warm
    # Each container handles up to 8 in-flight requests (vLLM batches them).
    allow_concurrent_inputs=8,
)
@modal.asgi_app()
def fastapi_app():
    """Serve stack/server.py:app as an ASGI endpoint."""
    import os

    os.environ["HF_HOME"] = "/cache/hf"
    os.environ["LLM_BACKEND"] = "vllm"
    # Import inside the function so it resolves in the container,
    # where the dependencies are installed.
    from stack.server import app

    return app
```
A few things worth pointing at:
- `gpu="A10G"` — Modal’s cheapest reasonable LLM GPU, ~$0.45/hr. Bump to an A100 with `gpu="A100"`; for multi-GPU, `gpu=modal.gpu.A100(count=2)`.
- `volumes={"/cache": volume}` — model weights download once to the persistent volume; subsequent cold starts mount it instead of re-downloading. Cuts cold start from minutes to seconds.
- `keep_warm=1` — one container is always running, so the first request after a quiet period is fast. Costs ~$300/mo at A10 prices for true 24/7 warmth. Set it to 0 for pure scale-to-zero (cheapest), 1 for a fast first request (still cheap), or 5 for high traffic.
- `allow_concurrent_inputs=8` — vLLM can handle multiple in-flight requests via continuous batching; this tells Modal to send up to 8 at once instead of spinning up a new container per request. Throughput up, cost down.
The startup hook
vLLM is heavy to load. Tell Modal to load it once per container, not per request:
```python
# deploy/modal_app.py (continued)
@app.cls(
    image=image,
    gpu="A10G",
    volumes={"/cache": volume},
    container_idle_timeout=300,
    allow_concurrent_inputs=8,
)
class StackService:
    @modal.enter()
    def load_model(self):
        """Run once when the container starts."""
        import os
        import subprocess
        import time

        import httpx

        os.environ["HF_HOME"] = "/cache/hf"
        # Boot vLLM as a subprocess; it binds to localhost:8000.
        self.vllm = subprocess.Popen([
            "vllm", "serve",
            "meta-llama/Llama-3.1-8B-Instruct",
            "--gpu-memory-utilization", "0.85",
            "--enable-prefix-caching",
        ])
        # Wait for vLLM to be ready (poll its /health endpoint).
        for _ in range(60):
            try:
                if httpx.get("http://localhost:8000/health").status_code == 200:
                    break
            except httpx.ConnectError:
                pass
            time.sleep(2)
        else:
            raise RuntimeError("vLLM did not become healthy in time")

    @modal.method()
    def chat(self, messages: list, **kwargs) -> dict:
        from stack.llm import LLM, VLLM_CONFIG

        llm = LLM(VLLM_CONFIG)
        return llm.chat(messages, **kwargs)
```
For most readers, start with the simpler @modal.asgi_app() pattern from earlier and let vLLM run inside Modal’s web container. The class-based pattern is for when you want fine-grained startup control.
Deploying
```bash
# One-time
modal token new

# Set your secrets
modal secret create stack-secrets \
  STACK_API_KEYS=prod-key-1,prod-key-2 \
  HF_TOKEN=hf_xxxxx

# Deploy
modal deploy deploy/modal_app.py
# Output:
#   View Deployment: https://modal.com/apps/your-username/stack
#   Web URL: https://your-username--stack-fastapi-app.modal.run
```
Hit the URL:
```bash
curl https://your-username--stack-fastapi-app.modal.run/v1/chat/completions \
  -H "Authorization: Bearer prod-key-1" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```
First call: 30+ seconds (cold start). Second call within 5 minutes: 1–2 seconds. You’re live.
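If surfacing a 30-second cold start to users is unacceptable, the caller can absorb it with a generous timeout plus retries. A stdlib-only client sketch; `backoff_schedule` and `chat` are illustrative names, not part of the stack, and the URL and key are placeholders:

```python
import json
import time
import urllib.error
import urllib.request

def backoff_schedule(attempts: int, base: float = 2.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff delays, capped: 2, 4, 8, 16, 30, 30, ..."""
    return [min(base ** (i + 1), cap) for i in range(attempts)]

def chat(url: str, api_key: str, messages: list, attempts: int = 4) -> dict:
    """POST to the deployed endpoint, retrying through a cold start."""
    payload = json.dumps({"messages": messages}).encode()
    last_err = None
    for delay in backoff_schedule(attempts):
        req = urllib.request.Request(
            url,
            data=payload,
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
        )
        try:
            # A 60 s timeout covers a cold start; warm requests return in 1-2 s.
            with urllib.request.urlopen(req, timeout=60) as resp:
                return json.loads(resp.read())
        except (urllib.error.URLError, TimeoutError) as err:
            last_err = err
            time.sleep(delay)
    raise RuntimeError(f"endpoint unreachable after {attempts} attempts") from last_err
```

The same shape works against any of the four targets; only the URL changes.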
Path B — Replicate
Replicate is closer to a “model-as-a-product” platform. You package your inference code in a Cog config, push it, and Replicate hosts it with a built-in playground UI. Best for shipping a public demo where you want users to be able to interact without API keys.
```yaml
# cog.yaml
build:
  gpu: true
  cuda: "12.1"
  python_version: "3.11"
  python_packages:
    - "vllm==0.6.0"
    - "fastapi==0.115.0"
    - "uvicorn==0.30.0"
    - "httpx==0.27.0"
predict: "predict.py:Predictor"
```
```python
# predict.py
import subprocess
import time

import httpx
from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self):
        # Boot vLLM once per container; predictions proxy to it.
        self.proc = subprocess.Popen([
            "vllm", "serve",
            "meta-llama/Llama-3.1-8B-Instruct",
            "--gpu-memory-utilization", "0.85",
        ])
        # Wait for vLLM to be ready.
        for _ in range(60):
            try:
                if httpx.get("http://localhost:8000/health").status_code == 200:
                    break
            except httpx.ConnectError:
                pass
            time.sleep(2)

    def predict(self, prompt: str = Input(description="User question")) -> str:
        r = httpx.post(
            "http://localhost:8000/v1/chat/completions",
            json={
                "model": "meta-llama/Llama-3.1-8B-Instruct",
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.0,
            },
            timeout=60,
        )
        return r.json()["choices"][0]["message"]["content"]
```
```bash
cog login
cog push r8.im/your-username/stack
```
You get a URL with a built-in UI, an API endpoint, and per-second billing. The trade-off vs Modal: less control over the HTTP API shape (Replicate wants single-input, single-output `predict()` calls), and the built-in UI is polished but rigid. Pick Replicate for public demos; pick Modal for production APIs.
Path C — RunPod serverless
RunPod is the cheapest of the scale-to-zero platforms — A100 hours often cost half what Modal charges. The trade-off: more setup, less polished tooling, and a smaller community.
The mental model: you build a Docker image, push it to a registry, and register it as a “serverless endpoint”; RunPod auto-scales workers. Each worker serves a single `/runsync` endpoint that takes a JSON payload and returns a JSON response.
```dockerfile
# deploy/Dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
WORKDIR /app
RUN apt-get update && apt-get install -y python3-pip
# Install dependencies before copying the code so the layer is cached.
COPY requirements.txt /app/requirements.txt
RUN python3 -m pip install -r requirements.txt
COPY . /app
CMD ["python3", "-m", "stack.runpod_handler"]
```
```python
# stack/runpod_handler.py
import runpod

from stack.llm import LLM

llm = LLM()  # Instantiated once per worker.


def handler(event):
    body = event["input"]
    response = llm.chat(
        messages=body["messages"],
        temperature=body.get("temperature", 0.7),
    )
    return {"response": response}


runpod.serverless.start({"handler": handler})
```
Push the image, point a RunPod serverless endpoint at it, and set `min_workers=0`, `max_workers=4`. You get a `/runsync` URL and per-second GPU billing.
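From the calling side, RunPod’s synchronous endpoints expect the payload nested under an `input` key, matching the handler above. A stdlib sketch; the endpoint ID and API key are placeholders, and the URL shape should be checked against RunPod’s docs:

```python
import json
import urllib.request

def build_payload(messages: list, temperature: float = 0.7) -> dict:
    """Wrap the request the way the handler expects: under an 'input' key."""
    return {"input": {"messages": messages, "temperature": temperature}}

def runsync(endpoint_id: str, api_key: str, messages: list) -> dict:
    """Call a RunPod serverless endpoint synchronously."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(messages)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    # Generous timeout: a cold worker may need to pull the model first.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())
```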
Path D — Just use a VPS
Sometimes the right answer is “rent a box and put your service on it.” The math:
- A single H100 spot instance on Lambda Cloud: ~$2.50/hr, $1800/mo always-on.
- A single A100 on Lambda: ~$1.20/hr, $864/mo.
- A single A10 on a small cloud (Hetzner, OVH, Latitude): ~$0.40/hr, $300/mo.
If you have steady traffic that keeps a GPU above roughly 30% utilization 24/7, a VPS is usually cheaper than scale-to-zero; the exact breakeven depends on how large a premium the serverless platform charges over a bare rental. The Modal tax is a convenience premium; if you don’t need the convenience, skip the tax.
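The breakeven is worth computing for your own numbers. A quick sketch with hypothetical rates (illustrative, not quotes from any provider):

```python
def breakeven_utilization(serverless_hourly: float, vps_monthly: float,
                          hours_per_month: float = 730.0) -> float:
    """Fraction of the month the GPU must be busy before an
    always-on box beats per-second serverless billing."""
    return vps_monthly / (serverless_hourly * hours_per_month)

# Hypothetical: serverless A10 at $1.40/hr vs a $300/mo always-on box.
# Serverless rates typically carry a premium over a bare rental.
print(round(breakeven_utilization(1.40, 300.0), 2))  # prints 0.29
```

If your serverless rate is close to the bare rental rate, the breakeven climbs toward 90%+ and scale-to-zero wins almost everywhere; the premium is what moves it.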
```bash
# On a fresh Ubuntu 22.04 box with the NVIDIA driver installed:
git clone your-repo.git
cd your-repo
docker compose up -d
# docker-compose.yml runs vllm + your fastapi server + nginx + certbot.
```
That’s the whole deploy story for a VPS. You’ll also want:
- A systemd unit to restart the service on crash.
- nginx + certbot to terminate TLS at a real domain.
- A backup of your sqlite-vec store so retrieval data isn’t lost on a disk failure.
- Fail2ban or Cloudflare to absorb script-kiddie traffic.
None of this is hard. Most teams that go VPS appreciate that they’re back in control of the box. The downside: you’re now running a server, with all the 2-a.m. surprises that implies.
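The systemd bullet is a one-file job. A minimal sketch, assuming the repo lives at /opt/your-repo and docker compose’s `restart:` policies handle in-container crashes (unit name and paths are assumptions):

```ini
# /etc/systemd/system/stack.service
[Unit]
Description=LLM stack (vLLM + FastAPI via docker compose)
After=docker.service network-online.target
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=true
WorkingDirectory=/opt/your-repo
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down
# Per-container restarts are delegated to compose's restart: policies.

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now stack`; the containers then come back after a reboot without anyone SSHing in.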
What to do for every deploy target
A short checklist that applies regardless of where you put the service:
- Pin your model version. “meta-llama/Llama-3.1-8B-Instruct” is fine; pinning a specific revision or commit sha is better. Avoid surprises when Hugging Face pushes a tokenizer update.
- Pin your dependencies. Lock-file your `pyproject.toml`. Reproducibility is a deploy concern.
- Health endpoint that hits the model. A `/healthz` that runs a 1-token generation catches “service is up but model is broken” failures that shallow health checks miss.
- Smoke-test in CI before deploying. Run the eval suite from step 13 against the new container, not the local dev one. Catch regressions before they hit users.
- Trace from minute one. The Phoenix endpoint env var goes in your secret store; your service emits production traces from the first deploy. Don’t ship blind.
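The “health endpoint that hits the model” item deserves a sketch. It’s framework-agnostic on purpose: inject the model call so the check is testable with a stub. `generate` and the returned status shape are assumptions, not an API from this stack:

```python
import time

def deep_health(generate, timeout_s: float = 10.0) -> dict:
    """Run a 1-token generation and report whether the model answered.

    `generate` is whatever callable fronts your model, e.g. a thin
    wrapper over the vLLM chat endpoint with max_tokens=1.
    """
    start = time.monotonic()
    try:
        out = generate("ping", max_tokens=1)
        elapsed = time.monotonic() - start
        if elapsed > timeout_s:
            return {"status": "degraded", "latency_s": round(elapsed, 3)}
        ok = isinstance(out, str) and len(out) > 0
        return {"status": "ok" if ok else "degraded", "latency_s": round(elapsed, 3)}
    except Exception as err:
        # The process is up but the model can't answer: exactly the
        # failure a shallow "return 200" health check would miss.
        return {"status": "down", "error": str(err)}
```

Wire it to `/healthz` in your web framework and point the platform’s health check at it.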
What we’re not covering
Two things that come up that we’re skipping for good reason:
- Kubernetes for inference. Yes, you can run vLLM in a K8s cluster. Almost no one should at the scale you’re at. The ops cost outweighs the win. Kubernetes is the right answer at 50+ GPUs and a dedicated infra team; it’s the wrong answer at 1–4 GPUs and a small team. If you’re not sure, don’t.
- Multi-region deploys. Users far from your single GPU see higher latency. Solvable, but every cross-region replication scheme adds complexity. Most products are fine with one region until they pass 100K+ DAU. Defer.
Cross-references
- Modal docs — function decoration, GPU options, secrets
- Replicate Cog docs — packaging models for Replicate
- RunPod serverless docs — handler API, deployment
- Inference Serving article — the theory side
What we did and didn’t do
What we did:
- Picked four real deploy targets, rated them honestly on cost / cold-start / ops effort
- Wrote the Modal config in detail (most readers’ starting point)
- Sketched the others (Replicate, RunPod, VPS) so the choice is informed
- A deploy-checklist that applies to every target
What we didn’t:
- Auto-scaling rules. Each platform has its own knobs (concurrency targets, queue depth thresholds). Defaults are sane for most workloads; tune only when you see queueing in your traces.
- Blue-green and canary deploys. Most platforms support traffic-splitting; we kept it out for clarity. Step 13’s eval pipeline + a 5% canary is the pattern; ~50 lines of platform-specific config to wire up.
- CDN for static assets. Your service might serve a frontend; that’s a separate problem with well-known answers (Vercel, Cloudflare Pages). Not LLM-specific.
- Compliance (SOC 2, HIPAA, etc.). Real concerns at scale; not in scope for getting a first product live.
Next
Step 16 is the wrap — where to go from here. We’ve covered the foundations (steps 0–5), the building blocks (6–11), and the production layer (12–15). What’s left? LoRA fine-tuning, distillation, multi-tenant serving, post-training (RLHF / DPO), and the agentic patterns we didn’t have time for. Step 16 lays out the natural next steps with honest takes on which are worth the time investment for which goals.