production-stack · production · 15 / 17 · 28 min read · 45 min hands-on

step 15 · ship · production

Deploy it for real

Modal vs Replicate vs RunPod serverless vs a GPU VPS. Trade-offs, configs, and the deploy command.

deployment · production

You’ve got a service that works on your laptop. You’ve got observability, evals, and tuned cost/latency. The only thing between you and a real deployment is picking a place to put it. The choice depends on your traffic shape and your tolerance for ops.

The four-way choice

There’s no “best” — only “best for your shape.” Here’s what actually matters:

| Target | Cold start | Always-on cost | Scale-to-zero | GPU options | Ops effort | Best when |
|---|---|---|---|---|---|---|
| Modal | ~30 s | $0 idle | yes | T4, A10, A100, H100 | minimal | bursty traffic; weekend/personal projects; rapid iteration |
| Replicate | ~30 s | $0 idle | yes | T4, A40, A100 | minimal | shipping a public demo; you want a hosted UI in 5 minutes |
| RunPod (serverless) | ~30 s | $0 idle | yes | wide range, cheap | low | cost-sensitive; willing to tinker; need cheap A100 hours |
| VPS + GPU | none | $200–800/mo | no | one box, your choice | medium | steady traffic; predictable spend; control freaks (in a good way) |
| (reference) AWS/GCP | none | $1500+/mo | possible | full menu | high | enterprise scale; existing AWS/GCP shop; have an SRE team |

A few things this table makes clear that aren’t in most “deploy your LLM” posts:

  • Cold start is real. 30 seconds from a sleep state to a warm GPU. Tolerable for a webhook or a batch job, brutal for a chatbot. Mitigations come up below.
  • Always-on GPU is expensive. A single A10 ($0.30–0.60/hr) is $200–400/mo if it’s always running. An A100 is $1500+/mo. Sticker shock catches people late.
  • AWS / GCP only makes sense at scale. If you’re solo or a small team, the ops effort dwarfs the platform’s value. We won’t cover them in this article — pick Modal or RunPod, ship faster.

We’ll work through Modal in detail (it’s what most readers should pick) and sketch the others.

Path A — Modal

Modal lets you deploy Python code to GPUs with a decorator. The mental model: “I have a function. Run it on an A10. Stay warm for 5 minutes after the last call. Scale to zero.” A few lines of config; the deploy command is modal deploy.

The deploy file

# deploy/modal_app.py
import modal

# Mount your repo into the container.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install_from_pyproject("pyproject.toml")
    .add_local_dir("stack", remote_path="/app/stack")
    .add_local_dir("evals", remote_path="/app/evals")
)

# A persistent volume for the model weights — downloaded once, reused.
volume = modal.Volume.from_name("models", create_if_missing=True)

app = modal.App("stack")


@app.function(
    image=image,
    gpu="A10G",
    volumes={"/cache": volume},
    secrets=[modal.Secret.from_name("stack-secrets")],
    timeout=600,
    # Stay warm 5 minutes after the last request; scale to zero after.
    container_idle_timeout=300,
    # One container always warm; each handles up to 8 in-flight requests.
    keep_warm=1,
    allow_concurrent_inputs=8,
)
@modal.asgi_app()
def fastapi_app():
    """Serve stack/server.py:app as an ASGI endpoint."""
    import os
    os.environ["HF_HOME"] = "/cache/hf"
    os.environ["LLM_BACKEND"] = "vllm"
    # Import inside the container, after HF_HOME / LLM_BACKEND are set.
    from stack.server import app
    return app

A few things worth pointing out:

  • gpu="A10G" — Modal’s cheapest reasonable LLM GPU, roughly $0.45/hr. Bump to an A100 with gpu="A100"; you can also do gpu=modal.gpu.A100(count=2) for multi-GPU.
  • volumes={"/cache": volume} — model weights download once to the persistent volume; subsequent cold starts mount it instead of re-downloading, cutting cold starts from minutes to seconds. A one-off pre-download sketch follows this list.
  • keep_warm=1 — one container is always running, so the first request after a quiet period is fast. Costs ~$300/mo at A10 prices for true 24/7 warm. Set it to 0 for pure scale-to-zero (cheapest), 1 for a fast first request (still cheap), 5 for high traffic.
  • allow_concurrent_inputs=8 — vLLM can handle multiple in-flight requests via continuous batching; tell Modal to send up to 8 at once instead of spinning up a new container per request. Throughput up, cost down.
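One cold-start mitigation worth wiring up early is pre-populating the weights volume before the first real request. The sketch below is not part of the article's repo: the download_weights function is a hypothetical addition to the same deploy file, and it assumes huggingface_hub is importable in the image.

# deploy/modal_app.py (optional): warm the weights volume once, up front.
@app.function(
    image=image,
    volumes={"/cache": volume},
    secrets=[modal.Secret.from_name("stack-secrets")],  # provides HF_TOKEN
    timeout=1800,
)
def download_weights():
    import os
    os.environ["HF_HOME"] = "/cache/hf"   # set before importing huggingface_hub
    from huggingface_hub import snapshot_download
    # Downloads into the mounted volume; later cold starts just read it.
    snapshot_download("meta-llama/Llama-3.1-8B-Instruct")
    volume.commit()                        # persist writes to the shared Volume

# Run it once from your laptop:
#   modal run deploy/modal_app.py::download_weights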

The startup hook

vLLM is heavy to load. Tell Modal to load it once per container, not per request:

# deploy/modal_app.py (continued)
@app.cls(
    image=image,
    gpu="A10G",
    volumes={"/cache": volume},
    container_idle_timeout=300,
    allow_concurrent_inputs=8,
)
class StackService:

    @modal.enter()
    def load_model(self):
        """Run once when the container starts."""
        import subprocess, os
        os.environ["HF_HOME"] = "/cache/hf"
        # Boot vLLM as a subprocess; it'll bind to localhost:8000.
        self.vllm = subprocess.Popen([
            "vllm", "serve",
            "meta-llama/Llama-3.1-8B-Instruct",
            "--gpu-memory-utilization", "0.85",
            "--enable-prefix-caching",
        ])
        # Wait for vLLM to be ready (poll its /health endpoint).
        import time, httpx
        for _ in range(60):
            try:
                if httpx.get("http://localhost:8000/health").status_code == 200:
                    break
            except httpx.ConnectError:
                pass
            time.sleep(2)

    @modal.method()
    def chat(self, messages: list, **kwargs) -> dict:
        from stack.llm import LLM, VLLM_CONFIG
        llm = LLM(VLLM_CONFIG)
        return llm.chat(messages, **kwargs)

For most readers, start with the simpler @modal.asgi_app() pattern from earlier and let vLLM run inside Modal’s web container. The class-based pattern is for when you want fine-grained startup control.
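If you do go the class route, the deployed class can be called from any Python process, not just over HTTP. A minimal sketch, assuming Modal's documented Cls.lookup client API and the names used above (newer Modal releases spell this Cls.from_name; check your version):

# call_stack.py: invoke the deployed class from a laptop or another service.
import modal

StackService = modal.Cls.lookup("stack", "StackService")   # app name, class name
svc = StackService()
reply = svc.chat.remote([{"role": "user", "content": "Hello!"}])
print(reply)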

Deploying

# One-time
modal token new

# Set your secrets
modal secret create stack-secrets \
  STACK_API_KEYS=prod-key-1,prod-key-2 \
  HF_TOKEN=hf_xxxxx

# Deploy
modal deploy deploy/modal_app.py

# Output:
# View Deployment: https://modal.com/apps/your-username/stack
# Web URL: https://your-username--stack-fastapi-app.modal.run

Hit the URL:

curl https://your-username--stack-fastapi-app.modal.run/v1/chat/completions \
  -H "Authorization: Bearer prod-key-1" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

First call: 30+ seconds (cold start). Second call within 5 minutes: 1–2 seconds. You’re live.
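If you want to see the cold/warm gap for yourself rather than take the numbers on faith, a throwaway timing script does it; plain httpx against the URL and key from the deploy output above:

# time_deploy.py: measure cold vs warm latency against the deployed endpoint.
import time
import httpx

URL = "https://your-username--stack-fastapi-app.modal.run/v1/chat/completions"
HEADERS = {"Authorization": "Bearer prod-key-1", "Content-Type": "application/json"}
BODY = {"messages": [{"role": "user", "content": "ping"}]}

for label in ("cold", "warm"):
    t0 = time.time()
    r = httpx.post(URL, headers=HEADERS, json=BODY, timeout=120.0)
    print(f"{label}: {r.status_code} in {time.time() - t0:.1f}s")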

Path B — Replicate

Replicate is closer to a “model-as-a-product” platform. You package your inference code in a Cog config, push it, and Replicate hosts it with a built-in playground UI. Best for shipping a public demo where you want users to be able to interact without API keys.

# cog.yaml
build:
  gpu: true
  cuda: "12.1"
  python_version: "3.11"
  python_packages:
    - "vllm==0.6.0"
    - "fastapi==0.115.0"
    - "uvicorn==0.30.0"
    - "httpx==0.27.0"
predict: "predict.py:Predictor"
# predict.py
from cog import BasePredictor, Input
import subprocess, time, httpx


class Predictor(BasePredictor):
    def setup(self):
        self.proc = subprocess.Popen([
            "vllm", "serve",
            "meta-llama/Llama-3.1-8B-Instruct",
            "--gpu-memory-utilization", "0.85",
        ])
        for _ in range(60):
            try:
                if httpx.get("http://localhost:8000/health").status_code == 200:
                    break
            except httpx.ConnectError:
                pass
            time.sleep(2)

    def predict(self, prompt: str = Input(description="User question")) -> str:
        r = httpx.post(
            "http://localhost:8000/v1/chat/completions",
            json={
                "model": "meta-llama/Llama-3.1-8B-Instruct",
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.0,
            },
            timeout=60,
        )
        return r.json()["choices"][0]["message"]["content"]

cog login
cog push r8.im/your-username/stack

You get a URL with a built-in UI, an API endpoint, and per-second billing. The trade-off vs Modal: less control over the API shape (Replicate wants single-input, single-output predict() calls), and the UI is polished but rigid. Pick Replicate for public demos; pick Modal for production APIs.
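Calling the pushed model from code is one import. A sketch assuming the official replicate Python client and the model slug from the push above; Replicate may require pinning a specific version hash for your own models:

# REPLICATE_API_TOKEN must be set in the environment.
import replicate

output = replicate.run(
    "your-username/stack",          # or "your-username/stack:<version-hash>"
    input={"prompt": "Hello!"},
)
print(output)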

Path C — RunPod serverless

RunPod is the cheapest of the scale-to-zero platforms; A100 hours often cost about half of what Modal charges. The trade-off: more setup, less polished tooling, a smaller community.

The mental model: you build a Docker image, push it to a registry, and register it as a “serverless endpoint”; RunPod auto-scales workers, and clients hit a single /runsync endpoint that takes a JSON payload and returns a JSON response.

# deploy/Dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
WORKDIR /app
RUN apt-get update && apt-get install -y python3-pip
COPY . /app
RUN pip install -r requirements.txt
CMD ["python", "-m", "stack.runpod_handler"]
# stack/runpod_handler.py
import runpod
from stack.llm import LLM


llm = LLM()  # Instantiated once per worker.


def handler(event):
    body = event["input"]
    response = llm.chat(
        messages=body["messages"],
        temperature=body.get("temperature", 0.7),
    )
    return {"response": response}


runpod.serverless.start({"handler": handler})

Push the image, point a RunPod serverless endpoint at it, set min_workers=0, max_workers=4. You get a /runsync URL and per-second GPU billing.
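Calling it looks like the sketch below, assuming RunPod's documented runsync request shape; the endpoint ID and API key are placeholders from the RunPod console:

# The handler above returns {"response": ...}; runsync wraps it under "output".
import httpx

ENDPOINT_ID = "your-endpoint-id"
API_KEY = "rp_xxxxx"

r = httpx.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"messages": [{"role": "user", "content": "Hello!"}]}},
    timeout=120.0,
)
print(r.json()["output"]["response"])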

Path D — Just use a VPS

Sometimes the right answer is “rent a box and put your service on it.” The math:

  • A single H100 spot instance on Lambda Cloud: ~$2.50/hr, $1800/mo always-on.
  • A single A100 on Lambda: ~$1.20/hr, $864/mo.
  • A single A10 on a small cloud (Hetzner, OVH, Latitude): ~$0.40/hr, $300/mo.

If you have steady traffic that keeps a GPU above roughly 30% utilization around the clock, a VPS is cheaper than scale-to-zero. The “Modal tax” is a convenience premium; if you don’t need the convenience, skip the premium.
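The break-even is simple arithmetic. A sketch with placeholder rates (the answer is very sensitive to them, so substitute real quotes from your providers); the ~30% rule of thumb falls out of the first set of numbers:

# Break-even utilization: the point where flat monthly rent beats per-hour billing.
def breakeven_utilization(serverless_per_gpu_hr: float, vps_per_month: float,
                          hours_per_month: float = 730.0) -> float:
    """Fraction of the month the GPU must be busy before the VPS is cheaper."""
    return vps_per_month / (serverless_per_gpu_hr * hours_per_month)

# Placeholder rates for an A10-class GPU; plug in your own quotes.
print(f"{breakeven_utilization(1.35, 300.0):.0%}")   # ~30%
print(f"{breakeven_utilization(0.60, 300.0):.0%}")   # ~68%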

# On a fresh Ubuntu 22.04 + NVIDIA driver box:
git clone your-repo.git
cd your-repo
docker compose up -d

# docker-compose.yml runs vllm + your fastapi server + nginx + certbot.

That’s the whole deploy story for a VPS. You’ll also want:

  • A systemd unit to restart the service on crash.
  • nginx + certbot to terminate TLS at a real domain.
  • A backup of your sqlite-vec store so retrieval data isn’t lost on a disk failure.
  • Fail2ban or Cloudflare to absorb script-kiddie traffic.

None of this is hard. Most teams that go VPS appreciate that they’re back in control of the box. The downside: you’re now running a server, with all the 2-a.m. surprises that implies.

What to do for every deploy target

A short checklist that applies regardless of where you put the service:

  1. Pin your model version. “meta-llama/Llama-3.1-8B-Instruct” is fine; pinning a specific revision (the repo’s commit SHA) is better. It avoids surprises when the upstream repo pushes a tokenizer update.
  2. Pin your dependencies. Generate a lock file from your pyproject.toml (uv lock, poetry lock, or pip-compile). Reproducibility is a deploy concern.
  3. Health endpoint that hits the model. A /healthz that runs a 1-token generation (a sketch follows this list). It catches “service is up but model is broken” failures that shallow health checks miss.
  4. Smoke-test in CI before deploying. The eval suite from step 13 runs against the new container, not the local dev one. Catch regressions before they hit users.
  5. Trace from minute one. The Phoenix endpoint env var goes in your secret store; your service emits production traces from the first deploy. Don’t ship blind.
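For item 3, a minimal sketch of a model-exercising health check. It assumes the article's stack.llm.LLM client and that its chat() passes max_tokens through to the backend; the endpoint name and one-token probe are illustrative:

# stack/server.py (sketch): /healthz that exercises the model, not just the process.
from fastapi import FastAPI, Response
from stack.llm import LLM

app = FastAPI()
llm = LLM()   # the same backend client the rest of the service uses

@app.get("/healthz")
def healthz():
    try:
        # One-token generation: proves the GPU, weights, and backend all respond.
        llm.chat([{"role": "user", "content": "ping"}], max_tokens=1)
        return {"status": "ok"}
    except Exception:
        return Response(status_code=503)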

What we’re not covering

Two things that come up that we’re skipping for good reason:

  • Kubernetes for inference. Yes, you can run vLLM in a K8s cluster. Almost no one should at the scale you’re at. The ops cost outweighs the win. Kubernetes is the right answer at 50+ GPUs and a dedicated infra team; it’s the wrong answer at 1–4 GPUs and a small team. If you’re not sure, don’t.
  • Multi-region deploys. Users far from your single GPU region see extra latency. Solvable, but every cross-region replication scheme adds complexity. Most products are fine with one region until they have 100K+ DAU. Defer.


What we did and didn’t do

What we did:

  • Picked four real deploy targets, rated them honestly on cost / cold-start / ops effort
  • Wrote the Modal config in detail (most readers’ starting point)
  • Sketched the others (Replicate, RunPod, VPS) so the choice is informed
  • Gave a deploy checklist that applies to every target

What we didn’t:

  • Auto-scaling rules. Each platform has its own knobs (concurrency targets, queue depth thresholds). Defaults are sane for most workloads; tune only when you see queueing in your traces.
  • Blue-green and canary deploys. Most platforms support traffic-splitting; we kept it out for clarity. Step 13’s eval pipeline + a 5% canary is the pattern; ~50 lines of platform-specific config to wire up.
  • CDN for static assets. Your service might serve a frontend; that’s a separate problem with well-known answers (Vercel, Cloudflare Pages). Not LLM-specific.
  • Compliance (SOC 2, HIPAA, etc.). Real concerns at scale; not in scope for getting a first product live.

Next

Step 16 is the wrap — where to go from here. We’ve covered the foundations (steps 0–5), the building blocks (6–11), and the production layer (12–15). What’s left? LoRA fine-tuning, distillation, multi-tenant serving, post-training (RLHF / DPO), and the agentic patterns we didn’t have time for. Step 16 lays out the natural next steps with honest takes on which are worth the time investment for which goals.