step 03 · ship · foundations
Switch to vLLM
The production-grade inference engine. Same OpenAI schema, ~3× the throughput, proper continuous batching, paged KV cache.
Ollama got you tokens fast. It’s the right tool for “is the model running?” and for low-throughput single-user development. It’s the wrong tool for serving multiple concurrent users — which any production deployment does.
This step swaps the inference engine to vLLM, the de-facto production standard. Same model, same schema, roughly the same speed for a single user, and roughly 3× the throughput once requests arrive concurrently. Then we’ll measure the difference.
Why vLLM
The engine difference comes down to two ideas, both of which Ollama (via llama.cpp) handles less well:
Continuous batching
Naive serving processes requests one at a time: receive request, generate all tokens, return, take next request. Static batching groups requests of similar length to share the matmul cost — but you wait for the slowest in the batch to finish before starting the next batch.
Continuous batching (Yu et al. 2022) interleaves at the token level. Each request slot can finish independently, freed slots immediately accept new requests, and at any moment you’re processing N concurrent generation streams in one matmul. This is the single biggest throughput optimization in modern LLM serving.
For 32 concurrent users hitting your service, that works out to roughly 5× the tokens/sec of serial processing and roughly 3× that of static batching.
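If the scheduling distinction feels abstract, here is a toy simulation of the bookkeeping (not vLLM's scheduler, just the idea): every step decodes one token for each occupied slot, and the two strategies differ only in when freed slots get refilled. The request lengths and slot count are made up for illustration.
# scratch/batching_toy.py — toy model of static vs. continuous batching.
# One "step" decodes one token for every occupied slot; numbers are illustrative.
import random

random.seed(0)
LENGTHS = [random.randint(10, 200) for _ in range(64)]  # output tokens per request
SLOTS = 8                                               # concurrent generation slots

def static_batching(lengths, slots):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching(lengths, slots):
    """Freed slots immediately pull the next waiting request."""
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))            # refill freed slots
        active = [n - 1 for n in active if n > 1]  # every slot decodes one token; finished requests leave
        steps += 1
    return steps

total = sum(LENGTHS)
for name, fn in [("static", static_batching), ("continuous", continuous_batching)]:
    steps = fn(LENGTHS, SLOTS)
    print(f"{name:>10}: {steps} steps -> {total / steps:.2f} tokens generated per step")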
PagedAttention
The KV cache (we covered it in /build step 14) stores keys and values for every token of context, across every layer: for a 70B model that is on the order of hundreds of kilobytes to a few megabytes per token, depending on precision and attention variant. With 32 concurrent users at 4K context each, that’s a lot of memory. Naive serving allocates a contiguous slab per request, sized for the worst-case context, which wastes most of it for short requests.
PagedAttention (Kwon et al. 2023) breaks the KV cache into fixed-size pages (typically 16 tokens), allocated on demand. It’s the same trick virtual memory uses for RAM. Net effect: ~2× more concurrent requests fit in the same GPU.
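Rough arithmetic makes the waste concrete. The dimensions below are assumptions for the Llama-3.1-8B we're about to serve (32 layers, 8 KV heads via GQA, head dim 128, fp16); swap in your own model's config.
# scratch/kv_math.py — back-of-envelope KV cache arithmetic (assumed Llama-3.1-8B dims).
import math

LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # K and V for every layer
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KB")

MAX_CTX = 8192      # --max-model-len
PAGE = 16           # PagedAttention block size, in tokens
USERS = 32
actual_ctx = 512    # a typical short request

naive = USERS * MAX_CTX * bytes_per_token                               # contiguous worst-case slab per request
paged = USERS * math.ceil(actual_ctx / PAGE) * PAGE * bytes_per_token   # pages allocated on demand
print(f"naive worst-case allocation: {naive / 2**30:.1f} GiB")
print(f"paged, 512-token requests:   {paged / 2**30:.1f} GiB")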
vLLM was the first engine to ship both. Ollama (and llama.cpp underneath) does some batching but not continuous; the gap shows up when you have multiple users.
Install vLLM
Three paths. Docker is the easiest if you have a Linux box with an NVIDIA GPU.
Path 1: Docker (Linux/WSL2 with NVIDIA GPU)
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.85
That single command:
- Pulls the vLLM image
- Mounts your HuggingFace cache (so the model downloads once, persists)
- Exposes port 8000
- Loads Llama-3.1-8B with an 8K context cap (default is the model’s full 128K, which uses way more memory than you need)
- Caps GPU memory at 85% utilization (leaves headroom for OS / other processes)
First boot takes 1–10 minutes depending on whether you’ve cached the weights. You’ll see logs ending in:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
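Once you see those lines, the quickest liveness check is vLLM's bare health endpoint, served alongside the OpenAI routes; it returns an empty 200 once the model is loaded:
curl -i http://localhost:8000/health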
Path 2: pip install (Linux + CUDA)
uv pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.85 \
--port 8000
Same flags, no Docker. Faster to iterate on if you’re tweaking config.
Path 3: macOS / no GPU
vLLM on Apple Silicon is experimental and slow. For laptop development, stay on Ollama — the OpenAI schema means application code doesn’t change. When you deploy (step 15), you’ll deploy to a Linux GPU instance and use vLLM there.
If you really want vLLM locally on macOS for testing, the vllm-openai Docker image does run on Apple Silicon (slowly, CPU-only) — a 7B model gets ~3 tokens/sec, which is fine for sanity-checking schema compatibility but useless for real load.
Sanity check
vLLM exposes the same /v1/chat/completions endpoint Ollama did. Same curl, just a different port:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Reply with the word OK."}]
}'
Note the model field uses the full HuggingFace identifier, not the short Ollama tag. Otherwise byte-identical to step 02.
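If you’re not sure which identifier the server registered, the OpenAI-compatible /v1/models endpoint lists it; the id field in the response is the exact string to put in the model field:
curl http://localhost:8000/v1/models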
Expected response shape:
{
"id": "chatcmpl-...",
"object": "chat.completion",
"model": "meta-llama/Llama-3.1-8B-Instruct",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "OK"},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 14, "completion_tokens": 1, "total_tokens": 15}
}
That’s the OpenAI schema, identical to what Ollama returned. Your application code from step 02 doesn’t care which engine produced the bytes.
Update the Python client
The only change to stack/ollama_client.py is the URL. Make a copy and re-point:
# stack/vllm_client.py
import stack.ollama_client as oc
from stack.ollama_client import chat as _chat, stream as _stream

VLLM_URL = "http://localhost:8000/v1/chat/completions"
DEFAULT_MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def _patched(fn, messages, model, **kwargs):
    # Monkey-patch the URL by editing the module global. Ugly but
    # keeps step 02's code unchanged. Cleaner refactor in step 05.
    original = oc.OLLAMA_URL
    oc.OLLAMA_URL = VLLM_URL
    try:
        return fn(messages, model=model, **kwargs)
    finally:
        oc.OLLAMA_URL = original

# Bind the URL + model defaults; otherwise reuse the existing functions.
def chat(messages, model=DEFAULT_MODEL, **kwargs):
    return _patched(_chat, messages, model, **kwargs)

def stream(messages, model=DEFAULT_MODEL, **kwargs):
    # stream() from step 02 is (presumably) a generator, so keep the URL
    # patched while the caller consumes chunks, not just while it's created.
    original = oc.OLLAMA_URL
    oc.OLLAMA_URL = VLLM_URL
    try:
        yield from _stream(messages, model=model, **kwargs)
    finally:
        oc.OLLAMA_URL = original
This is intentionally ugly — we’re going to refactor it cleanly in step 05 when we wrap everything in our own FastAPI service. For now, the point is: the client logic is unchanged. The URL swap is the only thing that matters.
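Call sites look exactly like step 02's. A quick sketch of a smoke test (the file name is hypothetical, and it assumes step 02's chat() returns the reply text and stream() yields text chunks, which is how we use them above):
# scratch/try_vllm_client.py — hypothetical smoke test for the wrapper above
from stack.vllm_client import chat, stream

print(chat([{"role": "user", "content": "Reply with the word OK."}]))

for chunk in stream([{"role": "user", "content": "Count to five."}]):
    print(chunk, end="", flush=True)
print()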
A nicer refactor (which we’ll formalize in step 05) factors out the URL:
# stack/llm.py — the cleaner version we'll commit in step 05
from typing import Iterator
import json
import httpx

class LLM:
    def __init__(self, base_url: str, model: str):
        self.base_url = base_url.rstrip("/")
        self.model = model

    def chat(self, messages, **kwargs) -> str:
        with httpx.Client(timeout=120.0) as client:
            r = client.post(
                f"{self.base_url}/chat/completions",
                json={"model": self.model, "messages": messages, "stream": False, **kwargs},
            )
            r.raise_for_status()
            return r.json()["choices"][0]["message"]["content"]

    def stream(self, messages, **kwargs) -> Iterator[str]:
        with httpx.Client(timeout=None) as client:
            with client.stream(
                "POST",
                f"{self.base_url}/chat/completions",
                json={"model": self.model, "messages": messages, "stream": True, **kwargs},
            ) as r:
                r.raise_for_status()
                for line in r.iter_lines():
                    if not line.startswith("data: "):
                        continue
                    data = line[6:]
                    if data == "[DONE]":
                        break
                    delta = json.loads(data)["choices"][0].get("delta", {})
                    if "content" in delta:
                        yield delta["content"]

# Two configs, one client class.
ollama = LLM("http://localhost:11434/v1", "llama3.1:8b")
vllm = LLM("http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct")
Use whichever in your code. We’ll commit this version in step 05 and standardize.
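Usage then looks the same against either engine (assuming stack/llm.py exports the two instances defined in the sketch above):
from stack.llm import ollama, vllm

# Same call sites, different engine behind each instance.
print(vllm.chat([{"role": "user", "content": "One sentence on PagedAttention."}]))

for chunk in ollama.stream([{"role": "user", "content": "Count to three."}]):
    print(chunk, end="", flush=True)
print()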
Throughput comparison
This is the moment the swap pays off. Run a small concurrency test against each engine, with both running simultaneously (Ollama on 11434, vLLM on 8000):
# scratch/bench.py
import asyncio
import time
import httpx

PROMPTS = [
    "Write a haiku about morning coffee.",
    "Explain photosynthesis in one sentence.",
    "What's 2 + 2?",
    "List five blue things.",
    "Translate 'hello' to French and Spanish.",
    "Why is the sky blue?",
    "Write the first line of a mystery novel.",
    "What's the capital of Australia?",
] * 4  # 32 concurrent prompts

async def hit(client, base_url, model, prompt):
    r = await client.post(
        f"{base_url}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 80,
        },
        timeout=120.0,
    )
    return r.json()["usage"]["completion_tokens"]

async def benchmark(label, base_url, model):
    print(f"\n── {label} ──")
    async with httpx.AsyncClient() as client:
        t0 = time.time()
        results = await asyncio.gather(*[
            hit(client, base_url, model, p) for p in PROMPTS
        ])
        elapsed = time.time() - t0
        total_tokens = sum(results)
        print(f" {len(PROMPTS)} concurrent requests")
        print(f" total tokens: {total_tokens}")
        print(f" wall time: {elapsed:.2f}s")
        print(f" throughput: {total_tokens / elapsed:.1f} tok/s")

async def main():
    await benchmark("Ollama", "http://localhost:11434/v1", "llama3.1:8b")
    await benchmark("vLLM", "http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct")

if __name__ == "__main__":
    asyncio.run(main())
Run it (with both servers running):
uv run python scratch/bench.py
Approximate output on a single A100, 32 concurrent requests:
── Ollama ──
32 concurrent requests
total tokens: 2034
wall time: 32.4s
throughput: 62.8 tok/s
── vLLM ──
32 concurrent requests
total tokens: 2068
wall time: 11.7s
throughput: 176.8 tok/s
Roughly 3× the throughput at 32-way concurrency, and the gap widens further at higher concurrency. On a single request the gap is much smaller: vLLM is maybe 1.2× faster than Ollama, because both engines are then processing one stream at a time.
The throughput multiplier scales with concurrency. That’s the production point: vLLM doesn’t make a single user faster, it makes lots of users not slow each other down.
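To check the single-request side of that claim yourself, a quick sequential variant (a sketch that reuses hit() from bench.py; it assumes bench.py sits next to it) times one request at a time against each engine:
# scratch/bench_single.py — sequential latency check, reusing hit() from bench.py
import asyncio
import time
import httpx
from bench import hit  # assumes this file lives next to scratch/bench.py

async def single(label, base_url, model, prompt="Why is the sky blue?"):
    async with httpx.AsyncClient() as client:
        t0 = time.time()
        tokens = await hit(client, base_url, model, prompt)
        dt = time.time() - t0
        print(f"{label:8s} 1 request: {dt:.2f}s, {tokens / dt:.1f} tok/s")

async def main():
    await single("Ollama", "http://localhost:11434/v1", "llama3.1:8b")
    await single("vLLM", "http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct")

if __name__ == "__main__":
    asyncio.run(main())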
Memory: vLLM loads eagerly
A subtle difference in operational behavior: Ollama loads a model lazily on the first request and unloads it after OLLAMA_KEEP_ALIVE (default 5 min) of idle. vLLM loads at startup and stays loaded until you stop the process.
Practical implications:
- vLLM holds GPU memory continuously. The 8B model’s weights alone are roughly 16 GB in bf16, and vLLM pre-allocates KV cache space on top of that, up to the --gpu-memory-utilization cap, whether or not requests are coming in.
- First-request latency is much better. There is no cold start: every request sees warm-model latency.
- Restarting is slower. Re-loading the model from disk on container restart adds 30s–2min depending on hardware.
For production, you want eager loading. For laptop development with one app at a time, lazy loading is more convenient. Use Ollama on your laptop, vLLM on your server — same client code either way.
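One way to make that switch a config concern rather than a code change is to pick the base URL and model from the environment. A sketch using the LLM class from above; the LLM_BASE_URL and LLM_MODEL variable names are made up for this example, not a vLLM or Ollama convention:
# stack/config.py — sketch: choose the engine via environment variables.
import os
from stack.llm import LLM

llm = LLM(
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),  # laptop default: Ollama
    model=os.environ.get("LLM_MODEL", "llama3.1:8b"),
)
# On the server:
#   LLM_BASE_URL=http://localhost:8000/v1 LLM_MODEL=meta-llama/Llama-3.1-8B-Instruct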
Cross-references
- KV Cache demo — toggles caching on and off, shows the per-token compute curve. PagedAttention is what makes the cache fit at scale.
- Cost & Latency Calculator demo — the throughput numbers you just measured slot directly into the calculator’s “self-hosted” mode
- Inference Pipeline demo — the forward pass we instrumented in /build step 14 (KV cache + ONNX export) is exactly what vLLM is running, just at production scale
What we did and didn’t do
What we did:
- Stood up vLLM serving the same model Ollama was, on a different port
- Verified OpenAI-schema compatibility (existing client code unchanged)
- Measured ~3× throughput improvement at 32-concurrency
- Sketched the cleaner LLM client class we’ll formalize in step 05
What we didn’t:
- Configure speculative decoding. vLLM supports it via --speculative-model draft-model-7b. It can add a ~2× latency improvement on top of the throughput gains. Worth it in production; orthogonal to the basics here.
- Tune --gpu-memory-utilization. We picked 0.85; sometimes 0.95 squeezes in more KV cache and helps. Tune empirically with your actual workload.
- Multi-GPU tensor parallelism. --tensor-parallel-size 2 shards a model across 2 GPUs. Necessary for 70B+ on consumer hardware. Out of scope for foundations.
- Run vLLM behind a load balancer. For high availability you’d run multiple vLLM instances behind nginx or an API gateway. That’s deployment-tier (step 15), not foundations.
- Use TGI, TensorRT-LLM, or SGLang. All viable alternatives to vLLM. TensorRT-LLM is faster on NVIDIA hardware specifically; SGLang has interesting structured-generation features. Pick vLLM for breadth of model support; switch later if profiling demands it.
Next
Step 04 is evaluation. We’ve got a model running fast. Now we measure: is it any good? lm-eval-harness for academic benchmarks (MMLU, HellaSwag, GSM8K), plus a custom task-specific eval you write yourself. Both run against either Ollama or vLLM via the OpenAI schema you already have set up.