step 03 · ship · foundations
Switch to vLLM
The production-grade inference engine. Same OpenAI schema, ~3× the throughput, proper continuous batching, paged KV cache.
Ollama got you tokens fast. It’s the right tool for “is the model running?” and for low-throughput single-user development. It’s the wrong tool for serving multiple concurrent users — which any production deployment does.
This step swaps the inference engine to vLLM, the de-facto production standard. Same model, same schema, roughly the same speed for a single user, and roughly 3× the throughput once requests arrive concurrently. Then we’ll measure the difference.
Why vLLM
The engine difference comes down to two ideas, both of which Ollama (via llama.cpp) handles less well:
Continuous batching
Naive serving processes requests one at a time: receive request, generate all tokens, return, take next request. Static batching groups requests of similar length to share the matmul cost — but you wait for the slowest in the batch to finish before starting the next batch.
Continuous batching (Yu et al. 2022) interleaves at the token level. Each request slot can finish independently, freed slots immediately accept new requests, and at any moment you’re processing N concurrent generation streams in one matmul. This is the single biggest throughput optimization in modern LLM serving.
For 32 concurrent users hitting your service, that works out to roughly 5× the tokens/sec of serial processing and roughly 3× that of static batching.
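If the scheduling distinction feels abstract, here is a toy simulation of the bookkeeping (not vLLM's scheduler, just the idea): every step decodes one token for each occupied slot, and the two strategies differ only in when freed slots get refilled. The request lengths and slot count are made up for illustration.
# scratch/batching_toy.py — toy model of static vs. continuous batching.
# One "step" decodes one token for every occupied slot; numbers are illustrative.
import random

random.seed(0)
LENGTHS = [random.randint(10, 200) for _ in range(64)]  # output tokens per request
SLOTS = 8                                               # concurrent generation slots

def static_batching(lengths, slots):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching(lengths, slots):
    """Freed slots immediately pull the next waiting request."""
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))            # refill freed slots
        active = [n - 1 for n in active if n > 1]  # every slot decodes one token; finished requests leave
        steps += 1
    return steps

total = sum(LENGTHS)
for name, fn in [("static", static_batching), ("continuous", continuous_batching)]:
    steps = fn(LENGTHS, SLOTS)
    print(f"{name:>10}: {steps} steps -> {total / steps:.2f} tokens generated per step")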
PagedAttention
The KV cache (we covered it in /build step 14) stores keys and values for every token of context, across every layer: for a 70B model that is on the order of hundreds of kilobytes to a few megabytes per token, depending on precision and attention variant. With 32 concurrent users at 4K context each, that’s a lot of memory. Naive serving allocates a contiguous slab per request, sized for the worst-case context, which wastes most of it for short requests.
PagedAttention (Kwon et al. 2023) breaks the KV cache into fixed-size pages (typically 16 tokens), allocated on demand. It’s the same trick virtual memory uses for RAM. Net effect: ~2× more concurrent requests fit in the same GPU.
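Rough arithmetic makes the waste concrete. The dimensions below are assumptions for the Llama-3.1-8B we're about to serve (32 layers, 8 KV heads via GQA, head dim 128, fp16); swap in your own model's config.
# scratch/kv_math.py — back-of-envelope KV cache arithmetic (assumed Llama-3.1-8B dims).
import math

LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # K and V for every layer
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KB")

MAX_CTX = 8192      # --max-model-len
PAGE = 16           # PagedAttention block size, in tokens
USERS = 32
actual_ctx = 512    # a typical short request

naive = USERS * MAX_CTX * bytes_per_token                               # contiguous worst-case slab per request
paged = USERS * math.ceil(actual_ctx / PAGE) * PAGE * bytes_per_token   # pages allocated on demand
print(f"naive worst-case allocation: {naive / 2**30:.1f} GiB")
print(f"paged, 512-token requests:   {paged / 2**30:.1f} GiB")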
vLLM was the first engine to ship both. Ollama (and llama.cpp underneath) does some batching but not continuous; the gap shows up when you have multiple users.
Install vLLM
Three paths. Docker is the easiest if you have a Linux box with an NVIDIA GPU.
Path 1: Docker (Linux/WSL2 with NVIDIA GPU)
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.85
That single command:
- Pulls the vLLM image
- Mounts your HuggingFace cache (so the model downloads once, persists)
- Exposes port 8000
- Loads Llama-3.1-8B with an 8K context cap (default is the model’s full 128K, which uses way more memory than you need)
- Caps GPU memory at 85% utilization (leaves headroom for OS / other processes)
First boot takes 1–10 minutes depending on whether you’ve cached the weights. You’ll see logs ending in:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
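Once you see those lines, the quickest liveness check is vLLM's bare health endpoint, served alongside the OpenAI routes; it returns an empty 200 once the model is loaded:
curl -i http://localhost:8000/health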
Path 2: pip install (Linux + CUDA)
uv pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.85 \
--port 8000
Same flags, no Docker. Faster to iterate on if you’re tweaking config.
Path 3: macOS / no GPU
vLLM on Apple Silicon is experimental and slow. For laptop development, stay on Ollama — the OpenAI schema means application code doesn’t change. When you deploy (step 15), you’ll deploy to a Linux GPU instance and use vLLM there.
If you really want vLLM locally on macOS for testing, the vllm-openai Docker image does run on Apple Silicon (slowly, CPU-only) — a 7B model gets ~3 tokens/sec, which is fine for sanity-checking schema compatibility but useless for real load.
Sanity check
vLLM exposes the same /v1/chat/completions endpoint Ollama did. Same curl, just a different port:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Reply with the word OK."}]
}'
Note the model field uses the full HuggingFace identifier, not the short Ollama tag. Otherwise byte-identical to step 02.
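If you’re not sure which identifier the server registered, the OpenAI-compatible /v1/models endpoint lists it; the id field in the response is the exact string to put in the model field:
curl http://localhost:8000/v1/models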
Expected response shape:
{
"id": "chatcmpl-...",
"object": "chat.completion",
"model": "meta-llama/Llama-3.1-8B-Instruct",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "OK"},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 14, "completion_tokens": 1, "total_tokens": 15}
}
That’s the OpenAI schema, identical to what Ollama returned. Your application code from step 02 doesn’t care which engine produced the bytes.
Update the Python client
The only change to stack/ollama_client.py is the URL. Make a copy and re-point:
# stack/vllm_client.py
import stack.ollama_client as oc
from stack.ollama_client import chat as _chat, stream as _stream

VLLM_URL = "http://localhost:8000/v1/chat/completions"
DEFAULT_MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def _patched(fn, messages, model, **kwargs):
    # Monkey-patch the URL by editing the module global. Ugly but
    # keeps step 02's code unchanged. Cleaner refactor in step 05.
    original = oc.OLLAMA_URL
    oc.OLLAMA_URL = VLLM_URL
    try:
        return fn(messages, model=model, **kwargs)
    finally:
        oc.OLLAMA_URL = original

# Bind the URL + model defaults; otherwise reuse the existing functions.
def chat(messages, model=DEFAULT_MODEL, **kwargs):
    return _patched(_chat, messages, model, **kwargs)

def stream(messages, model=DEFAULT_MODEL, **kwargs):
    # stream() from step 02 is (presumably) a generator, so keep the URL
    # patched while the caller consumes chunks, not just while it's created.
    original = oc.OLLAMA_URL
    oc.OLLAMA_URL = VLLM_URL
    try:
        yield from _stream(messages, model=model, **kwargs)
    finally:
        oc.OLLAMA_URL = original
This is intentionally ugly — we’re going to refactor it cleanly in step 05 when we wrap everything in our own FastAPI service. For now, the point is: the client logic is unchanged. The URL swap is the only thing that matters.
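Call sites look exactly like step 02's. A quick sketch of a smoke test (the file name is hypothetical, and it assumes step 02's chat() returns the reply text and stream() yields text chunks, which is how we use them above):
# scratch/try_vllm_client.py — hypothetical smoke test for the wrapper above
from stack.vllm_client import chat, stream

print(chat([{"role": "user", "content": "Reply with the word OK."}]))

for chunk in stream([{"role": "user", "content": "Count to five."}]):
    print(chunk, end="", flush=True)
print()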
A nicer refactor (which we’ll formalize in step 05) factors out the URL:
# stack/llm.py — the cleaner version we'll commit in step 05
from typing import Iterator
import json
import httpx

class LLM:
    def __init__(self, base_url: str, model: str):
        self.base_url = base_url.rstrip("/")
        self.model = model

    def chat(self, messages, **kwargs) -> str:
        with httpx.Client(timeout=120.0) as client:
            r = client.post(
                f"{self.base_url}/chat/completions",
                json={"model": self.model, "messages": messages, "stream": False, **kwargs},
            )
            r.raise_for_status()
            return r.json()["choices"][0]["message"]["content"]

    def stream(self, messages, **kwargs) -> Iterator[str]:
        with httpx.Client(timeout=None) as client:
            with client.stream(
                "POST",
                f"{self.base_url}/chat/completions",
                json={"model": self.model, "messages": messages, "stream": True, **kwargs},
            ) as r:
                r.raise_for_status()
                for line in r.iter_lines():
                    if not line.startswith("data: "):
                        continue
                    data = line[6:]
                    if data == "[DONE]":
                        break
                    delta = json.loads(data)["choices"][0].get("delta", {})
                    if "content" in delta:
                        yield delta["content"]

# Two configs, one client class.
ollama = LLM("http://localhost:11434/v1", "llama3.1:8b")
vllm = LLM("http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct")
Use whichever in your code. We’ll commit this version in step 05 and standardize.
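Usage then looks the same against either engine (assuming stack/llm.py exports the two instances defined in the sketch above):
from stack.llm import ollama, vllm

# Same call sites, different engine behind each instance.
print(vllm.chat([{"role": "user", "content": "One sentence on PagedAttention."}]))

for chunk in ollama.stream([{"role": "user", "content": "Count to three."}]):
    print(chunk, end="", flush=True)
print()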
Throughput comparison
This is the moment the swap pays off. Run a small concurrency test against each engine, with both running simultaneously (Ollama on 11434, vLLM on 8000):
# scratch/bench.py
import asyncio
import time
import httpx

PROMPTS = [
    "Write a haiku about morning coffee.",
    "Explain photosynthesis in one sentence.",
    "What's 2 + 2?",
    "List five blue things.",
    "Translate 'hello' to French and Spanish.",
    "Why is the sky blue?",
    "Write the first line of a mystery novel.",
    "What's the capital of Australia?",
] * 4  # 32 concurrent prompts

async def hit(client, base_url, model, prompt):
    r = await client.post(
        f"{base_url}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 80,
        },
        timeout=120.0,
    )
    return r.json()["usage"]["completion_tokens"]

async def benchmark(label, base_url, model):
    print(f"\n── {label} ──")
    async with httpx.AsyncClient() as client:
        t0 = time.time()
        results = await asyncio.gather(*[
            hit(client, base_url, model, p) for p in PROMPTS
        ])
        elapsed = time.time() - t0
        total_tokens = sum(results)
        print(f" {len(PROMPTS)} concurrent requests")
        print(f" total tokens: {total_tokens}")
        print(f" wall time: {elapsed:.2f}s")
        print(f" throughput: {total_tokens / elapsed:.1f} tok/s")

async def main():
    await benchmark("Ollama", "http://localhost:11434/v1", "llama3.1:8b")
    await benchmark("vLLM", "http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct")

if __name__ == "__main__":
    asyncio.run(main())
Run it (with both servers running):
uv run python scratch/bench.py
Approximate output on a single A100, 32 concurrent requests:
── Ollama ──
32 concurrent requests
total tokens: 2034
wall time: 32.4s
throughput: 62.8 tok/s
── vLLM ──
32 concurrent requests
total tokens: 2068
wall time: 11.7s
throughput: 176.8 tok/s
Roughly 3× the throughput at 32-way concurrency, and the gap widens further at higher concurrency. On a single request the gap is much smaller: vLLM is maybe 1.2× faster than Ollama, because both engines are then processing one stream at a time.
The throughput multiplier scales with concurrency. That’s the production point: vLLM doesn’t make a single user faster, it makes lots of users not slow each other down.
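To check the single-request side of that claim yourself, a quick sequential variant (a sketch that reuses hit() from bench.py; it assumes bench.py sits next to it) times one request at a time against each engine:
# scratch/bench_single.py — sequential latency check, reusing hit() from bench.py
import asyncio
import time
import httpx
from bench import hit  # assumes this file lives next to scratch/bench.py

async def single(label, base_url, model, prompt="Why is the sky blue?"):
    async with httpx.AsyncClient() as client:
        t0 = time.time()
        tokens = await hit(client, base_url, model, prompt)
        dt = time.time() - t0
        print(f"{label:8s} 1 request: {dt:.2f}s, {tokens / dt:.1f} tok/s")

async def main():
    await single("Ollama", "http://localhost:11434/v1", "llama3.1:8b")
    await single("vLLM", "http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct")

if __name__ == "__main__":
    asyncio.run(main())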
Memory: vLLM loads eagerly
A subtle difference in operational behavior: Ollama loads a model lazily on the first request and unloads it after OLLAMA_KEEP_ALIVE (default 5 min) of idle. vLLM loads at startup and stays loaded until you stop the process.
Practical implications:
- vLLM holds GPU memory continuously. The 8B model’s weights alone are roughly 16 GB in bf16, and vLLM pre-allocates KV cache space on top of that, up to the --gpu-memory-utilization cap, whether or not requests are coming in.
- First-request latency is much better. There is no cold start: every request sees warm-model latency.
- Restarting is slower. Re-loading the model from disk on container restart adds 30s–2min depending on hardware.
For production, you want eager loading. For laptop development with one app at a time, lazy loading is more convenient. Use Ollama on your laptop, vLLM on your server — same client code either way.
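One way to make that switch a config concern rather than a code change is to pick the base URL and model from the environment. A sketch using the LLM class from above; the LLM_BASE_URL and LLM_MODEL variable names are made up for this example, not a vLLM or Ollama convention:
# stack/config.py — sketch: choose the engine via environment variables.
import os
from stack.llm import LLM

llm = LLM(
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),  # laptop default: Ollama
    model=os.environ.get("LLM_MODEL", "llama3.1:8b"),
)
# On the server:
#   LLM_BASE_URL=http://localhost:8000/v1 LLM_MODEL=meta-llama/Llama-3.1-8B-Instruct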
Cross-references
- KV Cache demo — toggles caching on and off, shows the per-token compute curve. PagedAttention is what makes the cache fit at scale.
- Cost & Latency Calculator demo — the throughput numbers you just measured slot directly into the calculator’s “self-hosted” mode
- Inference Pipeline demo — the forward pass we instrumented in /build step 14 (KV cache + ONNX export) is exactly what vLLM is running, just at production scale
What we did and didn’t do
What we did:
- Stood up vLLM serving the same model Ollama was, on a different port
- Verified OpenAI-schema compatibility (existing client code unchanged)
- Measured ~3× throughput improvement at 32-concurrency
- Sketched the cleaner LLM client class we’ll formalize in step 05
What we didn’t:
- Configure speculative decoding. vLLM supports it via --speculative-model draft-model-7b. It can add a ~2× latency improvement on top of the throughput gains. Worth it in production; orthogonal to the basics here.
- Tune --gpu-memory-utilization. We picked 0.85; sometimes 0.95 squeezes in more KV cache and helps. Tune empirically with your actual workload.
- Multi-GPU tensor parallelism. --tensor-parallel-size 2 shards a model across 2 GPUs. Necessary for 70B+ on consumer hardware. Out of scope for foundations.
- Run vLLM behind a load balancer. For high availability you’d run multiple vLLM instances behind nginx or an API gateway. That’s deployment-tier (step 15), not foundations.
- Use TGI, TensorRT-LLM, or SGLang. All viable alternatives to vLLM. TensorRT-LLM is faster on NVIDIA hardware specifically; SGLang has interesting structured-generation features. Pick vLLM for breadth of model support; switch later if profiling demands it.
Next
Step 04 is evaluation. We’ve got a model running fast. Now we measure: is it any good? lm-eval-harness for academic benchmarks (MMLU, HellaSwag, GSM8K), plus a custom task-specific eval you write yourself. Both run against either Ollama or vLLM via the OpenAI schema you already have set up.