ship · production · ops
Ship a production LLM stack from open source.
A code-along walkthrough for the 80% of LLM work that isn't pretraining: pick a base model, run it locally with vLLM, build an eval harness, wire RAG and tools, ship it with observability and cost tuning. Every concept cross-references the matching article and demo on this site, so the theory and the visualization are one click away.
Engineers who deploy, not pretrain
The 18-step curriculum
Three release tranches: Foundations ships first; Building and Production follow. Numbers are stable: bookmark step 03 today and it'll still be step 03 next month.
Foundations
pick a model · run it locally · evaluate it · wrap it as an API
- 00 What we're shipping: the 17 steps that follow, from a fresh laptop to a self-hosted, evaluated, instrumented, deployed LLM application.
- 01 Pick your base model: Llama-3 vs Qwen-2.5 vs Mistral vs Phi. What differs, what doesn't, and which one we'll use for the rest of the curriculum.
- 02 Run a model locally with Ollama: from zero to your first generation in five minutes. Then we'll graduate to vLLM in step 03.
- 03 Switch to vLLM: the production-grade inference engine. Same OpenAI schema, ~3× the throughput, proper continuous batching, paged KV cache. (Client sketch after this list; the same code also talks to Ollama.)
- 04 Build an eval harness: lm-eval-harness for benchmarks, plus a custom task-specific eval you write yourself. The "is it any good?" question, answered programmatically. (Sketch after this list.)
- 05 Wrap as a FastAPI service: a /v1/chat/completions of your own, with streaming, API-key auth, structured logging, and health checks. The endpoint your application code talks to. (Skeleton after this list.)
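Steps 02 and 03 converge on the same client code, because both Ollama and vLLM expose an OpenAI-compatible endpoint. A minimal sketch, assuming a vLLM server on its default port 8000 (recent releases start one with `vllm serve <model>`); the model name is a placeholder, swap in whatever you're serving:

```python
# Query a locally served model (steps 02-03). Both Ollama and vLLM speak the
# OpenAI chat-completions schema, so this client works against either.
# Assumption: vLLM on localhost:8000; Ollama would be http://localhost:11434/v1.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-locally",  # the field is required, the value isn't checked
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize continuous batching in one sentence."}],
    stream=True,  # tokens arrive as they are generated
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```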
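For step 04, lm-eval-harness covers the standard benchmarks; the custom half is plain Python. A minimal sketch of a golden-set eval against the local endpoint; the golden set, the substring-match grader, and the names `GOLDEN_SET`/`run_eval` are illustrative, not from the curriculum:

```python
# A minimal custom eval for step 04: run a small golden set through the local
# endpoint and score with a substring check. Real tasks need a grader that
# actually matches the task (exact match, rubric model, unit tests, ...).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

GOLDEN_SET = [  # (prompt, substring the answer must contain) -- toy examples
    ("What is the capital of France?", "Paris"),
    ("What does the KV cache store during decoding?", "key"),
]

def run_eval(model: str) -> float:
    hits = 0
    for prompt, expected in GOLDEN_SET:
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # near-deterministic output makes regressions visible
        ).choices[0].message.content
        hits += int(expected.lower() in answer.lower())
    return hits / len(GOLDEN_SET)

print(f"accuracy: {run_eval('meta-llama/Llama-3.1-8B-Instruct'):.2%}")
```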
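And the shape of the step 05 service, sketched with the upstream call stubbed out; endpoint names, the `API_KEY` env var, and the SSE framing are assumptions about the eventual implementation, not its definition:

```python
# Skeleton of the step 05 wrapper: your own /v1/chat/completions with API-key
# auth, a health check, and streaming via server-sent events. The generator is
# a stub; the real service forwards the request to vLLM and relays its stream.
import os

from fastapi import Depends, FastAPI, Header, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
API_KEY = os.environ.get("API_KEY", "change-me")

class ChatRequest(BaseModel):
    model: str
    messages: list[dict]
    stream: bool = False

def check_key(authorization: str = Header(...)) -> None:
    if authorization != f"Bearer {API_KEY}":
        raise HTTPException(status_code=401, detail="invalid API key")

@app.get("/healthz")
def health() -> dict:
    return {"status": "ok"}

@app.post("/v1/chat/completions", dependencies=[Depends(check_key)])
def chat(req: ChatRequest):
    def generate():
        # Stub: relay vLLM's SSE stream here in the real service.
        for token in ["Hello", ", ", "world"]:
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"

    if req.stream:
        return StreamingResponse(generate(), media_type="text/event-stream")
    return {"choices": [{"message": {"role": "assistant", "content": "Hello, world"}}]}
```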
Building
RAG · vector stores · tools · agent loops · multi-agent
- 06 RAG, the production way: what chunking actually is, why naive splits fail in production, and the four strategies that hold up. The unglamorous step that makes or breaks every RAG system. (Chunker sketch after this list.)
- 07 Embeddings + vector store: pick an embedding model, persist with sqlite-vec, retrieve in ~5 ms. Plus why you probably don't need a dedicated vector DB. (Sketch after this list.)
- 08 Retrieval with BM25 + dense + reranking: the three-stage pipeline that beats any single strategy. Hybrid retrieval, RRF fusion, and a cross-encoder reranker as the silent quality lever. (RRF sketch after this list.)
- 09 Tools and function calling: JSON-schema tool definitions, OSS-model adapters, and the structured-output dance that makes agents possible. (The agent sketch after this list includes a tool definition.)
- 10 Build an agent loop: ReAct + state management + tool execution + termination, and why most agents are simpler than they look. (Loop sketch after this list.)
- 11 Multi-agent orchestration: supervisor + workers, fan-out/fan-in, when to use it, and when one agent is enough. Patterns for reliability.
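The simplest of step 06's chunking strategies, a fixed-size window with overlap, fits in a few lines. A sketch; sizes are character counts and the defaults are arbitrary starting points, not the curriculum's recommendation:

```python
# Fixed-size chunking with overlap (step 06): each window shares `overlap`
# characters with its predecessor, so no sentence is stranded at a boundary
# without context. Production systems layer structure-aware splits on top.
def chunk(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```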
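Step 07's storage layer in miniature, following the sqlite-vec README's documented pattern (verify the query syntax against your installed version, the extension evolves quickly). The embedding function is a stand-in so the sketch runs end to end; the 384 dimensions match small sentence-transformers models:

```python
# Persist embeddings in SQLite via sqlite-vec and run a nearest-neighbour
# query (step 07). Table name, dimension, and the fake embedder are placeholders.
import sqlite3

import sqlite_vec  # pip install sqlite-vec

db = sqlite3.connect("rag.db")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING vec0(embedding float[384])")

def embed(text: str) -> list[float]:
    # Placeholder so the sketch runs; use a real embedding model in practice.
    raw = text.encode("utf-8")
    return [raw[i % len(raw)] / 255.0 for i in range(384)]

# Insert: sqlite-vec takes vectors as packed float32 blobs.
db.execute(
    "INSERT INTO chunks(rowid, embedding) VALUES (?, ?)",
    (1, sqlite_vec.serialize_float32(embed("first document chunk"))),
)

# Query: nearest neighbours by distance.
rows = db.execute(
    "SELECT rowid, distance FROM chunks WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
    (sqlite_vec.serialize_float32(embed("user question")),),
).fetchall()
print(rows)
```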
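The RRF fusion at the heart of step 08 is small enough to show whole. This is the standard reciprocal-rank-fusion formula (k = 60 is the constant from the original RRF paper); the function name and toy IDs are illustrative:

```python
# Reciprocal-rank fusion (step 08): merge a BM25 ranking and a dense ranking
# into one list. Score(d) = sum over rankings of 1 / (k + rank(d)).
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc mid-ranked by both retrievers can beat a doc only one retriever loved:
# the effect that makes hybrid retrieval work.
bm25 = ["a", "b", "c", "d"]
dense = ["c", "a", "e", "b"]
print(rrf_fuse([bm25, dense]))  # ['a', 'c', 'b', ...]
```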
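Steps 09 and 10 combined in one sketch: an OpenAI-style JSON-schema tool definition plus the minimal loop around it. This assumes your server supports OpenAI-format tool calls (vLLM needs its tool-calling options enabled); the weather tool, model name, and step budget are placeholders:

```python
# Minimal agent loop (steps 09-10): call the model, execute any requested
# tool, feed the result back, stop on a plain-text answer or the step cap.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return f"18C and cloudy in {city}"  # stub; a real tool would hit an API

def run_agent(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):           # termination: hard cap on tool rounds
        msg = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=messages,
            tools=TOOLS,
        ).choices[0].message
        if not msg.tool_calls:           # plain answer -> done
            return msg.content
        messages.append(msg)             # keep the tool request in the state
        for call in msg.tool_calls:      # execute every requested tool
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": get_weather(**args),
            })
    return "step budget exhausted"

print(run_agent("What's the weather in Lisbon?"))
```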
Production
observability · evals in prod · cost/latency · deploy patterns
- 12 Observability with Phoenix: trace every model call, every retrieval, every tool. The waterfall view production AI requires. (Instrumentation sketch after this list.)
- 13 Evaluation in production: A/B testing prompts, drift detection, golden-set regression. How to know your model got worse before your users do.
- 14 Cost and latency tuning: prompt caching, KV reuse, continuous batching, quantization, speculative decoding. The five levers behind every serving optimization. (Engine-args sketch after this list.)
- 15 Deploy it for real: Modal vs Replicate vs a $20/mo VPS vs a serverless GPU. Trade-offs, configs, and the deploy command. (Modal sketch after this list.)
- 16 Where to go from here: distillation, post-training, multi-tenant serving, agentic patterns. What's next once the basics are live.
- 17 Synthetic data + distillation: compress a frontier model into a small specialist at roughly a tenth of the serving cost. The pipeline behind every cost-conscious production deploy. (Data-generation sketch after this list.)
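Step 12's setup is mostly wiring. A sketch following the arize-phoenix and openinference docs at the time of writing; these packages move quickly, so treat the exact calls as assumptions and check the current docs:

```python
# Route OpenTelemetry spans to a local Phoenix instance and auto-instrument
# the OpenAI client (step 12), so every model call shows up as a span.
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

px.launch_app()                # local Phoenix UI (default: localhost:6006)
tracer_provider = register()   # send OTLP spans to that instance
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, every openai-client call (including calls aimed at a local
# vLLM server) is traced: prompt, completion, token counts, latency.
```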
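Several of step 14's levers are single engine arguments in vLLM. A sketch using flag names from current vLLM docs; verify them against your installed version, and note the quantization line only works with a matching quantized checkpoint:

```python
# Three of the five levers (step 14), expressed as vLLM engine arguments.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    enable_prefix_caching=True,   # reuse KV cache across requests sharing a prefix
    quantization="awq",           # only with an AWQ-quantized checkpoint; drop otherwise
    gpu_memory_utilization=0.90,  # more KV-cache room -> bigger continuous batches
)

# Continuous batching is automatic: pass many prompts, vLLM schedules them.
outputs = llm.generate(
    ["Explain paged attention.", "Explain speculative decoding."],
    SamplingParams(max_tokens=128),
)
for out in outputs:
    print(out.outputs[0].text)
```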
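For step 15, the Modal flavour fits in one file. A sketch under stated assumptions: the image contents, GPU type, model, and app name are all placeholders, and Modal's API is as documented at the time of writing, so check current docs for exact arguments:

```python
# Deploy a generation function to a serverless GPU with Modal (step 15).
import modal

app = modal.App("llm-stack")
image = modal.Image.debian_slim().pip_install("vllm")

@app.function(image=image, gpu="A10G", timeout=600)
def generate(prompt: str) -> str:
    from vllm import LLM, SamplingParams
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # downloads on cold start
    out = llm.generate([prompt], SamplingParams(max_tokens=128))
    return out[0].outputs[0].text

# Try it: `modal run this_file.py::generate --prompt "..."`,
# then ship it with `modal deploy this_file.py`.
```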
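And the synthetic-data half of step 17: a frontier teacher answers your task prompts, and the (prompt, answer) pairs become the student's fine-tuning set. A sketch; the teacher model, prompts, and output filename are placeholders:

```python
# Generate a distillation dataset (step 17): teacher answers task prompts,
# pairs are written as chat-format JSONL, the shape most fine-tuning stacks accept.
import json

from openai import OpenAI

teacher = OpenAI()  # hosted frontier model; needs OPENAI_API_KEY

task_prompts = [  # in practice: thousands of prompts sampled from real traffic
    "Classify this support ticket: 'My invoice is wrong.'",
    "Classify this support ticket: 'App crashes on login.'",
]

with open("distill_train.jsonl", "w") as f:
    for prompt in task_prompts:
        answer = teacher.chat.completions.create(
            model="gpt-4o",  # placeholder teacher
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        f.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}) + "\n")
# Next: fine-tune a small OSS model on distill_train.jsonl, then re-run the
# step 04 evals to confirm the specialist holds up.
```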
How this fits the rest of the site
Every step cross-references the matching theory article and demo. When you wire RAG in step 06, the article links to RAG Fundamentals and the RAG Visualizer. When you build the observability layer in step 12, the Observability Trace demo shows what the spans look like. The site becomes a multi-modal study environment: shell + editor in some tabs, math + visualizations in others.
Status: 18 live · 0 wip · 0 stubbed. Ships in three releases (foundations → building → production).