production-stack foundations · 00 / 17 · 10 min read · 5 min hands-on

What we're shipping

The 17 steps from a fresh laptop to a self-hosted, evaluated, instrumented, deployed LLM application.

Setup

There are good ways to learn how an LLM works. The /build curriculum walks you through writing one from scratch — every line of attention, every byte of the tokenizer, the whole training loop. By the end you can read an arXiv paper and know which lines of your repo correspond to each section.

But there’s a gap between “I understand how a transformer works” and “I shipped an LLM application that real users hit and it didn’t fall over.” Most working AI engineers in 2026 don’t pretrain models. They build applications on top of OSS or API base models. That’s what this track is for.

By the end you’ll have:

  • A self-hosted Llama-3-8B (or Qwen-2.5, or Mistral) running on your machine via vLLM
  • An eval harness that grades it against benchmarks plus your own task-specific tests
  • A FastAPI service exposing a streaming chat endpoint
  • A RAG pipeline with chunkers, dense + BM25 retrieval, and a cross-encoder reranker (a sketch of how the two retrieval signals get merged follows this list)
  • Tools / function-calling for the model to interact with your stack
  • An agent loop with state and observability
  • Cost + latency tuning (KV cache, batching, speculative decoding, quantization)
  • A deployed instance behind a real URL with monitoring and trace dashboards
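
As a taste of the retrieval piece before we get there: merging a BM25 ranking with a dense (embedding) ranking is usually done with something like reciprocal rank fusion. The sketch below is illustrative only; the document IDs are made up, and the real pipeline in the Building tranche may combine the signals differently.

# Illustrative sketch: reciprocal rank fusion (RRF) merges two ranked lists.
# Not the code we'll write in the RAG steps; the doc IDs are hypothetical.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc-3", "doc-1", "doc-7"]    # hypothetical BM25 ranking
dense_hits = ["doc-1", "doc-9", "doc-3"]   # hypothetical dense ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# -> ['doc-1', 'doc-3', 'doc-9', 'doc-7']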

Nothing here is novel research. Everything is the same architectural pattern that powers every LLM application in production today. Once you’ve shipped this track, the entire production AI stack is legible to you. When a vendor says “we use vLLM with FlashAttention and a custom retrieval layer,” you’ll know exactly which 200 lines of your code that sentence describes.

Who this is for

Three audiences, in roughly increasing distance from the current state of the field:

Engineers who already build with LLMs. You’re calling OpenAI APIs and stitching together LangChain. This curriculum gets you off SaaS and onto self-hosted, with the same or better quality at a fraction of the cost. You learn which pieces actually matter, which you can skip, and how to stop being a vendor’s hostage.

Engineers who finished /build. You wrote the model from scratch. Now you want to know what happens after pretraining: how to deploy it, how to evaluate it, how to make it fast enough for real users. This is that part.

Engineers new to AI engineering. You can skip /build. You don’t need to write a transformer from scratch to ship a transformer-backed application well. Start at step 00 here; we’ll point at the relevant theory articles as concepts come up.

What this curriculum is NOT

Three things to set expectations:

  • Not a Karpathy zero-to-hero. Karpathy’s series teaches you to build a transformer. This series teaches you to operate one. Different scope, different audience.
  • Not a LangChain quickstart. We don’t use LangChain. We use the underlying pieces directly so you understand what’s happening. (You’re free to use LangChain in your own work; you’ll just understand it better after this.)
  • Not a research paper. Everything here is published technique, applied. The novelty (if any) is in the integration — connecting tokenizer + inference + retrieval + tools + observability into a coherent system you can actually run.

The 17-step path

Three release tranches, each with a coherent endpoint:

Foundations (steps 00–05). Pick a model, run it locally with two different inference engines (Ollama for ease, vLLM for production), build an eval harness, wrap it as a streaming HTTP API. Endpoint: you have a self-hosted LLM-as-a-service running on your machine, with a /v1/chat/completions endpoint comparable to OpenAI’s.

Building (steps 06–11). Add RAG (chunking, embedding, retrieval, reranking), tools and function calling, an agent loop, multi-agent orchestration patterns. Endpoint: your service can answer questions over your own documents and execute multi-step plans with tools.

Production (steps 12–16). Observability, evaluation in production, cost and latency tuning, deployment patterns. Endpoint: it runs behind a real URL with monitoring, retries, and a cost dashboard.
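
To make the Foundations endpoint concrete, here is roughly what calling it will look like once step 05 is done. Treat the host, port, and model name as placeholders; the real values fall out of steps 02–05.

import httpx

# Placeholder URL and model name; yours will come from steps 02-05.
resp = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama-3-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "stream": False,
    },
    timeout=60.0,
)
print(resp.json()["choices"][0]["message"]["content"])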

The full curriculum is on the ship index. Each step is one article. Most steps are 15–25 minutes of reading and 15–60 minutes of hands-on work — installing things, writing config, running curl commands, debugging logs.

What you’ll need

The setup is more flexible than /build because we’re orchestrating tools instead of writing PyTorch from scratch. The bare minimum:

  1. A laptop with 16 GB+ RAM. Llama-3-8B in 4-bit quantization fits comfortably (quick arithmetic after this list). If you have less, we’ll cover smaller models (Qwen-2.5-3B, Phi-3-mini) that fit in 8 GB.
  2. Python 3.11+ for the application code.
  3. curl for poking endpoints during development. (httpie works too if you prefer.)
  4. Docker (optional but useful). A few of the production tools we’ll use ship as containers. We won’t require it for the foundations release.
  5. An editor. Same as /build — VS Code, Cursor, vim, anything.
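
The 16 GB figure in item 1 is easy to sanity-check: weight memory is roughly parameter count times bytes per parameter. The numbers below ignore the KV cache and runtime overhead, which the cost-and-latency tuning step digs into later.

# Rough weight-only footprint for an 8B-parameter model (ignores KV cache,
# activations, and runtime overhead).
params = 8e9
for fmt, bytes_per_param in {"fp16": 2.0, "int8": 1.0, "int4": 0.5}.items():
    print(f"{fmt}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")

# fp16: ~16 GB   -> no headroom left on a 16 GB laptop
# int8: ~8 GB
# int4: ~4 GB    -> why 4-bit quantization "fits comfortably"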

A GPU helps but isn’t required for the foundations release. For step 03 (vLLM) a GPU makes a real speed difference; we’ll point at the cheapest cloud options (RunPod, Modal, or a Lambda Labs spot instance) for that step.

Set up the project

Create the working directory. We’ll use the same uv-based Python project pattern as /build:

uv init llm-stack
cd llm-stack

# Two top-level dirs: backend code, and a place for model weights.
mkdir -p stack data

uv init already created a pyproject.toml; we’ll grow it over the course of the track. For now:

# Add the foundational dependencies.
uv add httpx fastapi "uvicorn[standard]" pydantic

Verify:

uv run python -c "import httpx, fastapi; print('OK,', fastapi.__version__)"

Expected:

OK, 0.115.x

If you see that, you have a working FastAPI scaffold. We’ll fill in the actual server in step 05; for steps 02–04 we’ll mostly be poking external services with curl and Python clients.
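
If you want a second smoke test beyond the import check, a throwaway app confirms that uvicorn can actually serve something. This is scratch code, not the step 05 service, and the file name is arbitrary:

# scratch_app.py -- throwaway health-check app, not the real service.
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthz")
def healthz() -> dict[str, str]:
    return {"status": "ok"}

Run it with uv run uvicorn scratch_app:app --port 8000, hit http://localhost:8000/healthz with curl, and delete the file once you've seen {"status": "ok"} come back.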

How each step is structured

The pattern, same as /build:

  1. Contract callout at the top — what you’ll set up, what you’ll be able to run, prerequisites
  2. The big picture — why we’re doing this step, where it sits in the stack
  3. Implementation — commands, code, configs, with explanations
  4. Sanity check — a curl command, a script, a thing you run to verify it worked
  5. Cross-references — a related article on this site, a relevant demo
  6. What we did and didn’t do — explicit “here’s what’s in scope, here’s what we punted”
  7. What’s next — one-liner pointing at the next step

The voice is engineer-to-engineer, same as /build. The format is intentionally code-along — open the article in one tab, your terminal in another.

Two anti-patterns

Same as /build:

  1. Skipping the sanity checks. Each step ends with a “run this and expect this output” pairing. If you skip it, you’re reading a tutorial. If you do it, you’re operating a system. The point isn’t validation; it’s that you encounter the moment when your endpoint returns a real token and the latency is a real number.

  2. Copy-pasting whole files at once. As before — type the names yourself for the muscle memory. Most of what we write is small (config files, ~50-line Python services). Typing it isn’t slow, and it’s where the understanding lands.

Next

Step 01 is picking a base model. Llama-3, Qwen-2.5, Mistral, Phi — what differs, what doesn’t, what matters for which use cases. Five-minute decision article; you’ll have your model picked by the end. Step 02 is where the hands-on starts: pulling the model with Ollama and getting your first generation in under five minutes.