build · hands-on · pytorch
Build your own tiny LLM from scratch.
A code-along walkthrough — open the article in one tab, your editor in another. By the end you'll have written every line of a working 10M-parameter transformer that generates coherent toy text. Every concept cross-references the matching article and interactive demo, so the math, the visualization, and the code are one click apart.
What you end up with
The 17-step curriculum
Three phases, each a coherent run. Live steps render now; stubs mark what's coming. Numbers are stable — you can bookmark step 05 today and it'll still be step 05 next month.
Foundations
math + tokenizer + data
- 00 Set the table: What we'll build, why this curriculum, and how to get your environment ready.
- 01 The math you actually need: Five ideas — vectors, matmul, softmax, cross-entropy, gradients — and where each one shows up in our code (sketch below).
- 02 Build a BPE tokenizer: Train your own byte-pair encoder from scratch — no external library, ~90 lines of Python (sketch below).
- 03 Tokenize TinyStories, build batches: Download the dataset, train the BPE on it, save the token IDs once, and yield (input, target) batches the model can learn from (sketch below).
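To give you a feel for phase one before you open the steps, here are three minimal sketches. First, where two of step 01's ideas, softmax and cross-entropy, meet in a language model: logits over a vocabulary become probabilities, and cross-entropy scores the probability assigned to the true next token. Shapes and numbers here are illustrative, not the article's exact code.

```python
import torch
import torch.nn.functional as F

vocab_size = 8
logits = torch.randn(vocab_size)          # raw scores for each possible next token
probs = torch.softmax(logits, dim=-1)     # softmax: all positive, sums to 1
target = torch.tensor(3)                  # index of the true next token

# cross-entropy = -log(probability assigned to the correct token)
manual = -torch.log(probs[target])
builtin = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
print(manual.item(), builtin.item())      # the two numbers match
```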
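Next, the pair-count-and-merge loop at the heart of step 02's BPE tokenizer. This is a simplified sketch under assumptions (words arrive pre-split into characters, frequencies live in a plain dict); the helper names are invented here, not the step's actual API.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in word_freqs.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# toy corpus: words as tuples of single characters, with their frequencies
word_freqs = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(10):                      # learn 10 merge rules
    pairs = get_pair_counts(word_freqs)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair wins
    word_freqs = merge_pair(best, word_freqs)
    merges.append(best)
print(merges[:5])
```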
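And a sketch of step 03's batching: once the corpus is one flat tensor of token IDs, a training batch is just random windows with targets shifted one position to the right. `block_size` and `batch_size` are assumed names, not necessarily the step's exact signature.

```python
import torch

def get_batch(token_ids, batch_size=32, block_size=128):
    """Sample random contiguous windows; targets are inputs shifted by one token."""
    starts = torch.randint(0, len(token_ids) - block_size - 1, (batch_size,))
    x = torch.stack([token_ids[s : s + block_size] for s in starts])
    y = torch.stack([token_ids[s + 1 : s + block_size + 1] for s in starts])
    return x, y

token_ids = torch.randint(0, 4096, (100_000,))   # stand-in for tokenized TinyStories
x, y = get_batch(token_ids)
print(x.shape, y.shape)   # torch.Size([32, 128]) torch.Size([32, 128])
```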
The Model
attention → transformer → GPT
- 04 Embeddings and positional encoding: Two lookup tables — one for tokens, one for positions — added together to start the residual stream.
- 05 Scaled dot-product attention, from scratch: The keystone of every transformer, in 30 lines of PyTorch (sketch below).
- 06 Multi-head attention: Run n_heads attentions in parallel — efficiently, with one big projection matrix and a reshape (sketched below, together with step 07).
- 07 The transformer block: LayerNorm → MultiHead → residual → LayerNorm → MLP → residual. The pattern stacked N times to make a deep transformer.
- 08 Assemble the GPT class: Embeddings + N transformer blocks + final norm + LM head + tied weights. The whole model in 60 lines.
- 09 The training loop: AdamW with weight-decay groups, warmup + cosine LR schedule, gradient clipping, periodic eval, checkpoints (schedule sketched below).
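A taste of step 05, the heart of the whole model: scaled dot-product attention with a causal mask. This is a minimal sketch of the standard computation; the 30-line version in the step may differ in detail.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """q, k, v: (batch, seq_len, head_dim). Returns the attention-weighted values."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)            # (batch, seq, seq)
    seq_len = q.size(-2)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))           # no peeking at future tokens
    weights = F.softmax(scores, dim=-1)                        # each row sums to 1
    return weights @ v

q = k = v = torch.randn(2, 16, 32)
print(causal_attention(q, k, v).shape)   # torch.Size([2, 16, 32])
```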
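Steps 06 and 07 together, sketched under assumed sizes: one big projection produces q, k, and v for all heads at once, a reshape splits them into heads, and the pre-norm block wraps attention and an MLP in residual connections. PyTorch's built-in `scaled_dot_product_attention` stands in here for the hand-written step-05 version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # one projection yields q, k, v for every head
        self.proj = nn.Linear(d_model, d_model)      # mixes the heads back together

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split channels into heads: (B, T, C) -> (B, n_heads, T, C // n_heads)
        split = lambda t: t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).contiguous().view(B, T, C))

class Block(nn.Module):
    """Pre-norm block: LayerNorm -> attention -> residual, LayerNorm -> MLP -> residual."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual connection around attention
        x = x + self.mlp(self.ln2(x))    # residual connection around the MLP
        return x

x = torch.randn(2, 16, 128)
print(Block()(x).shape)   # torch.Size([2, 16, 128])
```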
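And one piece of step 09 that is easy to get subtly wrong: the learning-rate schedule. Here is a common warmup-then-cosine shape; the hyperparameter values below are placeholders, not the article's settings.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=200, max_steps=5000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0 over training
    return min_lr + cosine * (max_lr - min_lr)

for step in (0, 100, 200, 2500, 5000):
    print(step, round(lr_at(step), 6))
```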
Make It Real
sampling, scaling, tuning, eval, ship
- 10 Sampling and decoding: Greedy, temperature, top-k, top-p (nucleus). Same model, four very different writers (sketch below).
- 11 Scale it up: 1M → 10M → 100M params. Same architecture, three sizes. What changes (and what doesn't) when you grow the model. Chinchilla, in practice.
- 12 Fine-tune with LoRA: Adapt your trained base model to a new behavior — with 800× fewer trainable parameters than full fine-tuning (sketch below).
- 13 Evaluate honestly: Three lenses on model quality — perplexity, generation samples, LLM-as-judge. None complete on its own; together they let you make decisions instead of vibes.
- 14 Inference: KV cache + ONNX export. Two operations that make generation production-fast: cache K and V across steps, then export the whole model to a portable ONNX file (cache sketched below).
- 15 Run it in your browser: Capstone — load your trained ONNX model into onnxruntime-web and generate text from a single HTML page. No Python, no server.
- 16 Where to go from here: Twelve threads to pull on, ranked by leverage. The base model is no longer mysterious; the rest of the field is the next ~dozen weekends.
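A few sketches from this phase too. Step 10's decoding knobs, applied to a single logits vector: temperature rescales the logits, top-k keeps only the k most likely tokens, top-p keeps the smallest set whose probability mass reaches p. The defaults below are illustrative, not the step's settings.

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Pick the next token id from a (vocab_size,) logits vector."""
    logits = logits / temperature                       # <1 sharpens, >1 flattens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]      # value of the k-th best logit
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    if top_p is not None:
        sorted_probs, idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < top_p  # keep until mass >= p
        probs = torch.zeros_like(probs).scatter(0, idx[keep], sorted_probs[keep])
        probs = probs / probs.sum()                     # renormalize the kept mass
    return torch.multinomial(probs, num_samples=1).item()

print(sample_next(torch.randn(4096)))
```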
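The core idea of step 12, LoRA, in isolation: freeze a pretrained linear layer and learn a small low-rank update beside it. Rank, scaling, and init choices here are assumptions, not the step's exact recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update (B @ A), scaled by alpha / rank."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # the pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(128, 128))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable} trainable of {total} total parameters")
```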
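And the intuition behind step 14's KV cache: during generation, keys and values for earlier positions never change, so keep them around and compute attention only for the newest token. The sketch below works on bare q/k/v tensors rather than a full model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
head_dim = 32
k_cache = torch.empty(0, head_dim)   # grows by one row per generated token
v_cache = torch.empty(0, head_dim)

for step in range(5):
    # in the real model these come from projecting the newest token's hidden state
    q_new = torch.randn(1, head_dim)
    k_new = torch.randn(1, head_dim)
    v_new = torch.randn(1, head_dim)

    k_cache = torch.cat([k_cache, k_new], dim=0)   # reuse every previous key/value
    v_cache = torch.cat([v_cache, v_new], dim=0)

    # attention for just the newest position over everything cached so far
    out = F.scaled_dot_product_attention(
        q_new.unsqueeze(0), k_cache.unsqueeze(0), v_cache.unsqueeze(0)
    )
    print(step, k_cache.shape, out.shape)
```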
How this fits the rest of the site
Every step cross-references the matching theory article and interactive demo. When you implement attention in step 05, the page links to the math derivation and the Attention Inspector. The site becomes a multi-modal study environment: code in your editor, math in one tab, real-model visualization in another.
You're not expected to have read the theory articles first. The build track is self-contained — but if anything in a step feels shaky, the matching theory + demo are exactly one click away.
Status: 17 live · 0 wip · 0 stubbed. The full ~3-month roadmap is set; articles ship in three releases (foundations → model → make it real).