
step 16 · build

Where to go from here

Twelve threads to pull on, ranked by leverage. The base model is no longer mysterious; the rest of the field is the next ~dozen weekends.

wrap

You’ve finished the curriculum. Sixteen steps. About 600 lines of Python that compose into a working transformer. A trained model on TinyStories. A LoRA-fine-tuned variant. An ONNX export. A browser demo. Every line you wrote.

The natural question is: now what? Modern LLMs are full of architectural nuances and training tricks we deliberately skipped to keep the curriculum focused. None of them are conceptually harder than what you’ve already implemented; most are local refinements with clear motivation.

This is a tour, not a tutorial. Each item is one paragraph plus a pointer to the canonical paper or implementation. The order is roughly by leverage: what you would replace first if you were rewriting tiny_llm to feel “modern.”

The architecture cabinet

1. RoPE (Rotary Position Embeddings)

Step 04 used learned absolute positional embeddings — fine, but bounded by max_seq_len. RoPE (Su et al. 2021) replaces them by rotating the Q and K vectors before the attention dot product, with rotation angle proportional to position. The dot product Q_i · K_j then naturally encodes the relative offset (i − j) rather than absolute positions.

Why it’s the modern default (LLaMA, Qwen, GPT-NeoX, Mixtral): better extrapolation past the trained context length, and it lets you use techniques like YaRN to extend context further at inference time. The Positional Encoding Lab compares all four schemes (sinusoidal / learned / RoPE / ALiBi) on the same input.

Implementation: ~30 LOC swap inside forward_cached of MultiHeadAttention. You’d remove the pos_emb from Embed entirely.
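
A minimal sketch of the idea, assuming q and k are shaped (batch, heads, seq, head_dim) as in most from-scratch implementations; names are illustrative, not the tiny_llm API, and this uses the split-halves rotation convention (an interleaved-pairs variant is equally common):

import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of q or k by an angle proportional to position."""
    b, h, t, d = x.shape
    half = d // 2
    # One frequency per channel pair (theta_i = base^(-2i/d)), one angle per position.
    freqs = base ** (-torch.arange(half, device=x.device) / half)        # (half,)
    angles = torch.arange(t, device=x.device)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation applied pair-wise; with a KV cache you would offset the
    # position indices by the number of already-cached tokens.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

Apply it to q and k just before the dot product; the learned pos_emb in Embed then goes away.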

2. RMSNorm instead of LayerNorm

Step 07 used nn.LayerNorm. RMSNorm (Zhang & Sennrich 2019) drops the mean-subtraction step:

RMSNorm(x) = x · γ / sqrt(mean(x²) + ε)

Slightly fewer FLOPs, slightly less memory, identical training quality at our scale. LLaMA, Mistral, and Qwen all use it. Three-line change to Block.
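
A sketch matching the formula above (class and attribute names illustrative):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))  # learned scale, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Root-mean-square over the feature dimension; no mean subtraction.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.gamma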

3. SwiGLU MLP

Step 07’s MLP was Linear → GELU → Linear. The SwiGLU variant (Shazeer 2020, used by LLaMA):

SwiGLU(x) = Linear_3(silu(Linear_1(x)) ⊙ Linear_2(x))

Three linears instead of two; the extra cost is offset by reducing d_ff from 4·d_model to 8/3 · d_model for parameter parity. Modestly better quality per parameter. ~20 LOC change to MLP.
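
A sketch of the swap, assuming the MLP takes and returns (batch, seq, d_model); the 8/3 factor keeps the parameter count roughly level with the old 4·d_model GELU MLP:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        d_ff = int(8 * d_model / 3)                     # instead of 4 * d_model
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gated branch
        self.w2 = nn.Linear(d_model, d_ff, bias=False)  # linear branch
        self.w3 = nn.Linear(d_ff, d_model, bias=False)  # projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))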

4. Multi-Query / Grouped-Query Attention

Step 06’s MHA uses one K and one V per head. MQA (Shazeer 2019) shares one K, V across all heads — drops the KV cache by n_heads×, hugely valuable at inference. GQA (Ainslie et al. 2023) is the in-between: groups of heads share K, V. LLaMA-2 70B uses GQA-8.

Worth doing the moment your KV cache is the dominant inference cost — i.e., long-context production serving. The architecture change is ~15 LOC in mha.py.
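
A self-contained sketch (not the mha.py code) showing the core trick: project fewer K/V heads, then repeat them so every query head has a partner. Setting n_kv_heads = 1 gives MQA; n_kv_heads = n_heads recovers ordinary MHA.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GQA(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Only the n_kv_heads copies would live in a KV cache; the repeat happens
        # per forward pass so each group of query heads shares one K/V head.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))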

5. Mixture of Experts (MoE)

Replace each MLP with n_experts parallel MLPs and a learned router that dispatches each token to top-K experts. Mixtral 8×7B has 47B total parameters but only 13B activate per token; the MoE Routing demo walks through the mechanism. Big quality jump per inference cost.

The implementation is more involved (~150 LOC plus a load-balancing loss term), but the conceptual addition is small once you have a working transformer block.
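
A top-k routing sketch under the simplest possible assumptions: no load-balancing loss, no capacity limits, and a plain loop over experts (real implementations batch the dispatch):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = x.reshape(-1, x.shape[-1])               # (tokens, d_model)
        scores = self.router(flat)                      # (tokens, n_experts)
        weights, picks = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # per-token mixing weights
        out = torch.zeros_like(flat)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = picks[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape_as(x)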

6. Long context (sliding window, sparse attention)

Step 05’s attention is O(N²) in sequence length. For 128k+ context windows you need something cheaper. The Long Context demo shows the standard mask patterns; Mistral uses sliding-window attention, GPT-4 is rumored to use blocked sparse attention, and Mamba/Mamba-2 ditch attention for state-space layers entirely.

Probably not worth implementing yourself unless you have a use case that demands it; libraries like FlashAttention are the right tool.
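
For intuition, though, the mask itself is tiny. A sketch (window size illustrative): token i may attend to tokens i − window + 1 through i, so per-token cost drops from O(N) to O(window).

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)   # True where attention is allowed

print(sliding_window_mask(6, 3).int())   # lower-triangular band of width 3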

Post-training

7. SFT (Supervised Fine-Tuning) at real scale

Step 12 fine-tuned on 200 toy examples. Production SFT uses 50k–500k carefully curated (instruction, response) pairs. Datasets to know: Alpaca (52k, GPT-4-generated), Dolly (15k, human-written), OASST1 (cleaner, conversation-style), UltraChat (200k+).

Mechanically identical to step 12. The skill is dataset construction and curation, not the loss function. The HuggingFace trl library wraps the rest.

8. DPO (Direct Preference Optimization)

After SFT, models still produce confident wrong answers. DPO (Rafailov et al. 2023) trains directly on (prompt, chosen_response, rejected_response) triples, pulling the model toward chosen and pushing it from rejected. Simpler than RLHF — no reward model or PPO, just a clever loss function:

loss = − log σ(β · [(log π_θ(chosen | prompt) − log π_ref(chosen | prompt))
                  − (log π_θ(rejected | prompt) − log π_ref(rejected | prompt))])

where π_ref is the frozen reference model, usually the SFT checkpoint you started from.

The RLHF Preference demo shows the underlying preference-modeling math. DPO replaces RLHF in most modern post-training stacks (LLaMA-3, Mistral models, almost everything except OpenAI itself).
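
A sketch of the loss, assuming you have already computed summed token log-probs of each full response under the trainable policy and under the frozen reference model; all names are illustrative:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # How strongly the policy prefers chosen over rejected...
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # ...relative to how strongly the reference model already preferred it.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()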

9. RLHF with PPO

The classic preference-tuning method: train a reward model from preference data, then use PPO to optimize the LLM against that reward. Slower and more fragile than DPO, but historically what made ChatGPT a chatbot. The Hugging Face TRL library implements it cleanly; OpenAI’s RLHF paper is the canonical reference.

Mostly displaced by DPO and GRPO for new training. Still worth understanding because production safety teams use it for adversarial robustness.

Inference & deployment

10. FlashAttention

Step 14 used a naive O(T²) attention computation. FlashAttention (Dao 2022) computes exactly the same attention, but in an IO-aware way that never materializes the T×T weight matrix. ~3× speedup, and it allows bigger batches and longer contexts.

Drop-in replacement: F.scaled_dot_product_attention(q, k, v, is_causal=True) since PyTorch 2.0. You should swap to it the moment you scale past our 5M model.
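
A sketch of the swap on dummy tensors, assuming the (batch, heads, seq, head_dim) layout; the fused call returns the same result as the naive computation from step 14:

import math
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 4, 128, 16)        # (batch, heads, seq, head_dim)

# Naive attention: materializes the 128 x 128 weight matrix.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
causal = torch.tril(torch.ones(128, 128, dtype=torch.bool))
naive = scores.masked_fill(~causal, float("-inf")).softmax(dim=-1) @ v

# Fused kernel: same math, no materialized matrix, FlashAttention when available.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(naive, fused, atol=1e-4))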

11. Speculative decoding

Use a small “draft” model to propose k tokens; the big model verifies them in parallel. If most are accepted, you’ve decoded k tokens at the cost of one big-model forward pass. ~2–3× speedup with no quality loss. The Cost & Latency Calculator demo compares throughput with and without it.

Implementation is not architectural — it’s a serving-stack change. vLLM and SGLang both support it.
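
Still, the core loop fits in a page. A greedy, batch-size-1 sketch (hypothetical draft/target callables returning (batch, seq, vocab) logits); the real algorithm samples and uses a probabilistic accept/reject test that exactly preserves the target distribution, so exact-match acceptance here is only the simplest illustration:

import torch

@torch.no_grad()
def speculative_step(target, draft, ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    # 1. Draft model proposes k tokens autoregressively (cheap forward passes).
    proposal = ids
    for _ in range(k):
        nxt = draft(proposal)[:, -1, :].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=1)
    # 2. Target model scores all k drafted positions in one forward pass.
    verified = target(proposal)[:, -k - 1:-1, :].argmax(-1)
    # 3. Keep the longest prefix where the target agrees with the draft.
    drafted = proposal[:, -k:]
    agree = (verified == drafted).long().cumprod(dim=1)
    n_accept = int(agree.sum())
    return proposal[:, : ids.shape[1] + n_accept]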

12. Quantization (int8, int4, GPTQ, AWQ)

Step 14 mentioned dynamic int8 quantization. The next tier: GPTQ (Frantar et al. 2022) and AWQ (Lin et al. 2023) both compress to 4-bit while preserving quality much better than naive rounding, by considering the distribution of activations. The Quantization Lab demo shows what 4-bit weights look like.

Run a 70B-param LLaMA in 35 GB instead of 140 GB. The standard hobbyist deployment path.
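
To make the 35 GB figure concrete: 4 bits is half a byte per parameter versus two bytes for fp16. The sketch below does that arithmetic plus a deliberately naive symmetric int4 round-trip; GPTQ and AWQ beat this per-tensor rounding precisely because they calibrate against activation statistics.

import torch

params = 70e9
print(f"fp16: {params * 2 / 1e9:.0f} GB   int4: {params * 0.5 / 1e9:.0f} GB")

w = torch.randn(4096, 4096)                  # stand-in weight matrix
scale = w.abs().max() / 7                    # symmetric int4 range is [-8, 7]
w_q = (w / scale).round().clamp(-8, 7)       # quantize to 4-bit integers
w_hat = w_q * scale                          # dequantize
print("naive int4 relative error:", ((w - w_hat).norm() / w.norm()).item())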

Reading list, by leverage

If you have one weekend after this curriculum, read the original transformer paper (Vaswani et al. 2017) — same architecture you implemented, with the original motivation.

If you have one month:

  1. Karpathy’s nanoGPT — same scope as our build track but in 300 lines. Read for taste differences. Karpathy’s zero-to-hero series is the video version.
  2. Tri Dao’s FlashAttention paper — once you understand attention, FlashAttention is genuinely beautiful.
  3. The LLaMA-3 paper (Meta 2024) — production-tier transformer with full training-stack disclosure. Compare each section to what we built.
  4. Anthropic’s Circuits work — once you can build a transformer, you’ll want to take one apart. transformer-circuits.pub is the foundational mechanistic-interpretability work.

If you have a year, write your own version of every paper above, in the same way you’ve written the model.

Two things to actually do next

  1. Train SMALL on TinyStories (step 11 config). Two GPU-hours and you get a model that produces genuinely fluent toy stories. That’s a different feeling than the TINY model from step 09 — confidence that the architecture and recipe scale.

  2. Pick a specific paper from above and reimplement that one piece. Add RoPE to your model. Or swap LayerNorm for RMSNorm. Or implement DPO on top of your SFT. Each one is a weekend; each one teaches more than reading the paper alone.

The curriculum is finished; the skill, like every engineering skill, is open-ended.

What you have now

Concretely, after sixteen steps:

tiny_llm/
├── tokenize.py          ~90 lines  — BPE from scratch
├── data.py              ~80 lines  — TinyStories pipeline
├── embed.py             ~25 lines  — token + position embedding
├── attention.py         ~40 lines  — single-head attention
├── mha.py               ~50 lines  — multi-head attention + KV cache
├── block.py             ~60 lines  — transformer block
├── gpt.py              ~120 lines  — full model + sampling + generation
├── train.py            ~150 lines  — training loop
├── lora.py              ~50 lines  — LoRA adapter
├── finetune.py          ~80 lines  — fine-tuning loop
├── eval.py              ~80 lines  — three-lens eval
└── export.py            ~30 lines  — ONNX + tokenizer serialization

browser-demo/
├── index.html           ~50 lines
├── app.js              ~120 lines  — JS BPE + sampling + ONNX inference
├── tokenizer.json      ~80 KB
└── tiny_llm.onnx       ~21 MB

About 1,000 lines of code, total. From "I want to understand transformers" to "I trained one and ran it in my browser."

That’s the artifact. Not novel research — the field built every piece before you. But the understanding is yours, line by line. When you read about a 100B-param frontier model now, every section of the paper maps to lines of your own repo, scaled differently. Architecture isn’t where the next decade of progress will be; data, post-training, and tooling are. The thing you needed to learn — that the underlying machinery is legible — you’ve learned.

Cross-references for the road

The whole site, if you want to drill on anything:

  • Articles — 109 theory articles across 15 stages
  • Demos — 56 interactive demos, 13 with narrated walkthroughs
  • Learn — the curriculum view, four tracks

And, when in doubt, the transformer-block article and the Inference Pipeline demo are the two things to keep open in another tab while you scale up.

Thanks for finishing. Now go build.