Stage 06 — Transformers
The transformer is the architecture nearly every modern AI model rests on. This stage takes you from “what is self-attention?” to “I built GPT in 300 lines of PyTorch.”
If you only deeply understand one stage in this path, make it this one.
Prerequisites
- Stage 03 (neural networks, MLPs, residuals)
- Stage 04 (language modeling, why transformers)
- Stage 05 (tokens, embeddings)
Learning ladder
- Self-attention (KQV) — the central mechanism (sketched in code after this ladder)
- Multi-head attention — multiple “views” of the input
- Positional encoding — sinusoidal, learned, RoPE, ALiBi
- The transformer block — attention + MLP + residual + norm
- GPT from scratch — minimal PyTorch implementation
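Everything on this ladder starts with the first rung. Below is a minimal sketch of a single causal self-attention head, assuming an input of shape (batch, seq_len, d_model); the module and weight names (SelfAttention, W_q, W_k, W_v) are illustrative, not taken from the /build track.

```python
# Minimal sketch of one causal self-attention head (a teaching aid, not a
# reference implementation). Assumes input of shape (batch, seq_len, d_model).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape                                    # (batch, seq_len, d_model)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)      # each (B, T, D)
        scores = q @ k.transpose(-2, -1) / math.sqrt(D)      # (B, T, T)
        causal = torch.tril(torch.ones(T, T, device=x.device)).bool()
        scores = scores.masked_fill(~causal, float("-inf"))  # no attending to the future
        weights = F.softmax(scores, dim=-1)                  # each row sums to 1
        return weights @ v                                   # (B, T, D)
```

The shape to internalize: scores is (B, T, T), one weight per pair of positions, and the causal mask hides the upper triangle so position i only attends to positions at or before i.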
MVU
You can:
- Draw a transformer block on a whiteboard from memory
- Explain what softmax(QKᵀ/√d) V actually does — with shapes
- Distinguish encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) transformers
- Implement a single transformer block in <50 lines of PyTorch (sketched below)
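For that last item, the whole pre-norm block is short. A hedged sketch, reusing the SelfAttention module from the sketch above and leaving out multi-head splitting and dropout:

```python
# Minimal pre-norm transformer block: attention + MLP, each wrapped in a
# residual connection and preceded by LayerNorm. Reuses the SelfAttention
# sketch above; d_ff is conventionally 4 * d_model.
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = SelfAttention(d_model)   # single-head sketch from above
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around the MLP
        return x
```

Stack N of these, add token and position embeddings in front and a projection to vocabulary logits at the end, and you essentially have a decoder-only GPT.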
Exercise
Implement a 6-layer GPT in <300 lines of PyTorch. Train on TinyShakespeare (~1MB). Generate plausibly Shakespearean text. Compare your loss to a public reference (Karpathy’s nanoGPT is the gold standard).
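If you want a starting point, here is one possible training skeleton. The file path, the GPT(...) constructor, the hyperparameters, and the step count are assumptions for illustration; swap in your own model and tune from there.

```python
# One possible training skeleton for the TinyShakespeare exercise.
# Assumes you have written a GPT module with the signature used below;
# the path and hyperparameters are illustrative guesses, not a reference.
import torch
import torch.nn.functional as F

text = open("tinyshakespeare.txt", encoding="utf-8").read()   # assumed local path
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

block_size, batch_size = 256, 64

def get_batch():
    # Random windows of text; targets are the same window shifted by one token.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x, y

model = GPT(vocab_size=len(chars), n_layer=6, n_head=6,
            d_model=384, block_size=block_size)                # your implementation
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(5000):
    x, y = get_batch()
    logits = model(x)                                          # (B, T, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(step, round(loss.item(), 3))
```

When comparing against a reference, compare the full loss curve at matching hyperparameters, not just the final number.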
Why this stage matters
You can build a lot with prompting + RAG + APIs without ever writing attention from scratch. You’ll be a more dangerous engineer if you do. Almost every weird LLM behavior eventually traces to something in this stage — context windows, attention sinks, KV caching, positional encoding limits.
Hands-on companions
The transformer is the one stage on this site with a full hands-on companion track. After the theory here, build the same architecture from scratch:
- /build/05 — implement self-attention — QKV projections in ~50 lines of PyTorch
- /build/06 — multi-head attention — reshape into H heads, each learning different patterns (see the sketch after this list)
- /build/07 — the transformer block — pre-norm + attention + MLP + residual
- /build/08 — wire up GPT — stack N blocks, train, watch perplexity drop
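As a preview of the /build/06 step, the sketch below shows the reshape at the heart of multi-head attention: one projection split into H heads, attended independently, then merged back. The names and the fused-QKV layout are illustrative choices, not the companion's exact code.

```python
# Hedged sketch of the head-splitting reshape in multi-head causal attention.
# One (B, T, D) projection becomes H heads of size D // H.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # fused Q, K, V
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, D = x.shape
        H, hd = self.n_head, D // self.n_head
        q, k, v = self.qkv(x).split(D, dim=-1)
        # (B, T, D) -> (B, H, T, head_dim): each head sees its own slice of channels
        q, k, v = (t.reshape(B, T, H, hd).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hd)          # (B, H, T, T)
        causal = torch.tril(torch.ones(T, T, device=x.device)).bool()
        att = att.masked_fill(~causal, float("-inf"))
        y = F.softmax(att, dim=-1) @ v                           # (B, H, T, head_dim)
        y = y.transpose(1, 2).reshape(B, T, D)                   # merge heads back
        return self.proj(y)
```

Each head attends in its own d_model / H dimensional subspace, which is what lets different heads learn different patterns.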
If you’re shipping something and only need to know how attention works operationally, /ship/14 — cost and latency covers KV cache, prefix caching, and quantization without making you implement attention yourself.
See also
- Stage 07 — Modern LLMs — what’s been added since 2017
- Stage 10 — Fine-tuning — how we adapt transformers