Stage 06 — Transformers
The transformer is the architecture nearly every modern AI model rests on. This stage takes you from “what is self-attention?” to “I built GPT in 300 lines of PyTorch.”
If you only deeply understand one stage in this path, make it this one.
Prerequisites
- Stage 03 (neural networks, MLPs, residuals)
- Stage 04 (language modeling, why transformers)
- Stage 05 (tokens, embeddings)
Learning ladder
- Self-attention (KQV) — the central mechanism (sketched in code after this ladder)
- Multi-head attention — multiple “views” of the input
- Positional encoding — sinusoidal, learned, RoPE, ALiBi
- The transformer block — attention + MLP + residual + norm
- GPT from scratch — minimal PyTorch implementation
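Everything on this ladder starts with the first rung. Below is a minimal sketch of a single causal self-attention head, assuming an input of shape (batch, seq_len, d_model); the module and weight names (SelfAttention, W_q, W_k, W_v) are illustrative, not taken from the /build track.

```python
# Minimal sketch of one causal self-attention head (a teaching aid, not a
# reference implementation). Assumes input of shape (batch, seq_len, d_model).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape                                    # (batch, seq_len, d_model)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)      # each (B, T, D)
        scores = q @ k.transpose(-2, -1) / math.sqrt(D)      # (B, T, T)
        causal = torch.tril(torch.ones(T, T, device=x.device)).bool()
        scores = scores.masked_fill(~causal, float("-inf"))  # no attending to the future
        weights = F.softmax(scores, dim=-1)                  # each row sums to 1
        return weights @ v                                   # (B, T, D)
```

The shape to internalize: scores is (B, T, T), one weight per pair of positions, and the causal mask hides the upper triangle so position i only attends to positions at or before i.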
MVU
You can:
- Draw a transformer block on a whiteboard from memory
- Explain what softmax(QKᵀ/√d) V actually does — with shapes
- Distinguish encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) transformers
- Implement a single transformer block in <50 lines of PyTorch (sketched below)
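For that last item, the whole pre-norm block is short. A hedged sketch, reusing the SelfAttention module from the sketch above and leaving out multi-head splitting and dropout:

```python
# Minimal pre-norm transformer block: attention + MLP, each wrapped in a
# residual connection and preceded by LayerNorm. Reuses the SelfAttention
# sketch above; d_ff is conventionally 4 * d_model.
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = SelfAttention(d_model)   # single-head sketch from above
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around the MLP
        return x
```

Stack N of these, add token and position embeddings in front and a projection to vocabulary logits at the end, and you essentially have a decoder-only GPT.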
Exercise
Implement a 6-layer GPT in <300 lines of PyTorch. Train on TinyShakespeare (~1MB). Generate plausibly Shakespearean text. Compare your loss to a public reference (Karpathy’s nanoGPT is the gold standard).
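If you want a starting point, here is one possible training skeleton. The file path, the GPT(...) constructor, the hyperparameters, and the step count are assumptions for illustration; swap in your own model and tune from there.

```python
# One possible training skeleton for the TinyShakespeare exercise.
# Assumes you have written a GPT module with the signature used below;
# the path and hyperparameters are illustrative guesses, not a reference.
import torch
import torch.nn.functional as F

text = open("tinyshakespeare.txt", encoding="utf-8").read()   # assumed local path
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

block_size, batch_size = 256, 64

def get_batch():
    # Random windows of text; targets are the same window shifted by one token.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x, y

model = GPT(vocab_size=len(chars), n_layer=6, n_head=6,
            d_model=384, block_size=block_size)                # your implementation
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(5000):
    x, y = get_batch()
    logits = model(x)                                          # (B, T, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(step, round(loss.item(), 3))
```

When comparing against a reference, compare the full loss curve at matching hyperparameters, not just the final number.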
Why this stage matters
You can build a lot with prompting + RAG + APIs without ever writing attention from scratch. You’ll be a more dangerous engineer if you do. Almost every weird LLM behavior eventually traces to something in this stage — context windows, attention sinks, KV caching, positional encoding limits.
Hands-on companions
The transformer is the one stage on this site with a full hands-on companion track. After the theory here, build the same architecture from scratch:
- /build/05 — implement self-attention — QKV projections in ~50 lines of PyTorch
- /build/06 — multi-head attention — reshape into H heads, each learning different patterns (see the sketch after this list)
- /build/07 — the transformer block — pre-norm + attention + MLP + residual
- /build/08 — wire up GPT — stack N blocks, train, watch perplexity drop
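As a preview of the /build/06 step, the sketch below shows the reshape at the heart of multi-head attention: one projection split into H heads, attended independently, then merged back. The names and the fused-QKV layout are illustrative choices, not the companion's exact code.

```python
# Hedged sketch of the head-splitting reshape in multi-head causal attention.
# One (B, T, D) projection becomes H heads of size D // H.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # fused Q, K, V
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, D = x.shape
        H, hd = self.n_head, D // self.n_head
        q, k, v = self.qkv(x).split(D, dim=-1)
        # (B, T, D) -> (B, H, T, head_dim): each head sees its own slice of channels
        q, k, v = (t.reshape(B, T, H, hd).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hd)          # (B, H, T, T)
        causal = torch.tril(torch.ones(T, T, device=x.device)).bool()
        att = att.masked_fill(~causal, float("-inf"))
        y = F.softmax(att, dim=-1) @ v                           # (B, H, T, head_dim)
        y = y.transpose(1, 2).reshape(B, T, D)                   # merge heads back
        return self.proj(y)
```

Each head attends in its own d_model / H dimensional subspace, which is what lets different heads learn different patterns.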
If you’re shipping something and only need to know how attention works operationally, /ship/14 — cost and latency covers KV cache, prefix caching, and quantization without making you implement attention yourself.
See also
- Stage 07 — Modern LLMs — what’s been added since 2017
- Stage 10 — Fine-tuning — how we adapt transformers