Stage 06 — Transformers

The transformer is the architecture nearly every modern AI model rests on. This stage takes you from “what is self-attention?” to “I built GPT in 300 lines of PyTorch.”

If you only deeply understand one stage in this path, make it this one.

Prerequisites

  • Stage 03 (neural networks, MLPs, residuals)
  • Stage 04 (language modeling, why transformers)
  • Stage 05 (tokens, embeddings)

Learning ladder

  1. Self-attention (KQV) — the central mechanism (sketched after this list)
  2. Multi-head attention — multiple “views” of the input
  3. Positional encoding — sinusoidal, learned, RoPE, ALiBi (sinusoidal version sketched below)
  4. The transformer block — attention + MLP + residual + norm
  5. GPT from scratch — minimal PyTorch implementation
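
For ladder item 1, here is a minimal sketch of single-head causal self-attention with the shapes written out. The function and weight names are illustrative, not a prescribed API; a full implementation folds this into a multi-head module and learns the projections as nn.Linear layers.

    import math
    import torch
    import torch.nn.functional as F

    def causal_self_attention(x, w_q, w_k, w_v):
        """Single-head causal self-attention sketch.

        x:             (B, T, d_model) batch of T token embeddings
        w_q, w_k, w_v: (d_model, d_k) projection matrices
        """
        q = x @ w_q                                          # (B, T, d_k)
        k = x @ w_k                                          # (B, T, d_k)
        v = x @ w_v                                          # (B, T, d_k)
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (B, T, T): every query scored against every key
        T = x.size(1)
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float("-inf"))  # block attention to future positions
        weights = F.softmax(scores, dim=-1)                  # (B, T, T): each row sums to 1
        return weights @ v                                   # (B, T, d_k): weighted mix of value vectors

    # Shape check with made-up sizes:
    x = torch.randn(2, 8, 32)                                # batch 2, 8 tokens, d_model 32
    w_q, w_k, w_v = (torch.randn(32, 16) for _ in range(3))
    print(causal_self_attention(x, w_q, w_k, w_v).shape)     # torch.Size([2, 8, 16])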
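
For ladder item 3, the original sinusoidal encoding is small enough to sketch as well (learned embeddings, RoPE, and ALiBi are covered inside the stage). The function name is an assumption and d_model is assumed to be even.

    import math
    import torch

    def sinusoidal_positions(seq_len, d_model):
        """Fixed sinusoidal position encodings; the result is added to the token embeddings."""
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (T, 1)
        dim = torch.arange(0, d_model, 2, dtype=torch.float32)          # even dimensions 0, 2, 4, ...
        freq = torch.exp(-math.log(10000.0) * dim / d_model)            # (d_model/2,) decreasing frequencies
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * freq)                             # sine on even dims
        pe[:, 1::2] = torch.cos(pos * freq)                             # cosine on odd dims
        return pe                                                       # (T, d_model)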

MVU

You can:

  • Draw a transformer block on a whiteboard from memory
  • Explain what softmax(QKᵀ/√d) V actually does — with shapes
  • Distinguish encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) transformers
  • Implement a single transformer block in <50 lines of PyTorch (a sketch follows this list)
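
A rough sketch of what that last item asks for, assuming a pre-norm, GPT-style decoder block. It leans on torch.nn.MultiheadAttention to stay short; in the from-scratch exercise you would write the attention yourself. The class name and default sizes are illustrative.

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """Pre-norm decoder block: x + attn(norm(x)), then x + mlp(norm(x))."""

        def __init__(self, d_model=384, n_heads=6):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),   # widen
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),   # project back
            )

        def forward(self, x):                      # x: (B, T, d_model)
            T = x.size(1)
            # True above the diagonal = "may not attend to future positions"
            causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
            h = self.ln1(x)
            attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
            x = x + attn_out                       # residual around attention
            x = x + self.mlp(self.ln2(x))          # residual around MLP
            return x

Stacking a handful of these between a token-plus-position embedding layer and a final LayerNorm plus linear head is, structurally, all a decoder-only GPT is.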

Exercise

Implement a 6-layer GPT in <300 lines of PyTorch. Train on TinyShakespeare (~1MB). Generate plausibly Shakespearean text. Compare your loss to a public reference (Karpathy’s nanoGPT is the gold standard).
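
If you want a starting point for scale, something in this neighborhood trains quickly on a single GPU. Every number below is an assumption to get you moving rather than a tuned recipe (nanoGPT’s published configs are the better reference), and train_step assumes your model maps (B, T) token ids to (B, T, vocab_size) logits.

    import torch.nn.functional as F

    # Assumed starting hyperparameters for a character-level TinyShakespeare run.
    config = dict(
        n_layers=6,
        n_heads=6,
        d_model=384,
        block_size=256,    # context length
        batch_size=64,
        lr=3e-4,
        max_steps=5000,
    )

    def train_step(model, optimizer, xb, yb):
        """One next-token-prediction step. xb, yb are (batch, block_size) integer
        tensors, with yb equal to xb shifted left by one position."""
        logits = model(xb)                                    # (B, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        return loss.item()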

Why this stage matters

You can build a lot with prompting + RAG + APIs without ever writing attention from scratch. You’ll be a more dangerous engineer once you have. Almost every weird LLM behavior eventually traces back to something in this stage — context windows, attention sinks, KV caching, positional encoding limits.

Hands-on companions

The transformer is the one stage on this site with a full hands-on companion track. After the theory here, build the same architecture from scratch in that track.

If you’re shipping something and only need to know how attention works operationally, /ship/14 — cost and latency covers KV cache, prefix caching, and quantization without making you implement attention yourself.

See also