Transformers

If you only deeply understand one stage, make it this one. Self-attention, multi-head, positional encoding, the block, then a 300-line GPT in PyTorch. Almost every LLM oddity eventually traces back here.

5 articles
20 min to read
6 demos
3 books
if you only do one thing

Attention + MLP + residual + norm is the central object of modern AI. Implement it in 50 lines, then read every other paper with new eyes.

Articles in this stage

  1. 01 GPT From Scratch
  2. 02 Multi-Head Attention
  3. 03 Positional Encoding
  4. 04 Self-Attention (KQV)
  5. 05 The Transformer Block

Stage 06 — Transformers

The transformer is the architecture every modern AI model rests on. This stage takes you from “what is self-attention?” to “I built GPT in 300 lines of PyTorch.”

If you only deeply understand one stage in this path, make it this one.

Prerequisites

  • Stage 03 (neural networks, MLPs, residuals)
  • Stage 04 (language modeling, why transformers)
  • Stage 05 (tokens, embeddings)

Learning ladder

  1. Self-attention (KQV) — the central mechanism
  2. Multi-head attention — multiple “views” of the input
  3. Positional encoding — sinusoidal, learned, RoPE, ALiBi
  4. The transformer block — attention + MLP + residual + norm
  5. GPT from scratch — minimal PyTorch implementation
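Rung 1 of the ladder can be sketched in a few lines. This is a minimal single-head self-attention sketch (shapes and hyperparameters are illustrative, not from the articles themselves): project the input into queries, keys, and values, score every token against every other, and mix values by the softmaxed scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention. x: (B, T, d_model)."""
    q = x @ w_q                                      # (B, T, d_k)
    k = x @ w_k                                      # (B, T, d_k)
    v = x @ w_v                                      # (B, T, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # (B, T, T) similarity of every token pair
    weights = F.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v                               # (B, T, d_k) weighted mix of values

B, T, d_model, d_k = 2, 5, 16, 8                     # toy sizes for illustration
x = torch.randn(B, T, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([2, 5, 8])
```

Multi-head attention (rung 2) is this same computation run in parallel over several smaller d_k slices, with the outputs concatenated.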

MVU

You can:

  • Draw a transformer block on a whiteboard from memory
  • Explain what softmax(QKᵀ/√d) V actually does — with shapes
  • Distinguish encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) transformers
  • Implement a single transformer block in <50 lines of PyTorch
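As a reference point for the last bullet, here is one way a block can fit in well under 50 lines. This sketch assumes a pre-norm layout (x + Attn(LN(x)), then x + MLP(LN(x))) and leans on `nn.MultiheadAttention` rather than hand-rolled attention; the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: attention + MLP, each wrapped in residual + norm."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(             # the standard 4x expansion MLP
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        a = self.ln1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # residual 1: attention
        x = x + self.mlp(self.ln2(x))                      # residual 2: MLP
        return x

block = Block()
y = block(torch.randn(2, 10, 64))  # (batch, seq, d_model)
print(y.shape)  # torch.Size([2, 10, 64])
```

A decoder-only (GPT-style) block would additionally pass a causal mask to the attention call so tokens cannot attend to the future.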

Exercise

Implement a 6-layer GPT in <300 lines of PyTorch. Train on TinyShakespeare (~1MB). Generate plausibly Shakespearean text. Compare your loss to a public reference (Karpathy’s nanoGPT is the gold standard).
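The scaffolding around the exercise (batching, next-token targets, the train step) looks roughly like this. A bigram embedding table stands in for your GPT, and a hard-coded string stands in for the TinyShakespeare file; both are placeholders to keep the sketch self-contained.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in corpus; in the real exercise, load the ~1MB TinyShakespeare file here.
text = "to be or not to be that is the question "
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text])

block_size = 8
def get_batch(batch_size=4):
    # Random crops of length block_size; targets are the inputs shifted by one.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x, y

# Simplest possible language model: a bigram logit table. Swap in your GPT here.
model = torch.nn.Embedding(len(chars), len(chars))
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)
for step in range(200):
    xb, yb = get_batch()
    logits = model(xb)                                   # (B, T, vocab)
    loss = F.cross_entropy(logits.view(-1, len(chars)), yb.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(round(loss.item(), 2))
```

The same loop drives the full exercise unchanged; only the model grows from a lookup table to a 6-layer transformer.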

Why this stage matters

You can build a lot with prompting + RAG + APIs without ever writing attention from scratch. You’ll be a more dangerous engineer if you do. Almost every weird LLM behavior eventually traces to something in this stage — context windows, attention sinks, KV caching, positional encoding limits.

Hands-on companions

The transformer is the one stage on this site with a full hands-on companion track. After the theory here, build the same architecture from scratch.

If you’re shipping something and only need to know how attention works operationally, /ship/14 — cost and latency covers KV cache, prefix caching, and quantization without making you implement attention yourself.

Further reading

Books move slower than papers in this field; treat these as foundations, not replacements for the latest research.

  1. Natural Language Processing with Transformers

     Lewis Tunstall, Leandro von Werra, Thomas Wolf. O'Reilly, Revised ed., 2023.

     The Hugging Face book. Architecture-first treatment with working code.

  2. Hands-On Large Language Models

     Jay Alammar, Maarten Grootendorst. O'Reilly, 2024.

     Visual and practical, bringing Alammar's classic Illustrated Transformer diagrams into book form.