
Twelve heads, twelve specialists

Each layer of GPT-2 has 12 attention heads running in parallel. They aren't redundant. Each one learned to look at a different thing — diagonals, broadcasters, previous-token trackers, semantic roles. Pick a sentence, slide a layer, and see all twelve at once.
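If you want to poke at the same maps outside the demo, here is a minimal sketch of pulling per-head attention out of GPT-2 with Hugging Face transformers. The "gpt2" checkpoint name and the example sentence are assumptions, not necessarily the demo's exact setup:

```python
# Minimal sketch: extract per-head attention maps from GPT-2.
# Assumes the small "gpt2" checkpoint (12 layers x 12 heads); the demo's
# own pipeline may differ.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

sentence = "The trophy doesn't fit in the suitcase because it is too big."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of 12 tensors, one per layer,
# each shaped [batch, heads, seq, seq] = [1, 12, seq, seq].
layer = 11
attn = outputs.attentions[layer][0]  # [12, seq, seq] for one layer
print(attn.shape)
```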

What you'll see

  • Diagonal-leaning heads. A bright diagonal line means the head pays attention to the current token only — an identity pass.
  • Previous-token heads. A bright diagonal shifted one column left of the main diagonal: the head's job is "what was the last word?", useful for syntactic chaining.
  • Broadcasters. A bright vertical stripe in the first column means every query is dumping mass on the very first token. Often a "null" or rest position; the interpretability literature calls this an attention sink.
  • Semantic specialists. Bright cells far from the diagonal that line up with meaning, not position. Try the coreference example at higher layers and watch "it" reach across the sentence toward "trophy" and "suitcase"; the plotting sketch after this list lets you reproduce these maps yourself.
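To reproduce the twelve-at-once grid view, here is a short matplotlib sketch. It reuses the `attn` tensor from the extraction snippet above; the layout and colormap are arbitrary choices:

```python
# Sketch of the grid view: all 12 heads of one layer, side by side.
# Reuses `attn` ([12, seq, seq]) from the extraction snippet above.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 4, figsize=(12, 9))
for head, ax in enumerate(axes.flat):
    ax.imshow(attn[head], cmap="viridis")  # rows = queries, cols = keys
    ax.set_title(f"head {head}", fontsize=9)
    ax.set_xticks([])
    ax.set_yticks([])
plt.tight_layout()
plt.show()
```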

Try this — predict before you click

  1. Pick the coreference sentence ("The trophy doesn't fit…"). Slide layer from 0 to 11. Predict: at layer 0, all 12 heads look diagonal-ish (positional). By layer 11, you should see at least 3 distinct shapes — a broadcaster (column 0 dominates), a previous-token head (off-by-one diagonal), and a long-range semantic head.
  2. At layer 11, find the head that lights up cells far from the diagonal. Click it for the focused view. Predict: the bright cells line up with meaningful word pairs ("it" → "trophy" or "suitcase"), not nearby words.
  3. Compare layer 0 head 0 with layer 11 head 7 on the same sentence. Predict: layer 0 looks almost trivial (positional); layer 11 looks intricate. Specialization tends to deepen with depth: early layers do mechanics, late layers do semantics.
  4. Across all 12 heads in any layer, count how many are pure broadcasters (column 0 takes most of the mass). Predict: 1–3 per layer. These are "attention sinks": heads that have nothing useful to do for some tokens, so they dump probability on position 0. The snippet after this list automates the count.
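For step 4, here is a sketch of the count, reusing `outputs.attentions` from the first snippet. The 0.5 cutoff for "most of the mass" is an assumption, not the demo's rule:

```python
# Sketch: per layer, count heads whose average mass on key position 0
# exceeds a cutoff. The 0.5 threshold is an assumption.
import torch

first_col_share = torch.stack(
    [a[0, :, :, 0].mean(dim=-1) for a in outputs.attentions]
)  # [layers, heads]: mean over queries of mass placed on token 0
broadcasters = (first_col_share > 0.5).sum(dim=-1)
for layer, count in enumerate(broadcasters.tolist()):
    print(f"layer {layer:2d}: {count} broadcaster head(s)")
```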

The pattern hint

The small label on each thumbnail (broadcaster, diagonal-leaning, etc.) is a one-line heuristic classification based on where mass concentrates. Treat it as a starting point rather than a verdict: most heads are mixtures of several patterns.
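For a flavor of what such a heuristic might look like, here is one plausible version. The thresholds, scoring, and label names are illustrative guesses, not the demo's actual classifier:

```python
# One plausible thumbnail-label heuristic: score where a head's mass
# concentrates and pick the dominant pattern. Thresholds and labels are
# guesses, not the demo's actual rules.
import torch

def classify_head(A: torch.Tensor, thresh: float = 0.4) -> str:
    """A: [seq, seq] attention map, rows = queries, cols = keys."""
    idx = torch.arange(A.shape[0])
    scores = {
        "diagonal-leaning": A[idx, idx].mean(),         # current token
        "previous-token": A[idx[1:], idx[:-1]].mean(),  # token before
        "broadcaster": A[:, 0].mean(),                  # first column
    }
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return label if score > thresh else "mixed"

for head in range(attn.shape[0]):  # `attn` from the extraction snippet
    print(f"head {head:2d}: {classify_head(attn[head])}")
```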

Why this matters

Understanding head specialization is the entry point to mechanistic interpretability. If you can name what a head does, you can sometimes steer the model by intervening on it. Anthropic's circuits research, Neel Nanda's TransformerLens, and the entire field of interpretability start here: looking at a head and asking "what shape did this learn?"
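As a taste of that loop, here is a hedged TransformerLens sketch that zeroes out one head's output and checks whether the next-token prediction moves. The layer/head pair (11, 7) echoes the comparison above and is purely illustrative, not a known circuit:

```python
# Sketch: ablate one attention head with TransformerLens and compare the
# model's next-token guess with and without it.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens(
    "The trophy doesn't fit in the suitcase because it is too"
)

LAYER, HEAD = 11, 7  # illustrative choice

def ablate_head(z, hook):
    # z: [batch, pos, head_index, d_head]; zero the chosen head's output
    z[:, :, HEAD, :] = 0.0
    return z

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", ablate_head)]
)

for name, logits in [("clean", clean_logits), ("ablated", ablated_logits)]:
    top = logits[0, -1].argmax().item()
    print(f"{name:8s} -> {model.tokenizer.decode([top])!r}")
```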

Anchored to 06-transformers/multi-head-attention from the learning path. Reuses the data that powers the Attention Inspector.