demo
Twelve heads, twelve specialists
Each layer of GPT-2 has 12 attention heads running in parallel. They aren't redundant. Each one learned to look at a different thing — diagonals, broadcasters, previous-token trackers, semantic roles. Pick a sentence, slide a layer, and see all twelve at once.
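The mechanics behind those twelve parallel maps can be sketched in a few lines. This toy version uses random projections in place of GPT-2's learned weights — only the 768-dim / 12-head split matches the real model; everything else is illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_maps(x, n_heads=12):
    """Toy multi-head self-attention: random Q/K projections stand in
    for learned weights. Returns one (T, T) attention map per head."""
    T, d_model = x.shape
    d_head = d_model // n_heads          # 768 // 12 = 64 in GPT-2
    rng = np.random.default_rng(0)
    maps = []
    for _ in range(n_heads):
        Wq = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
        Wk = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
        scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(d_head)
        scores += np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
        maps.append(softmax(scores))
    return np.stack(maps)                # (n_heads, T, T)

x = np.random.default_rng(1).normal(size=(8, 768))  # 8 fake token embeddings
attn = multi_head_attention_maps(x)
print(attn.shape)   # (12, 8, 8): 12 independent maps over the same 8 tokens
```

Each head gets its own low-dimensional view of the same input, which is why twelve different patterns can coexist in one layer.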
What you'll see
- Diagonal-leaning heads. A bright diagonal line means the head attends mostly to the current token — close to an identity pass.
- Previous-token heads. A bright diagonal shifted one column left of the main one: the head's job is "what was the last word?" — useful for syntactic chaining.
- Broadcasters. A bright vertical stripe in the first column means every query is dumping mass on the very first token. Often a "null" or rest position — the "attention sink" phenomenon from the interpretability literature.
- Semantic specialists. Bright cells far from the diagonal that line up with meaning, not position. Try the coreference example at higher layers — you'll see "she" reach across the sentence to "Alice".
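Each of these shapes is really a statement about where a head's attention mass concentrates, which you can measure directly. A minimal sketch, assuming `A` is one head's causal attention map of shape (T, T) with rows summing to 1 (the metric names are mine, not the demo's):

```python
import numpy as np

def shape_stats(A):
    """Fraction of attention mass on: the current token (main diagonal),
    the previous token (off-by-one diagonal), and the first token
    (column 0 — the broadcaster / sink signature)."""
    T = A.shape[0]
    return {
        "diagonal": np.trace(A) / T,
        "previous": np.trace(A, offset=-1) / max(T - 1, 1),
        "sink":     A[:, 0].mean(),
    }

# A head that dumps everything on the first token scores sink = 1.0.
A = np.zeros((6, 6))
A[:, 0] = 1.0
stats = shape_stats(A)
print(stats)
```

Semantic specialists are the heads that score low on all three of these positional metrics while still having sharply concentrated rows.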
Try this — predict before you click
- Pick the coreference sentence ("The trophy doesn't fit…"). Slide layer from 0 to 11. Predict: at layer 0, all 12 heads look diagonal-ish (positional). By layer 11, you should see at least 3 distinct shapes — a broadcaster (column 0 dominates), a previous-token head (off-by-one diagonal), and a long-range semantic head.
- At layer 11, find the head that lights up cells far from the diagonal. Click it for the focused view. Predict: the bright cells line up with meaningful word pairs ("it" → "trophy" or "suitcase"), not nearby words.
- Compare layer 0 head 0 with layer 11 head 7 on the same sentence. Predict: layer 0 looks almost trivial (positional); layer 11 looks intricate. Specialization tends to deepen with layer index (early layers do mechanics, late layers do semantics), but that's a trend, not a strict rule.
- Across all 12 heads in any layer, count how many are pure broadcasters (column 0 takes most of the mass). Predict: 1–3 per layer. These are "attention sinks" — heads that have nothing useful to do for some tokens, so they dump probability on position 0.
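The broadcaster count in the last step can be checked mechanically. A sketch assuming `layer_attn` stacks one layer's 12 maps into shape (12, T, T); the 0.5 threshold is an assumption of mine, not the demo's rule:

```python
import numpy as np

def count_broadcasters(layer_attn, threshold=0.5):
    """A head counts as a broadcaster if, averaged over all queries,
    more than `threshold` of its attention mass lands on token 0."""
    col0 = layer_attn[:, :, 0].mean(axis=1)   # per-head mean mass on column 0
    return int((col0 > threshold).sum())

# Synthetic layer: heads 0 and 5 dump everything on token 0, rest are uniform.
H, T = 12, 8
layer = np.full((H, T, T), 1.0 / T)
for h in (0, 5):
    layer[h] = 0.0
    layer[h, :, 0] = 1.0
print(count_broadcasters(layer))   # 2
```

If your count on real layers lands well outside the predicted 1–3, the interesting question is which layer breaks the pattern and why.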
The pattern hint
The small label on each thumbnail (broadcaster, diagonal-leaning, etc.) is a one-line heuristic classification based on where attention mass concentrates. It's a starting point, not a verdict — most heads are mixtures.
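A minimal version of such a heuristic labeler might look like this — the thresholds and label names here are illustrative guesses, not the demo's actual rules:

```python
import numpy as np

def label_head(A, threshold=0.4):
    """Pick the strongest concentration signature, falling back to
    'mixed' — which, as noted above, is what most heads really are."""
    T = A.shape[0]
    scores = {
        "diagonal-leaning": np.trace(A) / T,
        "previous-token":   np.trace(A, offset=-1) / max(T - 1, 1),
        "broadcaster":      A[:, 0].mean(),
    }
    name, best = max(scores.items(), key=lambda kv: kv[1])
    return name if best > threshold else "mixed"

print(label_head(np.eye(5)))              # diagonal-leaning
print(label_head(np.full((5, 5), 0.2)))   # mixed (no signature dominates)
```

The `mixed` fallback is doing real work: a single winning score hides how much mass the other patterns captured, which is exactly why the label is a hint and not a taxonomy.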
Why this matters
Understanding head specialization is an entry point to mechanistic interpretability. If you can name what a head does, you can sometimes steer the model by intervening on it. Anthropic's circuits research, Neel Nanda's TransformerLens, and much of the interpretability field start here: looking at a head and asking "what shape did this learn?"
Anchored to 06-transformers/multi-head-attention
from the learning path. Reuses the data that powers the Attention Inspector.