demo · animated + interactive
How attention reads a sentence
Pick a sentence. The visualization shows GPT-2 small's attention weights — exactly what the model computed when it read that sentence. Scrub through layers and heads to watch attention specialize; click a token to see its top-5 keys.
The math you're seeing
Every cell A[q,k] in the heatmap is the attention weight from query token q to key token k:
A = softmax( Q · Kᵀ / √d )   # row-normalized, each row sums to 1
output[q] = A[q] · V         # weighted sum over value vectors
For GPT-2 small: each token's Q, K, V vectors are 64-dim per head (12 heads × 64 = 768-dim residual stream). The √d scaling keeps the dot products from saturating the softmax as d grows.
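The two formula lines above fit in a few lines of code. A minimal NumPy sketch of single-head causal attention (function name and shapes are illustrative, not taken from the demo's code):

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: [tokens, d] arrays (d = 64 per head in GPT-2 small).
    Returns (A, output): A is [tokens, tokens]; each row sums to 1.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # [tokens, tokens]
    # Causal mask: query q may not attend to keys k > q (upper triangle).
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf
    # Row-wise softmax (numerically stable: subtract each row's max first).
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A, A @ V
```

Note that `A[0]` is forced to put all its weight on position 0 — the first token's only legal key is itself, which is exactly the zero-entropy prediction in the causal-mask experiment below.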
Try this
- Layer 0 vs layer 11. Pick the coreference sentence ("The trophy doesn't fit…"). At layer 0, attention is mostly diagonal-ish — tokens attending to themselves and immediate neighbors (positional). At layer 11, click "it" — the top key should be "trophy" or "suitcase", not a nearby word. Predict which one should win from the sentence's grammar, then check the model's call.
- The first column. On almost any sentence, scroll to layer 4–8 and watch how heavily many query tokens attend to position 0 (the BOS-like first token). This is the famous "attention sink" pattern — heads that have nothing useful to do dump their probability mass on position 0. Predict: ≥ 30% mass on column 0 in 1–2 heads per layer.
- Causal mask. The upper triangle is always exactly black. Try clicking the last token; its top-5 keys span the whole sentence. Click the first token; its only option is itself. Predict: the first token's attention is always 1.0 on itself, with zero entropy.
- Specialization. Same sentence, slide the head from 0 to 11 in a fixed layer. Some heads are diagonal (positional), some are broadcasters (one column dominates), some are vertical bars (one row dominates). Predict: by layer 4 you'll see at least 3 distinct patterns across the 12 heads.
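The attention-sink prediction above can be checked quantitatively once you have a layer's attention tensor. A small sketch, assuming a NumPy array shaped like the demo's per-layer data (`sink_heads` and the 30% threshold are illustrative, not part of the demo):

```python
import numpy as np

def sink_heads(A, threshold=0.30):
    """Return indices of "attention sink" heads in one layer.

    A: [heads, tokens, tokens]; each row A[h, q] sums to 1.
    A head is flagged when its mean attention mass on key position 0
    (averaged over queries after the first, which has no other choice)
    meets the threshold.
    """
    col0 = A[:, 1:, 0].mean(axis=-1)   # mean mass on column 0, per head
    return np.flatnonzero(col0 >= threshold)
```

The first query row is excluded on purpose: under the causal mask it always places 100% of its mass on position 0, which would inflate every head's score.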
What's not yet visible
The Q and K vectors themselves aren't surfaced in the UI yet — you see the result of softmax(QKᵀ/√d), not the dot products that produced it. Future v3 will add a hover panel showing the selected query token's Q-vector + the top-5 K-vectors and their dot products, so the entire computation is on screen.
How the data flows
Each of the model's twelve transformer blocks emits an attention tensor of shape [heads × tokens × tokens]. We pre-compute these offline with PyTorch (see scripts/precompute_attention.py) and ship the result as a single ~600 KB JSON file the browser fetches on load.
Why pre-computed: the standard ONNX export of GPT-2 used by transformers.js doesn't expose attention outputs — only logits and the KV cache. Running PyTorch directly is the cleanest way to extract real attention weights. A future v2 will swap in a custom ONNX export with attention outputs so the demo can run on arbitrary user-typed sentences in the browser.
Anchored to 06-transformers/self-attention-kqv from the learning path.