demo · animated + interactive

How attention reads a sentence

Pick a sentence. The visualization shows GPT-2 small's attention weights — exactly what the model computed when it read that sentence. Scrub through layers and heads to watch attention specialize; click a token to see its top-5 keys.

The math you're seeing

Every cell A[q,k] in the heatmap is the attention weight from query token q to key token k:

A = softmax( Q · Kᵀ / √d )    # row-normalized: each row sums to 1
output[q] = A[q] · V          # weighted sum over value vectors

For GPT-2 small, each token's Q, K, and V vectors are 64-dim per head (12 heads × 64 = 768-dim residual stream). The √d normalization (√64 = 8 here) keeps the dot products from saturating the softmax as d grows. One ingredient the formula above omits: the causal mask sets the upper triangle of the score matrix to −∞ before the softmax, so no token attends forward (see "Causal mask" under Try this).
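
To make those two lines concrete, here is a minimal NumPy sketch of one head's computation, causal mask included. The variable names and toy inputs are illustrative, not the demo's actual code.

    import numpy as np

    def causal_attention(Q, K, V):
        """One head. Q, K, V have shape [tokens, d]. Returns (A, output)."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                        # raw dot products, [tokens, tokens]
        scores[np.triu_indices_from(scores, k=1)] = -np.inf  # causal mask: no attending forward
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A = e / e.sum(axis=-1, keepdims=True)                # softmax: each row sums to 1
        return A, A @ V                                      # weighted sum over value vectors

    rng = np.random.default_rng(0)
    T, d = 6, 64                                             # d = 64 per head, as in GPT-2 small
    Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
    A, out = causal_attention(Q, K, V)
    assert np.allclose(A.sum(axis=-1), 1.0)                  # every row is a probability distribution
    assert A[0, 0] == 1.0                                    # first token can only attend to itself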

Try this

  1. Layer 0 vs layer 11. Pick the coreference sentence ("The trophy doesn't fit…"). At layer 0, attention is mostly diagonal-ish: tokens attend to themselves and their immediate neighbors (positional). At layer 11, click "it"; the top key should be "trophy" or "suitcase", not a nearby word. Predict: which one wins depends on how the sentence ends ("too big" favors the trophy, "too small" the suitcase); check the model's call.
  2. The first column. On almost any sentence, scroll to layers 4–8 and watch how heavily many query tokens attend to position 0 (the BOS-like first token). This is the famous "attention sink" pattern: heads that have nothing useful to do dump their probability mass on position 0. Predict: ≥ 30% mass on column 0 in 1–2 heads per layer (a sketch for checking this offline follows the list).
  3. Causal mask. The upper triangle is always exactly black: no token can attend to a later position. Click the last token; its top-5 keys can come from anywhere in the sentence. Click the first token; its only option is itself. Predict: the first token's attention is always 1.0 on itself, with zero entropy.
  4. Specialization. Same sentence, slide the head from 0 to 11 within a fixed layer. Some heads are diagonal (positional), some are broadcasters (one column dominates: a vertical bar, every query locked onto the same key), and some spread a single query's mass across many keys (one row lit faintly end to end). Predict: by layer 4 you'll see at least 3 distinct patterns across the 12 heads.
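
Predictions 2 and 3 can be checked outside the UI against the real model. A minimal sketch, assuming the Hugging Face transformers and torch packages; the demo's own extraction lives in scripts/precompute_attention.py and may differ in detail:

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tok("The trophy doesn't fit in the suitcase because it's too big.",
              return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_attentions=True)

    # out.attentions is a tuple of 12 tensors, each [batch, heads, tokens, tokens]
    for layer in range(4, 9):                   # layers 4-8, where sinks tend to appear
        A = out.attentions[layer][0]            # [heads, tokens, tokens]
        sink = A[:, 1:, 0].mean(dim=-1)         # per-head mean attention mass on column 0
        n = int((sink >= 0.30).sum())
        print(f"layer {layer}: {n} head(s) put >=30% mass on position 0")

    first_row = out.attentions[0][0, 0, 0]      # layer 0, head 0, query token 0
    print(first_row)                            # [1, 0, 0, ...] -> zero entropy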

What's not yet visible

The Q and K vectors themselves aren't surfaced in the UI yet: you see the result of softmax(QKᵀ/√d), not the dot products that produced it. A future v3 will add a hover panel showing the selected query token's Q vector plus the top-5 K vectors and their dot products, so the entire computation is on screen.

How the data flows

The model's twelve transformer blocks each emit attention tensors of shape [heads × tokens × tokens]. We pre-compute these offline with PyTorch (see scripts/precompute_attention.py) and ship the result as a single ~600 KB JSON the browser fetches on load.
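For concreteness, a minimal sketch of that offline step; the JSON schema here is illustrative, and the real scripts/precompute_attention.py may structure its output differently:

    import json
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    SENTENCES = ["The trophy doesn't fit in the suitcase because it's too big."]

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    payload = []
    for text in SENTENCES:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_attentions=True)
        payload.append({
            "text": text,
            "tokens": tok.convert_ids_to_tokens(ids["input_ids"][0].tolist()),
            # 12 layers of [heads, tokens, tokens]; rounding keeps the JSON small
            "attention": [[[[round(w, 4) for w in row] for row in head]
                           for head in layer[0].tolist()]
                          for layer in out.attentions],
        })

    with open("attention.json", "w") as f:
        json.dump(payload, f)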

Why pre-computed: the standard ONNX export of GPT-2 used by transformers.js doesn't expose attention outputs — only logits and the KV cache. Running PyTorch directly is the cleanest way to extract real attention weights. A future v2 will swap in a custom ONNX export with attention outputs so the demo can run on arbitrary user-typed sentences in the browser.

Anchored to 06-transformers/self-attention-kqv from the learning path.