The math behind attention and embeddings
Drag two vectors. Watch their dot product, angle, and projection update in real time. Same operation that drives cosine similarity, attention scores, and every embedding-based retrieval system.
Why two formulas, one number?
a · b can be computed two ways:
- Algebraically: aₓ·bₓ + aᵧ·bᵧ — fast; what computers do.
- Geometrically: ‖a‖·‖b‖·cos θ — what it MEANS.
The geometric form is the lesson: dot product is how aligned the vectors are, scaled by their lengths. Two vectors pointing the same way → big positive dot. Perpendicular → zero. Opposite → big negative.
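The two formulas really do produce the same number. A minimal sketch, with arbitrary example vectors, computing a · b both ways:

```python
import math

# Two routes to the same dot product (2D example vectors).
a = (3.0, 1.0)
b = (1.0, 2.0)

# Algebraic form: aₓ·bₓ + aᵧ·bᵧ
algebraic = a[0] * b[0] + a[1] * b[1]

# Geometric form: ‖a‖·‖b‖·cos θ
norm_a = math.hypot(*a)
norm_b = math.hypot(*b)
theta = math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])
geometric = norm_a * norm_b * math.cos(theta)

# Both land on 3·1 + 1·2 = 5 (up to floating-point rounding).
```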
Try this — predict before you drag
- Drag b until it's perpendicular to a (cos θ = 0). Predict: the dot product collapses to exactly 0, regardless of the vector lengths. This is why orthogonal embeddings have zero similarity — perpendicularity = "no shared direction".
- Now drag b to point opposite a (cos θ = −1). Predict: the dot product becomes a large negative number. For attention, this means the query is "anti-aligned" with the key — before softmax, it gets the lowest score.
- Stretch b to be 3× as long, in the same direction. Predict: the dot product triples, but cos θ stays at 1. That's why production retrieval normalizes embeddings before comparing — you want direction similarity, not magnitude. Cosine similarity drops the magnitude.
- Drag a and b to be parallel and unit-length. Predict: dot ≈ 1.0, cos θ ≈ 1.0. After softmax over many keys, this is the key that wins the most attention probability.
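If you'd rather check the predictions numerically than by dragging, here is a small sketch (example vectors chosen to hit each case exactly):

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def cos_theta(a, b):
    return dot(a, b) / (math.hypot(*a) * math.hypot(*b))

a = (2.0, 0.0)

# Perpendicular: dot is exactly 0, whatever the lengths.
assert dot(a, (0.0, 5.0)) == 0.0

# Opposite: negative dot, cos θ = −1.
assert dot(a, (-3.0, 0.0)) < 0
assert cos_theta(a, (-3.0, 0.0)) == -1.0

# Stretch b to 3× the length, same direction: dot triples, cos θ stays 1.
b = (1.0, 0.0)
assert dot(a, (3.0, 0.0)) == 3 * dot(a, b)
assert cos_theta(a, (3.0, 0.0)) == cos_theta(a, b) == 1.0

# Parallel unit vectors: dot = cos θ = 1.
u = (1.0, 0.0)
assert dot(u, u) == 1.0
```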
From dot product to attention weight
Attention takes a query vector q and N key vectors k₁ … k_N, computes q · kᵢ / √d for each (the dot product, scaled), then softmaxes over the N scores:

scores[i] = q · kᵢ / √d
weights[i] = softmax(scores)[i]
           = exp(scores[i]) / Σⱼ exp(scores[j])
output = Σᵢ weights[i] · vᵢ

Each softmax weight is what you saw in the Attention Inspector. The vectors you're dragging here are the same shape as one query–key pair, just in 2D so you can see them.
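The three equations above can be sketched in a few lines of plain Python. The toy query, keys, and values are assumptions for illustration — one key aligned with q, one perpendicular, one opposite:

```python
import math

def attention(q, keys, values):
    """Single-query scaled dot-product attention, following the equations above."""
    d = len(q)
    # scores[i] = q · kᵢ / √d
    scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d) for k in keys]
    # weights[i] = exp(scores[i]) / Σⱼ exp(scores[j])  (max-subtracted for stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # output = Σᵢ weights[i] · vᵢ
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # aligned, perpendicular, opposite
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, output = attention(q, keys, values)
# The aligned key gets the largest softmax weight; the opposite key the smallest.
```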
Where you'll meet this again
- Cosine similarity — same as dot product after normalizing both vectors to unit length. See Embedding Playground.
- Attention scores — every QKV cell is the dot product of a query vector with a key vector. See Attention Inspector.
- Projection — the orange line. Used in PCA, whitening, and basically all dimensionality reduction.
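Two of these reappearances fit in a few lines: cosine similarity is just the dot product after unit-normalizing both vectors, and the scalar projection of b onto a is ‖b‖·cos θ, the length of the orange line. A sketch with assumed example vectors:

```python
import math

def normalize(v):
    """Scale a 2D vector to unit length."""
    n = math.hypot(*v)
    return (v[0] / n, v[1] / n)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

a, b = (3.0, 0.0), (1.0, 1.0)

# Cosine similarity: magnitude dropped, only direction compared.
cosine_sim = dot(normalize(a), normalize(b))   # cos 45° = 1/√2 ≈ 0.707

# Scalar projection of b onto a: ‖b‖·cos θ = (a · b) / ‖a‖.
proj_len = dot(a, b) / math.hypot(*a)          # 3 / 3 = 1.0
```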
Anchored to 01-math-foundations/linear-algebra.