
The math behind attention and embeddings

Drag two vectors. Watch their dot product, angle, and projection update in real time. Same operation that drives cosine similarity, attention scores, and every embedding-based retrieval system.

Why two formulas, one number?

a · b can be computed two ways:

  • Algebraically: aₓ·bₓ + aᵧ·bᵧ — fast, what computers do.
  • Geometrically: ‖a‖·‖b‖·cos θ — what it MEANS.

The geometric form is the lesson: dot product is how aligned the vectors are, scaled by their lengths. Two vectors pointing the same way → big positive dot. Perpendicular → zero. Opposite → big negative.
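
Both readings are easy to check off-canvas. A minimal plain-Python sketch, with illustrative vectors rather than the demo's defaults:

import math

a = (2.0, 1.0)
b = (1.0, 3.0)

# Algebraic form: multiply components, sum.
dot_alg = a[0] * b[0] + a[1] * b[1]

# Geometric form: ||a|| * ||b|| * cos(theta).
theta = math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])
dot_geo = math.hypot(*a) * math.hypot(*b) * math.cos(theta)

print(dot_alg)             # 5.0
print(round(dot_geo, 10))  # 5.0, same number from either formula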

Try this — predict before you drag

  1. Drag b until it's perpendicular to a (cos θ = 0). Predict: dot product collapses to exactly 0, regardless of the vector lengths. This is why orthogonal embeddings have zero similarity — perpendicularity = "no shared direction".
  2. Now drag b to point opposite a (cos θ = −1). Predict: the dot product hits −‖a‖·‖b‖, as negative as it can get for those lengths. For attention, this means the query is "anti-aligned" with the key — before softmax, it gets the lowest score.
  3. Stretch b to 3× its length without rotating it. Predict: the dot product triples, but cos θ doesn't budge. That's why production retrieval normalizes embeddings before comparing — you want direction similarity, not magnitude. Cosine sim drops the magnitude.
  4. Drag a and b to be parallel and unit-length. Predict: dot ≈ 1.0, cos θ ≈ 1.0. After softmax over many keys, this is the key that wins the most attention probability. (All four predictions are checked in the sketch after this list.)
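
If you'd rather check than drag, here are the same four predictions in plain Python. The vectors are illustrative, not the demo's actual state:

import math

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def cos_theta(u, v):
    return dot(u, v) / (math.hypot(*u) * math.hypot(*v))

a = (1.0, 0.0)

print(dot(a, (0.0, 2.5)))    # 1. perpendicular -> 0.0, regardless of length
print(dot(a, (-2.0, 0.0)))   # 2. opposite -> -2.0 = -||a||*||b||
b = (0.6, 0.8)
b3 = (3 * b[0], 3 * b[1])    # 3. stretch 3x, no rotation
print(dot(a, b), round(dot(a, b3), 6))              # 0.6 -> 1.8 (triples)
print(cos_theta(a, b), round(cos_theta(a, b3), 6))  # 0.6 -> 0.6 (unchanged)
print(dot(a, (1.0, 0.0)))    # 4. parallel, unit length -> 1.0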

From dot product to attention weight

Attention takes a query vector q and N key vectors k₁ … k_N, computes q · kᵢ / √d for each (the dot product, scaled by √d, where d is the vector dimension), then softmaxes over the N scores:

scores[i] = q · kᵢ / √d
weights[i] = softmax(scores)[i]
            = exp(scores[i]) / Σⱼ exp(scores[j])
output     = Σᵢ weights[i] · vᵢ

Each softmax weight is what you saw in the Attention Inspector. The vectors you're dragging here are the same shape as one query–key pair, just in 2D so you can see them.
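
Spelled out as runnable code, the three formulas above look like this. A toy plain-Python sketch of scaled dot-product attention, not the Attention Inspector's actual implementation; the vectors are illustrative:

import math

def attention(q, keys, values):
    d = len(q)
    # scores[i] = q · kᵢ / √d
    scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d) for k in keys]
    # softmax, shifted by the max score for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # output = Σᵢ weights[i] · vᵢ
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # aligned, orthogonal, opposite
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, out = attention(q, keys, values)
print(weights)   # the aligned key takes the largest share of attention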

Where you'll meet this again

  • Cosine similarity — same as dot product after normalizing both vectors to unit length. See Embedding Playground.
  • Attention scores — every QKV cell is the dot product of a query vector with a key vector. See Attention Inspector.
  • Projection — the orange line. Used in PCA, whitening, and basically all dimensionality reduction. (Cosine similarity and projection are both sketched below.)
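
Cosine similarity and the projection are each one line once you have the dot product. A short sketch with illustrative values:

import math

a, b = (3.0, 1.0), (2.0, 2.0)

dot = a[0] * b[0] + a[1] * b[1]
na, nb = math.hypot(*a), math.hypot(*b)

cos_sim = dot / (na * nb)   # normalize, then dot: magnitude drops out
proj_len = dot / nb         # scalar projection of a onto b, the orange line's length

print(round(cos_sim, 4), round(proj_len, 4))   # 0.8944 2.8284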

Anchored to 01-math-foundations/linear-algebra.