
"This token is third from the start"

Self-attention is permutation-invariant — without help, it can't tell "dog bites man" from "man bites dog". Position encoding is the trick that fixes it. Four schemes; same problem; very different solutions.
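
You can check the invariance directly. A minimal NumPy sketch (single head, no mask; the names are illustrative): shuffle the input tokens and the attention outputs shuffle identically, so nothing in the mechanism can distinguish position.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(x, wq, wk, wv):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
    return w @ v

d = 8
x = rng.normal(size=(5, d))                     # 5 tokens, no order information
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(5)
# Shuffling the tokens just shuffles the outputs the same way:
assert np.allclose(attention(x, wq, wk, wv)[perm],
                   attention(x[perm], wq, wk, wv))
```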

The four schemes at a glance

  • Sinusoidal (Vaswani et al. 2017) — fixed sin/cos waves at many frequencies. Added directly to the embedding. The original Transformer used this; it has the elegant property of encoding relative positions linearly: a fixed rotation, depending only on the offset k, maps PE(pos) to PE(pos+k) (first sketch after this list).
  • Learned (GPT-2, GPT-3) — just a parameter table. Position 0 has its own vector; position 1 has its own vector; etc. Easiest to reason about, but the model only knows positions it saw during training (second sketch below).
  • RoPE (Su et al. 2021) — rotates each pair of Q/K dimensions by an angle proportional to position, so the Q·K dot product depends only on relative position. LLaMA, GPT-NeoX, Qwen, and most modern open-weights models use this (third sketch below).
  • ALiBi (Press et al. 2022) — no embedding change at all; just adds a per-head linear bias to the attention scores: closer keys are favored, farther ones penalized. The model never has to represent absolute position; it just has a built-in bias toward "look nearby" (fourth sketch below).

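A sketch of the sinusoidal table in NumPy, using the standard Vaswani et al. parameterization (base 10000; the demo may order frequencies differently). The assert is the "relative positions linearly" property: a fixed 2x2 rotation, whose angle depends only on the offset k, carries each sin/cos pair at position pos to the pair at pos + k.

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]
    freq = 1.0 / 10000 ** (np.arange(0, d, 2) / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

d = 64
pe = sinusoidal_pe(128, d)

k, i = 7, 3                                    # offset 7, frequency index 3
theta = k / 10000 ** (2 * i / d)               # rotation angle depends only on k
rot = np.array([[ np.cos(theta), -np.sin(theta)],
                [ np.sin(theta),  np.cos(theta)]])
# The same fixed map carries any position pos to pos + 7:
assert np.allclose(pe[30, 2*i:2*i+2] @ rot, pe[37, 2*i:2*i+2])
```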
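The learned table in the same style (a sketch; the rows are left at random init, which is exactly what the demo visualizes, whereas a trained model would have learned them):

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d = 1024, 64

pos_table = 0.02 * rng.normal(size=(max_len, d))   # one trainable row per position

tokens = rng.normal(size=(300, d))                 # a 300-token sequence
x = tokens + pos_table[:300]                       # position p gets row p, nothing more

# Beyond the table there is no signal at all:
# pos_table[1024] raises IndexError -- the 1025th position simply has no row.
```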
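RoPE in miniature (a sketch of the rotation itself, not any particular library's implementation): rotate each (even, odd) dimension pair of q and k by position times a per-pair angle rate, and the dot product afterwards depends only on the gap between the two positions.

```python
import numpy as np

def rope(x, pos, d):
    """Rotate each (even, odd) pair of x by pos * theta_i."""
    theta = 1.0 / 10000 ** (np.arange(0, d, 2) / d)  # one rate per pair; index 0 is fastest
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
d = 64
q, k = rng.normal(size=d), rng.normal(size=d)

# Same gap (3), very different absolute positions -- identical score:
assert np.allclose(rope(q, 100, d) @ rope(k, 97, d),
                   rope(q, 10, d) @ rope(k, 7, d))
```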
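And ALiBi (a sketch; the slopes follow the paper's geometric sequence for power-of-two head counts): q and k are untouched, and the only positional signal is a per-head linear penalty added to the pre-softmax scores.

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    """Per-head bias matrix added to attention scores before softmax."""
    # Geometric sequence of slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    dist = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    bias = -slopes[:, None, None] * dist           # farther back = bigger penalty
    bias[:, dist < 0] = -np.inf                    # causal mask: no attending ahead
    return bias                                    # shape: (n_heads, seq_len, seq_len)

bias = alibi_bias(seq_len=8, n_heads=4)
print(bias[0, 7])   # steepest head: penalty grows fast, sees only nearby keys
print(bias[3, 7])   # gentlest head: near-flat penalty, sees far back
```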
Why this matters

Position encoding is one of the great underrated levers in transformer design. It's the difference between a model that can extrapolate to 1M tokens (RoPE with NTK or YaRN scaling) and one that breaks one token past its training window (a vanilla learned table trained to 2048 positions has nothing for token 2049). It's why modern long-context models all use rotary or bias schemes.
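
Those scaling tricks mostly amount to retuning RoPE's frequency base. A minimal sketch of the common NTK-aware rule, assuming the widely circulated d/(d-2) exponent (YaRN refines this further; the function name is illustrative):

```python
def ntk_scaled_base(base=10000.0, scale=4.0, d=128):
    """Stretch the RoPE base so low frequencies cover `scale` times the
    context while the highest frequencies stay almost unchanged."""
    return base * scale ** (d / (d - 2))

# New per-pair rates: theta_i = 1 / ntk_scaled_base() ** (2i/d)
print(ntk_scaled_base())   # ~40890: roughly 4x the original base of 10000
```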

Try this — predict before you click

  1. Sinusoidal: Move seq length from 32 to 128. Predict: the leftmost columns (low frequency) keep waving slowly without breaking — they'd extrapolate cleanly. Right columns oscillate so fast at high positions they look like noise. That separation is what lets sinusoidal encode "near vs far".
  2. Learned: The pattern is just seeded random noise — because that's what an untrained learned table looks like. Predict: increase seq length past your "training" length and the model has no signal there at all. This is why GPT-2 caps out at 1024 tokens — its position table simply has no row for token 1025.
  3. RoPE: watch the position cursor scrub. The slow dim-pairs (left) take many positions to complete a cycle; the fast pairs (right) wrap many times. Predict: the "Effect on attention" plot shows cos(angle) oscillating toward zero with distance for high-frequency pairs while staying near 1 for low-frequency pairs. That heterogeneity is what gives RoPE its long-range robustness with NTK scaling (the sketch after this list reproduces the curve).
  4. ALiBi: increase head count to 8. Predict: each head has a different slope for the position penalty (geometric sequence). The model gets a bouquet of "view scales" — some heads only see nearby tokens (steep slope), some see distant ones (gentle slope) — for free, no embedding parameters needed.

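For step 3, the shape of that "Effect on attention" plot can be reproduced offline (a sketch under the standard base-10000 parameterization; note index 0 here is the fastest pair, while the demo draws the slow pairs on the left). With q equal to k, each dim pair contributes cos of the relative angle to the score:

```python
import numpy as np

d = 64
theta = 1.0 / 10000 ** (np.arange(0, d, 2) / d)   # one rate per pair; index 0 is fastest
dist = np.arange(512)[:, None]                    # relative distance m

curve = np.cos(dist * theta)                      # per-pair contribution cos(m * theta_i)

print(curve[256, 0].round(3))    # fastest pair: oscillating, ~0 on average by now
print(curve[256, -1].round(3))   # slowest pair: still ~0.999 at distance 256
```
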
Anchored to 06-transformers/positional-encoding from the learning path.