"This token is third from the start"
Self-attention is permutation-invariant — without help, it can't
tell "dog bites man" from "man bites dog".
Position encoding is the trick that fixes it. Four schemes; same
problem; very different solutions.
The four schemes at a glance
- Sinusoidal (Vaswani et al. 2017) — fixed sin/cos at many frequencies, added directly to the embedding. The original Transformer used this; its elegant property is that any fixed offset corresponds to a linear transformation (a rotation) of the encoding, so relative positions are easy for the model to express.
- Learned (GPT-2, GPT-3) — just a parameter table. Position 0 has its own vector; position 1 has its own vector; etc. Easiest to reason about, but the model only knows positions it saw during training.
- RoPE (Su et al. 2021) — rotates each pair of Q/K dimensions by an angle proportional to position. After the dot product, the result depends only on relative position. LLaMA, GPT-NeoX, Qwen, and most modern open-weights models use this.
- ALiBi (Press et al. 2022) — no embedding change at all; just adds a per-head linear bias to attention scores: closer keys are favored, farther ones penalized. The model never has to represent absolute position at all; it just has a built-in gradient toward "look nearby".
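The sinusoidal table is simple enough to build directly. A minimal numpy sketch — the base constant 10000 and the interleaved sin/cos layout follow the original Transformer; the function name and shapes here are illustrative, not the demo's actual source:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)
# column 0 oscillates with period 2*pi positions; the last columns
# drift so slowly they barely change over 128 positions
```

Because the table is a fixed function of position, it can be evaluated at any length — that is the extrapolation story in the bullet above.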
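RoPE's key property — that the dot product depends only on relative position — can be verified numerically. A hedged sketch (the pairwise-rotation form follows Su et al.; names and the vector size are illustrative):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Rotate each dimension pair (2i, 2i+1) of x by pos * theta_i.
    d = x.shape[0]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# Same offset (2), very different absolute positions:
s1 = rope_rotate(q, 5) @ rope_rotate(k, 3)
s2 = rope_rotate(q, 100) @ rope_rotate(k, 98)
# s1 and s2 agree up to float error: the score sees only the offset
```

This works because a rotation applied to both vectors cancels in the dot product: R(a)q · R(b)k = q · R(b-a)k, pair by pair.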
Why this matters
Position encoding is one of the great underrated levers in transformer design. It's the difference between a model that can extrapolate to 1M tokens (RoPE with NTK-aware / YaRN scaling) and one that breaks at 2,049 tokens (a vanilla learned table trained to length 2,048). It's why modern long-context models all use rotary or bias schemes.
Try this — predict before you click
- Sinusoidal: Move seq length from 32 to 128. Predict: the leftmost columns (low frequency) keep waving slowly without breaking — they'd extrapolate cleanly. Right columns oscillate so fast at high positions they look like noise. That separation is what lets sinusoidal encode "near vs far".
- Learned: The pattern is just seeded random noise — because that's what an untrained Learned table looks like. Predict: increase seq length past your "training" length and the model has no signal there at all. This is why GPT-2 breaks past 1024 tokens — its table simply has no row for position 1,025.
- RoPE: watch the position cursor scrub. The slow dim-pairs (left) take many positions to complete a cycle; the fast pairs (right) wrap many times. Predict: the "Effect on attention" plot shows cos(angle) decorrelating quickly with distance for high-frequency pairs but staying near 1 for low-frequency pairs. That heterogeneity is what gives RoPE its long-range robustness with NTK scaling.
- ALiBi: increase head count to 8. Predict: each head has a different slope for the position penalty (geometric sequence). The model gets a bouquet of "view scales" — some heads only see nearby tokens (steep slope), some see distant ones (gentle slope) — for free, no embedding parameters needed.
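The geometric slope schedule in the ALiBi bullet can be checked directly. A minimal sketch — this symmetric-distance variant is an assumption for illustration (the paper applies the penalty causally, to keys at or before the query), and the function names are made up here:

```python
import numpy as np

def alibi_slopes(n_heads):
    # Press et al.'s geometric sequence for power-of-two head counts:
    # 2^(-8/n), 2^(-16/n), ..., 2^(-8). For n=8: 1/2, 1/4, ..., 1/256.
    ratio = 2.0 ** (-8.0 / n_heads)
    return ratio ** np.arange(1, n_heads + 1)

def alibi_bias(seq_len, n_heads):
    # Per-head linear penalty on token distance, added to raw attention
    # scores before softmax. No learned parameters anywhere.
    slopes = alibi_slopes(n_heads)              # (H,)
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])  # (L, L)
    return -slopes[:, None, None] * dist[None, :, :]  # (H, L, L)

bias = alibi_bias(16, 8)
# head 0 (slope 1/2) punishes distance hard -> sees only nearby tokens;
# head 7 (slope 1/256) barely cares -> sees far tokens
```

The "bouquet of view scales" falls straight out of the slope array: one hyperparameter-free line gives each head a different effective receptive field.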
Anchored to 06-transformers/positional-encoding from the learning path.