Positional Encoding
Self-attention is permutation-equivariant: shuffle the input tokens, and the output shuffles with them. Without telling it positions, “the dog bit the man” and “the man bit the dog” are identical to a transformer.
We need to inject position information.
Sinusoidal positional encoding (original)
The 2017 transformer used fixed sinusoidal encodings added to input embeddings:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
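As a concrete sketch of those formulas (assuming PyTorch and an even d; sinusoidal_pe is an illustrative name, not from any particular codebase):

import torch

def sinusoidal_pe(max_len, d):
    pos = torch.arange(max_len, dtype=torch.float32)[:, None]      # (max_len, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)[None, :]    # even dims 2i, shape (1, d/2)
    angles = pos / (10000.0 ** (two_i / d))                        # pos / 10000^(2i/d)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angles)   # PE(pos, 2i+1)
    return pe                         # added to the input embeddings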
Pros:
- Allows extrapolation beyond training context length (in theory).
- No parameters.
Cons:
- Doesn’t actually extrapolate well in practice.
- Hard for the model to learn relative positions from absolute ones.
Learned absolute positional embeddings
Each position gets a learned vector, added to the token embedding:
self.pos_embed = nn.Embedding(max_seq_len, d_model)   # one learned vector per absolute position
positions = torch.arange(tokens.size(1), device=tokens.device)
x = self.token_embed(tokens) + self.pos_embed(positions)
Used in GPT-2, GPT-3 (early), BERT.
Pros:
- Simple, effective up to training length.
Cons:
- Cannot extrapolate beyond training length at all.
- Tied to fixed max sequence length.
Relative positional encoding
Instead of “position 47,” encode “this token is 5 ahead of that one.” Multiple flavors:
- T5 relative bias: add a learned scalar to attention scores depending on (i − j).
- Shaw et al. (2018): learned vectors per relative position, added inside attention.
- Transformer-XL / DeBERTa: variations with disentangled position attention.
Better extrapolation than absolute encodings; harder to implement correctly.
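A minimal sketch of the learned-bias flavor, assuming PyTorch. It clips distances at a hypothetical max_dist instead of using T5's log-spaced buckets, and n_heads is illustrative:

import torch
import torch.nn as nn

n_heads, max_dist = 8, 16
rel_bias = nn.Embedding(2 * max_dist + 1, n_heads)   # one learned scalar per (distance, head)

def relative_bias(seq_len):
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist) + max_dist  # shift to [0, 2*max_dist]
    return rel_bias(rel).permute(2, 0, 1)            # (n_heads, seq_len, seq_len)

# added to raw attention scores of shape (batch, n_heads, seq_len, seq_len):
# scores = scores + relative_bias(seq_len)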
RoPE (Rotary Position Embedding)
Su et al. (2021), now the dominant choice. Used in LLaMA, Qwen, Mistral, GPT-NeoX, most modern open models.
The trick: rather than adding position info, rotate Q and K vectors by an angle proportional to position.
For a 2D pair (x_1, x_2) and position m:
[ cos(mθ) -sin(mθ) ] [ x_1 ]
[ sin(mθ) cos(mθ) ] [ x_2 ]
Apply this for every pair of dimensions, with frequencies θ_i = 10000^(-2i/d).
Why it’s clever: the dot product RoPE(q_m) · RoPE(k_n) depends only on q, k, and the relative offset m − n. Relative position emerges from purely absolute rotations, with no extra parameters and no extra ops.
import torch

def apply_rope(x, cos, sin):
    # half-split ("rotate-half") layout: dims [0, d/2) pair with [d/2, d)
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat([-x2, x1], dim=-1)   # (x1, x2) -> (-x2, x1), a 90-degree rotation
    return x * cos + rotated * sin           # cos/sin broadcast against x
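The cos and sin tensors come from a per-position cache built with those frequencies. A minimal builder consistent with the half-split layout above (rope_cache is an illustrative name):

import torch

def rope_cache(seq_len, d, base=10000.0):
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)              # θ_i = base^(-2i/d)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]  # mθ_i
    angles = torch.cat([angles, angles], dim=-1)   # duplicate so each half sees the same angle
    return angles.cos(), angles.sin()              # each (seq_len, d)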
RoPE extension techniques
RoPE alone doesn’t generalize past its training length, but it can be extended at inference time without retraining:
- Position interpolation (PI): scale positions down (e.g. divide by 4 to fit a 4× longer context). Some quality loss.
- NTK scaling: smooth interpolation that preserves high-frequency components.
- YaRN: improvement on NTK scaling that interpolates the low-frequency bands while leaving high frequencies mostly untouched, plus an attention-temperature correction.
- LongRoPE: optimized scaling factors per frequency band.
These let you go from 4k → 128k context with minimal fine-tuning.
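Two of these amount to small changes in how the rope_cache sketch above is built. A hedged sketch (the NTK base formula below is the commonly cited "NTK-aware" scaling; scale is the context extension factor):

import torch

def rope_cache_extended(seq_len, d, scale=4.0, base=10000.0, mode="pi"):
    pos = torch.arange(seq_len, dtype=torch.float32)
    if mode == "pi":
        pos = pos / scale                         # PI: squeeze positions into the trained range
    else:
        base = base * scale ** (d / (d - 2))      # NTK-aware: stretch the base instead
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.cat([pos[:, None] * inv_freq[None, :]] * 2, dim=-1)
    return angles.cos(), angles.sin()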
ALiBi (Attention with Linear Biases)
Press et al. (2022). No positional encoding at all — just add a position-dependent bias to attention scores:
attention_score(i, j) = q_i · k_j − m · |i − j|
Where m is a per-head fixed slope. Closer tokens get higher scores; distant tokens are penalized linearly.
Pros:
- Extrapolates well to longer contexts than seen during training.
- Simple to implement.
Cons:
- Empirically, RoPE-based models outperform ALiBi at the frontier as of 2026.
Used in BLOOM, MPT, some research models. Less common at the frontier now.
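A minimal sketch of the bias, assuming PyTorch; the geometric slopes 2^(−8h/n_heads) follow the paper's recipe for power-of-two head counts:

import torch

def alibi_bias(n_heads, seq_len):
    slopes = 2.0 ** (-8.0 * torch.arange(1, n_heads + 1, dtype=torch.float32) / n_heads)
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()   # |i − j|
    return -slopes[:, None, None] * dist[None, :, :]     # (n_heads, seq_len, seq_len), added to scores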
Hybrid and recent approaches
- xPos: variant of RoPE with better long-range behavior.
- RoPE with a high base (e.g. base = 1M instead of 10k): slower-rotating frequencies give a longer effective context at training time (sketched below).
- ALiBi-RoPE hybrids: research-grade combinations.
The frontier is moving toward RoPE with carefully chosen frequency bases plus interpolation/YaRN at inference for context windows of 128k–1M+.
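With a cache builder like the rope_cache sketch above, the high-base variant is a one-line change:

cos, sin = rope_cache(seq_len, d, base=1_000_000.0)   # slower-rotating frequencies, longer reach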
Position IDs in practice
For decoder-only generation, positions are 0, 1, 2, … at every step. With KV caching, each new token gets the next sequential position.
For chat-formatted inputs, positions still increase monotonically through the entire prompt + completion.
For tool calling and multi-turn, you must respect this — interleaved generation requires careful position bookkeeping.
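A sketch of that bookkeeping during cached greedy decoding; the model signature (position_ids, past_key_values) is a hypothetical stand-in for whatever your stack uses:

import torch

def greedy_decode(model, prompt_ids, max_new_tokens):
    # prompt occupies positions 0 .. len-1; each new token takes the next one
    pos = torch.arange(prompt_ids.size(1))[None, :]
    logits, kv = model(prompt_ids, position_ids=pos, past_key_values=None)  # hypothetical signature
    tok = logits[:, -1].argmax(-1, keepdim=True)
    out, cache_len = [tok], prompt_ids.size(1)
    for _ in range(max_new_tokens - 1):
        pos = torch.tensor([[cache_len]])          # one new sequential position per step
        logits, kv = model(tok, position_ids=pos, past_key_values=kv)
        tok = logits[:, -1].argmax(-1, keepdim=True)
        out.append(tok)
        cache_len += 1
    return torch.cat(out, dim=1)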
Image and audio “positions”
Transformers also work on non-sequential data:
- Vision Transformers (ViT): positions are the 2D (row, col) coordinates of image patches. Often use 2D RoPE.
- Audio transformers: positions are time steps in spectrograms.
- Video: 3D (t, x, y) positions.
The same machinery — assign positions, encode them — adapts to any modality.
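For instance, assigning (row, col) positions to ViT patches is just a grid; the half/half dimension split for 2D RoPE noted in the comment is one common convention, not the only one:

import torch

rows, cols = 14, 14   # e.g. a 224×224 image in 16×16 patches
rr, cc = torch.meshgrid(torch.arange(rows), torch.arange(cols), indexing="ij")
pos_2d = torch.stack([rr, cc], dim=-1).reshape(-1, 2)   # (196, 2): (row, col) per patch
# 2D RoPE: rotate half of each head's dims by row position, the other half by column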
Practical advice
- For new transformer code, use RoPE.
- For long context, plan for RoPE + YaRN/PI.
- Be careful with position bookkeeping: reordering messages or running sliding-window inference can silently break relative offsets.
- Position embeddings are usually frozen during fine-tuning; if you need to extend context, plan that explicitly.