Positional Encoding

Self-attention is permutation-equivariant: shuffle the input tokens, and the output shuffles with them. Without position information, “the dog bit the man” and “the man bit the dog” look identical to a transformer.

We need to inject position information.

Sinusoidal positional encoding (original)

The 2017 transformer used fixed sinusoidal encodings added to input embeddings:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
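
A minimal sketch of building this table in PyTorch (assumes an even d_model; the helper name is mine):

import torch

def sinusoidal_pe(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    position = torch.arange(max_len).unsqueeze(1).float()             # (max_len, 1)
    div = 10000.0 ** (torch.arange(0, d_model, 2).float() / d_model)  # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div)
    pe[:, 1::2] = torch.cos(position / div)
    return pe  # added to the input embeddings; no learned parameters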

Pros:

  • Allows extrapolation beyond training context length (in theory).
  • No parameters.

Cons:

  • Doesn’t actually extrapolate well in practice.
  • Hard for the model to learn relative positions from absolute ones.

Learned absolute positional embeddings

Each position gets a learned vector, added to the token embedding:

self.tok_embed = nn.Embedding(vocab_size, d_model)
self.pos_embed = nn.Embedding(max_seq_len, d_model)
# positions = torch.arange(seq_len), broadcast over the batch
x = self.tok_embed(tokens) + self.pos_embed(positions)

Used in GPT-2, GPT-3 (early), BERT.

Pros:

  • Simple, effective up to training length.

Cons:

  • Cannot extrapolate beyond training length at all.
  • Tied to fixed max sequence length.

Relative positional encoding

Instead of “position 47,” encode “this token is 5 ahead of that one.” Multiple flavors:

  • T5 relative bias: add a learned scalar to attention scores depending on (i − j).
  • Shaw et al. (2018): learned vectors per relative position, added inside attention.
  • Transformer-XL and DeBERTa: variations that inject relative positions directly into the attention computation (DeBERDa uses disentangled content/position attention).

Better extrapolation than absolute encodings; harder to implement correctly.
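
As a concrete illustration, here is a minimal sketch of a T5-style relative bias. It is simplified (distances are clipped rather than log-bucketed as in T5) and the class name is mine:

import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    # One learned scalar per (head, clipped relative distance), added to the
    # raw attention scores before softmax.
    def __init__(self, num_heads, max_distance=128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, q_len, k_len):
        # relative offset (i - j) for every query/key pair
        rel = torch.arange(q_len)[:, None] - torch.arange(k_len)[None, :]
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        # (q_len, k_len, num_heads) -> (num_heads, q_len, k_len), broadcast over batch
        return self.bias(rel).permute(2, 0, 1)

# usage: scores = scores + relative_bias(q_len, k_len)   # scores: (B, H, q_len, k_len)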

RoPE (Rotary Position Embedding)

Su et al. (2021), now the dominant choice. Used in LLaMA, Qwen, Mistral, GPT-NeoX, most modern open models.

The trick: rather than adding position info, rotate Q and K vectors by an angle proportional to position.

For a 2D pair (x_1, x_2) and position m:

[ cos(mθ)  -sin(mθ) ] [ x_1 ]
[ sin(mθ)   cos(mθ) ] [ x_2 ]

Apply this for every pair of dimensions, with frequencies θ_i = 10000^(-2i/d).

Why it’s clever: the dot product RoPE(q, m) · RoPE(k, n) depends only on q, k, and the offset m − n. Attention sees relative position for free, with no extra parameters and no extra ops.

import torch

def apply_rope(x, cos, sin):
    # x: (..., d); cos/sin broadcast against x, with each frequency repeated twice
    x1, x2 = x.chunk(2, dim=-1)              # split into the two halves to rotate
    rotated = torch.cat([-x2, x1], dim=-1)   # "rotate half": (x1, x2) -> (-x2, x1)
    return x * cos + rotated * sin
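
The cos/sin tables can be precomputed once per sequence length. A minimal sketch under the θ_i = 10000^(-2i/d) convention, using the same repeated-frequency layout apply_rope expects (the helper name is mine):

import torch

def rope_tables(seq_len, d, base=10000.0):
    # theta_i = base^(-2i/d) for i = 0 .. d/2 - 1
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, d/2)
    angles = torch.cat([angles, angles], dim=-1)                   # repeated-frequency layout
    return angles.cos(), angles.sin()                              # each (seq_len, d)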

RoPE extension techniques

RoPE alone doesn’t generalize past its training length, but it can be extended at inference time without retraining:

  • Position interpolation (PI): scale positions down (e.g. divide by 4 to fit a 4× longer context). Some quality loss.
  • NTK scaling: smooth interpolation that preserves high-frequency components.
  • YaRN: refines NTK scaling by treating frequency bands differently (interpolate low frequencies, leave high frequencies mostly alone) plus an attention temperature adjustment.
  • LongRoPE: optimized scaling factors per frequency band.

These let you go from 4k → 128k context with minimal fine-tuning.
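
Position interpolation, for example, can be folded straight into the table construction: compress positions by the extension factor so a longer sequence maps back into the trained range. An illustrative sketch (names are mine):

import torch

def rope_tables_interpolated(seq_len, d, train_len, base=10000.0):
    # Position interpolation: squeeze positions so a context longer than the
    # training window maps back into the position range the model has seen.
    scale = train_len / seq_len if seq_len > train_len else 1.0
    positions = torch.arange(seq_len).float() * scale    # e.g. 4x longer context -> /4
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)
    angles = torch.outer(positions, inv_freq)
    angles = torch.cat([angles, angles], dim=-1)
    return angles.cos(), angles.sin()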

ALiBi (Attention with Linear Biases)

Press et al. (2022). No positional encoding at all — just add a position-dependent bias to attention scores:

attention_score(i, j) = q_i · k_j − m · |i − j|

Where m is a per-head fixed slope. Closer tokens get higher scores; distant tokens are penalized linearly.
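
A sketch of the bias construction, using the paper's slope scheme for head counts that are powers of two (the helper name is mine):

import torch

def alibi_bias(num_heads, seq_len):
    # Per-head slopes: the geometric sequence 2^(-8/H), 2^(-16/H), ...
    start = 2.0 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (h + 1) for h in range(num_heads)])
    # distance |i - j| for every query/key pair
    dist = (torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]).abs()
    # (num_heads, seq_len, seq_len), added to attention scores before softmax
    return -slopes[:, None, None] * dist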

Pros:

  • Extrapolates well to longer contexts than seen during training.
  • Simple to implement.

Cons:

  • Empirically, RoPE-based models have won out; as of 2026, frontier models almost universally use RoPE rather than ALiBi.

Used in BLOOM, MPT, some research models. Less common at the frontier now.

Hybrid and recent approaches

  • xPos: variant of RoPE with better long-range behavior.
  • RoPE with a higher base (e.g. base = 1M instead of 10k): slows the rotation frequencies so the model can be trained with a longer effective context.
  • ALiBi-RoPE hybrids: research-grade combinations.

The frontier is moving toward RoPE with carefully chosen frequency bases plus interpolation/YaRN at inference for context windows of 128k–1M+.

Position IDs in practice

For decoder-only generation, positions simply count upward: 0, 1, 2, … across the whole sequence. With KV caching, each newly generated token takes the next sequential position rather than restarting from 0.

For chat-formatted inputs, positions still increase monotonically through the entire prompt + completion.

For tool calling and multi-turn conversations the same rule applies: positions keep increasing across interleaved tool output and generation, which requires careful bookkeeping.
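
A minimal sketch of that bookkeeping during cached decoding (illustrative only; the model call is stubbed out and the sizes are arbitrary):

import torch

prompt_len, max_new_tokens = 32, 4        # illustrative sizes

# prompt positions are just 0 .. prompt_len - 1
prompt_positions = torch.arange(prompt_len)

# with a KV cache, each new token's position is the current cache length
past_len = prompt_len
for _ in range(max_new_tokens):
    new_position = torch.tensor([past_len])   # keeps counting upward, never resets
    # logits = model(next_token, position_ids=new_position, past_key_values=cache)
    past_len += 1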

Image and audio “positions”

Transformers also work on non-sequential data:

  • Vision Transformers (ViT): positions index the 2D (row, col) grid of image patches; the original ViT learned embeddings over flattened patches, and newer vision models often use 2D RoPE.
  • Audio transformers: positions are time steps in spectrograms.
  • Video: 3D (t, x, y) positions.

The same machinery — assign positions, encode them — adapts to any modality.
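
A tiny example of the position-assignment step for image patches (just the (row, col) bookkeeping, not a full 2D encoding; sizes are illustrative):

import torch

h_patches, w_patches = 14, 14   # e.g. a 224x224 image split into 16x16 patches
rows = torch.arange(h_patches).repeat_interleave(w_patches)   # row index per patch
cols = torch.arange(w_patches).repeat(h_patches)              # col index per patch
# a 2D scheme encodes (rows, cols) per patch, e.g. RoPE applied separately per axis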

Practical advice

  1. For new transformer code, use RoPE.
  2. For long context, plan for RoPE + YaRN/PI.
  3. Be careful with position bookkeeping when you reorder messages or use sliding-window inference; it is easy to silently feed the wrong position IDs.
  4. Learned position embeddings are usually left frozen during fine-tuning; if you need to extend context, plan it explicitly (e.g. RoPE scaling followed by long-context fine-tuning).

See also