Positional Encoding

Self-attention is permutation-equivariant: shuffle the input tokens, and the output shuffles with them. Without position information, “the dog bit the man” and “the man bit the dog” look identical to a transformer.

We need to inject position information.

Sinusoidal positional encoding (original)

The 2017 transformer used fixed sinusoidal encodings added to input embeddings:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
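
A minimal sketch of building this table in PyTorch (assumes an even d_model; the helper name is mine):

import torch

def sinusoidal_pe(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    position = torch.arange(max_len).unsqueeze(1).float()             # (max_len, 1)
    div = 10000.0 ** (torch.arange(0, d_model, 2).float() / d_model)  # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div)
    pe[:, 1::2] = torch.cos(position / div)
    return pe  # added to the input embeddings; no learned parameters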

Pros:

  • Allows extrapolation beyond training context length (in theory).
  • No parameters.

Cons:

  • Doesn’t actually extrapolate well in practice.
  • Hard for the model to learn relative positions from absolute ones.

Learned absolute positional embeddings

Each position gets a learned vector, added to the token embedding:

self.tok_embed = nn.Embedding(vocab_size, d_model)
self.pos_embed = nn.Embedding(max_seq_len, d_model)
# positions = torch.arange(seq_len), broadcast over the batch
x = self.tok_embed(tokens) + self.pos_embed(positions)

Used in GPT-2, GPT-3 (early), BERT.

Pros:

  • Simple, effective up to training length.

Cons:

  • Cannot extrapolate beyond training length at all.
  • Tied to fixed max sequence length.

Relative positional encoding

Instead of “position 47,” encode “this token is 5 ahead of that one.” Multiple flavors:

  • T5 relative bias: add a learned scalar to attention scores depending on (i − j).
  • Shaw et al. (2018): learned vectors per relative position, added inside attention.
  • Transformer-XL and DeBERTa: variations that inject relative positions directly into the attention computation (DeBERDa uses disentangled content/position attention).

Better extrapolation than absolute encodings; harder to implement correctly.
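
As a concrete illustration, here is a minimal sketch of a T5-style relative bias. It is simplified (distances are clipped rather than log-bucketed as in T5) and the class name is mine:

import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    # One learned scalar per (head, clipped relative distance), added to the
    # raw attention scores before softmax.
    def __init__(self, num_heads, max_distance=128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, q_len, k_len):
        # relative offset (i - j) for every query/key pair
        rel = torch.arange(q_len)[:, None] - torch.arange(k_len)[None, :]
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        # (q_len, k_len, num_heads) -> (num_heads, q_len, k_len), broadcast over batch
        return self.bias(rel).permute(2, 0, 1)

# usage: scores = scores + relative_bias(q_len, k_len)   # scores: (B, H, q_len, k_len)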

RoPE (Rotary Position Embedding)

Su et al. (2021), now the dominant choice. Used in LLaMA, Qwen, Mistral, GPT-NeoX, most modern open models.

The trick: rather than adding position info, rotate Q and K vectors by an angle proportional to position.

For a 2D pair (x_1, x_2) and position m:

[ cos(mθ)  -sin(mθ) ] [ x_1 ]
[ sin(mθ)   cos(mθ) ] [ x_2 ]

Apply this for every pair of dimensions, with frequencies θ_i = 10000^(-2i/d).

Why it’s clever: the dot product RoPE(q, m) · RoPE(k, n) depends only on q, k, and the offset m − n. Attention sees relative position for free, with no extra parameters and no extra ops.

import torch

def apply_rope(x, cos, sin):
    # x: (..., d); cos/sin broadcast against x, with each frequency repeated twice
    x1, x2 = x.chunk(2, dim=-1)              # split into the two halves to rotate
    rotated = torch.cat([-x2, x1], dim=-1)   # "rotate half": (x1, x2) -> (-x2, x1)
    return x * cos + rotated * sin
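
The cos/sin tables can be precomputed once per sequence length. A minimal sketch under the θ_i = 10000^(-2i/d) convention, using the same repeated-frequency layout apply_rope expects (the helper name is mine):

import torch

def rope_tables(seq_len, d, base=10000.0):
    # theta_i = base^(-2i/d) for i = 0 .. d/2 - 1
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, d/2)
    angles = torch.cat([angles, angles], dim=-1)                   # repeated-frequency layout
    return angles.cos(), angles.sin()                              # each (seq_len, d)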

RoPE extension techniques

RoPE alone doesn’t generalize past its training length, but it can be extended at inference time without retraining:

  • Position interpolation (PI): scale positions down (e.g. divide by 4 to fit a 4× longer context). Some quality loss.
  • NTK scaling: smooth interpolation that preserves high-frequency components.
  • YaRN: refines NTK scaling by treating frequency bands differently (interpolate low frequencies, leave high frequencies mostly alone) plus an attention temperature adjustment.
  • LongRoPE: optimized scaling factors per frequency band.

These let you go from 4k → 128k context with minimal fine-tuning.
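
Position interpolation, for example, can be folded straight into the table construction: compress positions by the extension factor so a longer sequence maps back into the trained range. An illustrative sketch (names are mine):

import torch

def rope_tables_interpolated(seq_len, d, train_len, base=10000.0):
    # Position interpolation: squeeze positions so a context longer than the
    # training window maps back into the position range the model has seen.
    scale = train_len / seq_len if seq_len > train_len else 1.0
    positions = torch.arange(seq_len).float() * scale    # e.g. 4x longer context -> /4
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)
    angles = torch.outer(positions, inv_freq)
    angles = torch.cat([angles, angles], dim=-1)
    return angles.cos(), angles.sin()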

ALiBi (Attention with Linear Biases)

Press et al. (2022). No positional encoding at all — just add a position-dependent bias to attention scores:

attention_score(i, j) = q_i · k_j − m · |i − j|

Where m is a per-head fixed slope. Closer tokens get higher scores; distant tokens are penalized linearly.
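
A sketch of the bias construction, using the paper's slope scheme for head counts that are powers of two (the helper name is mine):

import torch

def alibi_bias(num_heads, seq_len):
    # Per-head slopes: the geometric sequence 2^(-8/H), 2^(-16/H), ...
    start = 2.0 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (h + 1) for h in range(num_heads)])
    # distance |i - j| for every query/key pair
    dist = (torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]).abs()
    # (num_heads, seq_len, seq_len), added to attention scores before softmax
    return -slopes[:, None, None] * dist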

Pros:

  • Extrapolates well to longer contexts than seen during training.
  • Simple to implement.

Cons:

  • Empirically, RoPE-based models have won out; as of 2026, frontier models almost universally use RoPE rather than ALiBi.

Used in BLOOM, MPT, some research models. Less common at the frontier now.

Hybrid and recent approaches

  • xPos: variant of RoPE with better long-range behavior.
  • RoPE with a higher base (e.g. base = 1M instead of 10k): slows the rotation frequencies so the model can be trained with a longer effective context.
  • ALiBi-RoPE hybrids: research-grade combinations.

The frontier is moving toward RoPE with carefully chosen frequency bases plus interpolation/YaRN at inference for context windows of 128k–1M+.

Position IDs in practice

For decoder-only generation, positions simply count upward: 0, 1, 2, … across the whole sequence. With KV caching, each newly generated token takes the next sequential position rather than restarting from 0.

For chat-formatted inputs, positions still increase monotonically through the entire prompt + completion.

For tool calling and multi-turn conversations the same rule applies: positions keep increasing across interleaved tool output and generation, which requires careful bookkeeping.
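
A minimal sketch of that bookkeeping during cached decoding (illustrative only; the model call is stubbed out and the sizes are arbitrary):

import torch

prompt_len, max_new_tokens = 32, 4        # illustrative sizes

# prompt positions are just 0 .. prompt_len - 1
prompt_positions = torch.arange(prompt_len)

# with a KV cache, each new token's position is the current cache length
past_len = prompt_len
for _ in range(max_new_tokens):
    new_position = torch.tensor([past_len])   # keeps counting upward, never resets
    # logits = model(next_token, position_ids=new_position, past_key_values=cache)
    past_len += 1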

Image and audio “positions”

Transformers also work on non-sequential data:

  • Vision Transformers (ViT): positions index the 2D (row, col) grid of image patches; the original ViT learned embeddings over flattened patches, and newer vision models often use 2D RoPE.
  • Audio transformers: positions are time steps in spectrograms.
  • Video: 3D (t, x, y) positions.

The same machinery — assign positions, encode them — adapts to any modality.
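
A tiny example of the position-assignment step for image patches (just the (row, col) bookkeeping, not a full 2D encoding; sizes are illustrative):

import torch

h_patches, w_patches = 14, 14   # e.g. a 224x224 image split into 16x16 patches
rows = torch.arange(h_patches).repeat_interleave(w_patches)   # row index per patch
cols = torch.arange(w_patches).repeat(h_patches)              # col index per patch
# a 2D scheme encodes (rows, cols) per patch, e.g. RoPE applied separately per axis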

Practical advice

  1. For new transformer code, use RoPE.
  2. For long context, plan for RoPE + YaRN/PI.
  3. Be careful with position bookkeeping when you reorder messages or use sliding-window inference; it is easy to silently feed the wrong position IDs.
  4. Learned position embeddings are usually left frozen during fine-tuning; if you need to extend context, plan it explicitly (e.g. RoPE scaling followed by long-context fine-tuning).

See also