Beating the O(N²) wall
Naive attention costs scale as the square of context length. Modern long-context models use sliding windows, sparse patterns, and global anchors to keep the cost reasonable. Visualize each as an attention mask.
The cost math
# active attention cells per pattern (out of N² possible):
full attention: N² # the wall
sliding window (w): N · (2w + 1) # linear in N, w is fixed (~4096)
strided (stride s): N · (N / s) # quadratic but smaller constant
global + sliding: N · (2w + 1) + 2 · g · N
# g "anchors" attend to all tokens, and all tokens attend back to them
combined (Longformer): linear in N if w and g are fixed

Try this — predict before you click
- Pick full attention, set seq=64. Predict: all 4,096 cells active (or 2,080 with causal mask). This is the wall — quadratic in N, the reason 32K context costs 256× more than 2K context.
- Switch to sliding window with w=4. Predict: active cells drop to ~9·N (≤ 576 for N=64). Mask becomes a thick diagonal band. Cost is now LINEAR in N — but distant tokens can't attend to each other directly.
- Switch to global + sliding with w=4, global=2. Predict: the band stays, plus 2 vertical full columns (the global tokens) and 2 horizontal full rows. Distant tokens can now reach each other through global "anchors" — like BOS/EOS tokens or Longformer's [CLS]-style hub.
- On sliding window, set w=1 (smallest possible). Predict: each token can only attend to itself + its immediate neighbors. Information travels at most w positions per layer, so spanning the sequence takes roughly ⌈(N−1)/w⌉ layers — which is why narrow sliding windows need deep stacks.
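The predicted cell counts can be checked directly by building each mask with NumPy. A minimal sketch, assuming N=64, w=4, g=2 as in the steps above; treating the first g tokens as the global anchors is an illustrative choice, not something the demo mandates:

```python
import numpy as np

N, w, g = 64, 4, 2
idx = np.arange(N)
ii, jj = idx[:, None], idx[None, :]      # query rows, key columns (broadcast)

full = np.ones((N, N), dtype=bool)
causal = jj <= ii                        # lower triangle
sliding = np.abs(ii - jj) <= w           # diagonal band, width 2w + 1
anchors = (ii < g) | (jj < g)            # assumed: first g tokens are global
global_sliding = sliding | anchors

for name, mask in [("full", full), ("full+causal", causal),
                   ("sliding w=4", sliding),
                   ("global+sliding", global_sliding)]:
    print(f"{name:15s} {int(mask.sum()):5d} / {N * N} cells active")
```

Edge tokens have truncated windows, which is why the sliding count lands just under the 9·N = 576 bound from the walkthrough.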
Anchored to 07-modern-llms/long-context.
Code-side: /ship/14 — cost and latency.
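The layer-depth point from the last prediction can also be measured rather than derived: compose a sliding-window mask with itself, one "layer" at a time, until token 0's receptive field covers token N−1. A sketch under the same assumptions as the demo; the helper name is made up for illustration:

```python
import numpy as np

def layers_to_connect(N, w):
    """Layers of width-w sliding-window attention needed before
    token 0's receptive field includes token N-1."""
    idx = np.arange(N)
    step = np.abs(idx[:, None] - idx[None, :]) <= w   # one layer's reach
    reach = step.copy()
    layers = 1
    while not reach[0, N - 1]:
        # boolean matrix product = "reachable in one more layer"
        reach = (reach.astype(np.int32) @ step.astype(np.int32)) > 0
        layers += 1
    return layers

print(layers_to_connect(64, 1))   # 63: info crawls 1 position per layer
print(layers_to_connect(64, 4))   # 16 = ceil(63 / 4)
```

Since information moves at most w positions per layer, the count matches ⌈(N−1)/w⌉ — the narrower the window, the deeper the stack must be.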