Beating the O(N²) wall

Naive attention cost scales as the square of context length. Modern long-context models use sliding windows, strided sparse patterns, and global anchors to keep that cost manageable. Visualize each one as an attention mask.

The cost math

# active attention cells per pattern (out of N² possible):

full attention:       N²              # the wall
sliding window (w):   N · (2w + 1)    # linear in N, w is fixed (~4096)
strided (stride s):   N · (N / s)     # quadratic but smaller constant
global + sliding:     N · (2w + 1) + 2g · N
                                       # g "anchors" attend to all, and all attend to them
combined (Longformer): linear in N, since w and g are fixed
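
A quick way to sanity-check these counts: build each pattern as a boolean N×N mask and sum the active cells. A minimal sketch in Python/NumPy (the helper names, full_mask and friends, are ours for illustration, not from any library):

import numpy as np

def full_mask(n):
    # every query attends to every key: N² cells
    return np.ones((n, n), dtype=bool)

def sliding_window_mask(n, w):
    # query i attends to keys within distance w: ≈ N·(2w+1) cells
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def strided_mask(n, s):
    # query i attends to keys j with (i - j) % s == 0: N·(N/s) cells
    idx = np.arange(n)
    return (idx[:, None] - idx[None, :]) % s == 0

def global_plus_sliding_mask(n, w, g):
    # the first g tokens are global anchors: full rows AND full columns
    m = sliding_window_mask(n, w)
    m[:g, :] = True   # anchors attend to everything
    m[:, :g] = True   # everything attends to anchors
    return m

n = 64
print(full_mask(n).sum())                       # 4096 = N²
print(sliding_window_mask(n, 4).sum())          # 556 (≤ 9·N = 576; band clipped at edges)
print(strided_mask(n, 8).sum())                 # 512 = N · (N/s)
print(global_plus_sliding_mask(n, 4, 2).sum())  # 790 (band + 2 rows + 2 cols, minus overlap)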

Try this — predict before you click

  1. Pick full attention, set seq=64. Predict: all 4,096 cells active (or 2,080 with a causal mask). This is the wall — quadratic in N, the reason 32K context costs 256× more than 2K context (a 16× longer sequence, 16² = 256× the cells).
  2. Switch to sliding window with w=4. Predict: active cells drop to ~9·N (≤ 576 for N=64). Mask becomes a thick diagonal band. Cost is now LINEAR in N — but distant tokens can't attend to each other directly.
  3. Switch to global + sliding with w=4, global=2. Predict: the band stays, plus 2 vertical full columns (the global tokens) and 2 horizontal full rows. Distant tokens can now reach each other through global "anchors" — like BOS/EOS tokens or Longformer's [CLS]-style hub.
  4. On sliding window, set w=1 (smallest possible). Predict: each token can attend only to itself and its immediate neighbors. Information needs roughly ⌈(N−1)/w⌉ layers to cross the sequence (63 layers for N=64, w=1) — which is why narrow sliding windows need deep stacks. The sketch after this list checks this, and step 1's counts.
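
To check the predictions without clicking, continue the sketch above (same import and helpers): the causal count from step 1, and for step 4, repeatedly compose the w=1 band with itself until every token can reach every other.

n = 64
# step 1: full vs. causal counts
causal = np.tril(full_mask(n))
print(full_mask(n).sum(), causal.sum())   # 4096, 2080 = N·(N+1)/2

# step 4: with w=1, layers until information can cross the whole sequence.
# Each layer extends reach by w positions, so expect ⌈(N−1)/w⌉ = 63 layers.
band = sliding_window_mask(n, 1)
reach, layers = band, 1
while not reach.all():
    # compose one more attention layer; keep reachability boolean to avoid overflow
    reach = (reach.astype(int) @ band.astype(int)) > 0
    layers += 1
print(layers)   # 63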

Anchored to 07-modern-llms/long-context. Code-side: /ship/14 — cost and latency.