demo

Why every inference engine caches

Toggle the KV cache on and off and watch the per-step compute count explode. The cache is the single biggest reason modern inference is fast, and the single biggest reason it's memory-hungry.

Try this — predict before you click

  1. With cache OFF, scrub to step 8. Predict: cumulative cells computed = 1+2+3+...+8 = 36. The per-step cost equals the step number, so total work grows quadratically (N(N+1)/2 ≈ N²/2); see the counting sketch after this list.
  2. Same step 8 with cache ON. Predict: cumulative cells = 8 (one new K/V entry per step). Total work is linear, plus the memory tax of holding all 8 cached K/V entries in GPU memory.
  3. Toggle between modes mid-stream. Predict: the complexity label flips between O(N²) and O(N), and the per-step speedup ratio is the step number itself: at step 50 the cached path is 50× faster, at step 1000 it is 1000× faster (see the ratio check below). This is why long contexts are unworkable without KV caching.
  4. Look at the K and V matrices. Predict: with caching on, each step adds exactly one new row (the new token's K and V) and reuses every prior row. Without caching, every cell is recomputed each step; the matrix is ephemeral. The toy cache sketch below shows the same growth.
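
As a companion to items 1 and 2, here is a minimal counting sketch (not the demo's actual code) of the convention assumed above: without a cache, step t recomputes K/V for all t tokens; with a cache, only the newest token's entry is computed.

```python
def cells_computed(steps: int, use_cache: bool) -> int:
    """Cumulative K/V entries computed after `steps` decode steps."""
    total = 0
    for t in range(1, steps + 1):
        if use_cache:
            total += 1   # only the newest token's K/V is computed
        else:
            total += t   # K/V for all t tokens so far is recomputed
    return total

print(cells_computed(8, use_cache=False))  # 1+2+...+8 = 36
print(cells_computed(8, use_cache=True))   # 8
```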
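
The per-step speedup in item 3 follows from the same convention: uncached step t costs t cells, cached step t costs 1, so the ratio at step t is t. A quick check, with the step values chosen only for illustration:

```python
for step in (8, 50, 1000):
    uncached_per_step = step  # recompute K/V for every token seen so far
    cached_per_step = 1       # compute K/V only for the newest token
    print(f"step {step}: per-step speedup = {uncached_per_step // cached_per_step}x")
# step 8: 8x, step 50: 50x, step 1000: 1000x
```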
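
And for item 4, a toy cache sketch, assuming a made-up 4-dimensional head and stand-in projections rather than real learned weights: with caching, each decode step appends exactly one new row and reuses every prior one.

```python
import numpy as np

d_head = 4                 # assumed toy head dimension
k_cache, v_cache = [], []  # grow by one row per decoded token

def decode_step(token_embedding: np.ndarray):
    """Append the new token's K and V; prior rows are never recomputed."""
    # Stand-in projections: real models apply learned W_K / W_V matrices.
    k_cache.append(token_embedding * 0.5)
    v_cache.append(token_embedding * 2.0)
    K = np.stack(k_cache)  # shape (steps_so_far, d_head)
    V = np.stack(v_cache)
    return K, V

for step in range(1, 4):
    K, V = decode_step(np.random.randn(d_head))
    print(f"step {step}: K shape {K.shape}")  # (1, 4), (2, 4), (3, 4)
```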

Anchored to 06-transformers/transformer-block and 13-production/cost-and-latency. Code-side: /ship/14 — cost and latency.