demo
Why every inference engine caches
Toggle the KV cache on and off and watch the per-step compute count explode. Caching is the single biggest reason modern inference is fast, and the single biggest reason it's memory-hungry.
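A minimal sketch of what the toggle changes, in NumPy with a single head and toy dimensions (the weights, sizes, and row counting here are illustrative assumptions, not the demo's actual code): without the cache, every step rebuilds K and V for the whole prefix; with it, each step appends one new row.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def step_no_cache(x_seq):
    """One decode step with no KV cache: recompute K and V for every prefix token."""
    K = x_seq @ Wk                        # (t, d) rows rebuilt from scratch
    V = x_seq @ Wv
    q = x_seq[-1] @ Wq                    # query for the newest token only
    scores = (q @ K.T) / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V, len(x_seq)              # output, K/V rows computed this step

def step_with_cache(x_new, cache):
    """One decode step with a KV cache: compute K and V for the new token only."""
    cache["K"].append(x_new @ Wk)
    cache["V"].append(x_new @ Wv)
    q = x_new @ Wq
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    scores = (q @ K.T) / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V, 1                       # only one new K/V row computed

tokens = rng.standard_normal((8, d))      # stand-ins for 8 token embeddings
cache = {"K": [], "V": []}
no_cache_rows = cache_rows = 0
for t in range(1, len(tokens) + 1):
    _, n = step_no_cache(tokens[:t]);             no_cache_rows += n
    _, m = step_with_cache(tokens[t - 1], cache); cache_rows += m
print(no_cache_rows, cache_rows)          # 36 vs 8 after step 8
```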
Try this — predict before you click
- With cache OFF, scrub to step 8. Predict: cumulative cells computed = 1+2+3+...+8 = 36. The per-step cost is the step number — quadratic growth in total work.
- Same step 8 with cache ON. Predict: cumulative cells = 8 (one new K/V pair per step). Linear total work, plus the memory tax of holding all 8 K/V pairs in GPU memory; the counting and memory sketches after this list spell out both costs.
- Toggle between modes mid-stream. Predict: the complexity label flips between O(N²) and O(N); the per-step speedup ratio is the step number itself. At step 50, cached is 50× faster; at step 1000, 1000× faster. This is why long contexts are unworkable without KV caching.
- Look at the K and V matrices. Predict: with caching on, each step adds exactly one new row (the new token's K and V) and reuses every prior row. Without caching, every cell is recomputed each step; the matrix is ephemeral.
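The counting behind these predictions is easy to verify. A quick sketch, assuming the demo counts one K/V computation per prefix token per step with the cache off and one per step with it on:

```python
def uncached_total(n: int) -> int:
    """Total K/V computations through step n with no cache: 1 + 2 + ... + n."""
    return n * (n + 1) // 2

def cached_total(n: int) -> int:
    """Total K/V computations through step n with the cache: one per step."""
    return n

for n in (8, 50, 1000):
    # Per-step speedup: step n costs n without the cache vs 1 with it.
    print(f"step {n:>4}: uncached {uncached_total(n):>7}  "
          f"cached {cached_total(n):>4}  per-step speedup {n}x")
# step    8: uncached      36  cached    8  per-step speedup 8x
# step   50: uncached    1275  cached   50  per-step speedup 50x
# step 1000: uncached  500500  cached 1000  per-step speedup 1000x
```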
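The memory tax scales just as directly: the cache holds one K and one V vector per token, per layer, per head. A rough footprint formula with assumed, 7B-class dimensions (32 layers, 32 heads, head dim 128, fp16), purely to give a sense of scale:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_el=2):
    """Approximate KV-cache size: K and V, for every layer, head, and token."""
    return 2 * n_layers * n_heads * head_dim * bytes_per_el * seq_len

for seq_len in (8, 1000, 4096):
    print(f"{seq_len:>5} tokens -> {kv_cache_bytes(seq_len) / 2**20:8.1f} MiB")
#     8 tokens ->      4.0 MiB
#  1000 tokens ->    500.0 MiB
#  4096 tokens ->   2048.0 MiB
```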
Anchored to 06-transformers/transformer-block and 13-production/cost-and-latency.
Code-side: /ship/14 — cost and latency.