demo
Why every inference engine caches
Toggle the KV cache on and off and watch the per-step compute count explode. Caching is the single biggest reason modern inference is fast, and the single biggest reason it's memory-hungry.
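A minimal sketch of what the toggle changes, in NumPy with a single head and toy dimensions (the weights, sizes, and row counting here are illustrative assumptions, not the demo's actual code): without the cache, every step rebuilds K and V for the whole prefix; with it, each step appends one new row.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def step_no_cache(x_seq):
    """One decode step with no KV cache: recompute K and V for every prefix token."""
    K = x_seq @ Wk                        # (t, d) rows rebuilt from scratch
    V = x_seq @ Wv
    q = x_seq[-1] @ Wq                    # query for the newest token only
    scores = (q @ K.T) / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V, len(x_seq)              # output, K/V rows computed this step

def step_with_cache(x_new, cache):
    """One decode step with a KV cache: compute K and V for the new token only."""
    cache["K"].append(x_new @ Wk)
    cache["V"].append(x_new @ Wv)
    q = x_new @ Wq
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    scores = (q @ K.T) / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V, 1                       # only one new K/V row computed

tokens = rng.standard_normal((8, d))      # stand-ins for 8 token embeddings
cache = {"K": [], "V": []}
no_cache_rows = cache_rows = 0
for t in range(1, len(tokens) + 1):
    _, n = step_no_cache(tokens[:t]);             no_cache_rows += n
    _, m = step_with_cache(tokens[t - 1], cache); cache_rows += m
print(no_cache_rows, cache_rows)          # 36 vs 8 after step 8
```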
Try this — predict before you click
- With cache OFF, scrub to step 8. Predict: cumulative cells computed = 1+2+3+...+8 = 36. The per-step cost is the step number — quadratic growth in total work.
- Same step 8 with cache ON. Predict: cumulative cells = 8 (one new K/V pair per step). Linear total work, plus the memory tax of holding all 8 K/V pairs in GPU memory; the counting and memory sketches after this list spell out both costs.
- Toggle between modes mid-stream. Predict: the complexity label flips between O(N²) and O(N); the per-step speedup ratio is the step number itself. At step 50, cached is 50× faster; at step 1000, 1000× faster. This is why long contexts are unworkable without KV caching.
- Look at the K and V matrices. Predict: with caching on, each step adds exactly one new row (the new token's K and V) and reuses every prior row. Without caching, every cell is recomputed each step; the matrix is ephemeral.
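The counting behind these predictions is easy to verify. A quick sketch, assuming the demo counts one K/V computation per prefix token per step with the cache off and one per step with it on:

```python
def uncached_total(n: int) -> int:
    """Total K/V computations through step n with no cache: 1 + 2 + ... + n."""
    return n * (n + 1) // 2

def cached_total(n: int) -> int:
    """Total K/V computations through step n with the cache: one per step."""
    return n

for n in (8, 50, 1000):
    # Per-step speedup: step n costs n without the cache vs 1 with it.
    print(f"step {n:>4}: uncached {uncached_total(n):>7}  "
          f"cached {cached_total(n):>4}  per-step speedup {n}x")
# step    8: uncached      36  cached    8  per-step speedup 8x
# step   50: uncached    1275  cached   50  per-step speedup 50x
# step 1000: uncached  500500  cached 1000  per-step speedup 1000x
```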
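The memory tax scales just as directly: the cache holds one K and one V vector per token, per layer, per head. A rough footprint formula with assumed, 7B-class dimensions (32 layers, 32 heads, head dim 128, fp16), purely to give a sense of scale:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_el=2):
    """Approximate KV-cache size: K and V, for every layer, head, and token."""
    return 2 * n_layers * n_heads * head_dim * bytes_per_el * seq_len

for seq_len in (8, 1000, 4096):
    print(f"{seq_len:>5} tokens -> {kv_cache_bytes(seq_len) / 2**20:8.1f} MiB")
#     8 tokens ->      4.0 MiB
#  1000 tokens ->    500.0 MiB
#  4096 tokens ->   2048.0 MiB
```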
Anchored to 06-transformers/transformer-block and 13-production/cost-and-latency.
Code-side: /ship/14 — cost and latency.