demo · animated + interactive
Why transformers won
Watch an RNN cell unroll across a sentence. See information from early tokens fade as the hidden state compresses more and more context into the same fixed-size vector. This is the memory bottleneck that motivated self-attention.
The recurrence
An RNN's recipe is simple: h_t = tanh(W_h · h_{t-1} + W_x · x_t). Same weights, applied at every timestep. The hidden state is everything the model "remembers", and it has to fit in a fixed-size vector.
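To make the recurrence concrete, here is a minimal numpy sketch of the unroll. The sizes, random weights, and the `rnn_step` helper are illustrative assumptions, not the demo's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, input_size = 8, 4  # toy sizes, chosen for illustration
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))

def rnn_step(h_prev, x_t):
    # One application of the recurrence: h_t = tanh(W_h h_{t-1} + W_x x_t)
    return np.tanh(W_h @ h_prev + W_x @ x_t)

tokens = rng.normal(size=(20, input_size))  # a "sentence" of 20 token embeddings
h = np.zeros(hidden_size)                   # fixed-size memory, same at every step
for x_t in tokens:
    h = rnn_step(h, x_t)                    # same weights reused at every timestep

print(h.shape)  # (8,) -- all 20 tokens squeezed into this one vector
```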
The bottleneck
Information from early tokens decays exponentially through the recurrence. By step 20, the contribution of token 1 might be 0.01% of the hidden state. This is vanishing gradients in optimization terms, and limited context in capability terms. LSTMs and GRUs ease the bottleneck with gating; transformers eliminate it.
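One way to see the decay directly is to perturb token 1 and track how much of that perturbation survives in the hidden state over time. A rough sketch, reusing the setup and `rnn_step` from the block above:

```python
tokens_perturbed = tokens.copy()
tokens_perturbed[0] += 1.0  # nudge only the very first token

h_a = np.zeros(hidden_size)
h_b = np.zeros(hidden_size)
for t, (x_a, x_b) in enumerate(zip(tokens, tokens_perturbed), start=1):
    h_a = rnn_step(h_a, x_a)
    h_b = rnn_step(h_b, x_b)
    # The gap between the two runs is token 1's surviving influence;
    # with contractive weights it typically shrinks roughly exponentially in t.
    print(t, np.linalg.norm(h_b - h_a))
```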
Try this
- Drop decay to 0.3. The model forgets almost everything from one step ago.
- Bump it to 0.95. Information persists, but the hidden state has to encode both the new token and old context, which gets crowded fast.
- Compare with Attention Inspector. Same sentence, but every token attends to every other directly. No bottleneck. (See the sketch after this list.)
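For contrast, here is what "every token attends to every other directly" looks like as a minimal single-head self-attention pass, reusing `tokens` from the first sketch. This omits the learned query/key/value projections a real layer would have, and `self_attention` is an illustrative name, not the Inspector's code:

```python
def self_attention(X):
    # Scaled dot-product attention with X serving as queries, keys, and values.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X             # each output mixes all inputs directly

out = self_attention(tokens)  # token 20 reaches token 1 in one step, undimmed
```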
Anchored to 04-language-modeling/rnns-and-lstms and 04-language-modeling/why-transformers.