demo · animated + interactive

Why transformers won

Watch an RNN cell unroll across a sentence. See information from early tokens fade away as the hidden state has to keep compressing more and more context into the same fixed-size vector. This is the memory bottleneck that motivated self-attention.

The recurrence

An RNN's recipe is simple: h_t = tanh(W_h · h_{t-1} + W_x · x_t). Same weights, applied at every timestep. The hidden state is everything the model "remembers" — and it has to fit in a fixed-size vector.
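
A minimal NumPy sketch of that recurrence, for readers who want the loop spelled out. The sizes, random weights, and random "tokens" are illustrative stand-ins, not the demo's actual parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    hidden_size, input_size = 8, 4                    # illustrative sizes
    W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
    W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))

    def rnn_step(h_prev, x_t):
        # Same weights at every timestep; h_t is all the model carries forward.
        return np.tanh(W_h @ h_prev + W_x @ x_t)

    h = np.zeros(hidden_size)
    for x_t in rng.normal(size=(20, input_size)):     # a 20-"token" sentence
        h = rnn_step(h, x_t)                          # fixed-size memory, every step

Whatever the model needs from token 1 at step 20 has to survive nineteen overwrites of that one vector.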

The bottleneck

Information from early tokens decays exponentially through the recurrence. By step 20, the contribution of token 1 might be 0.01% of the hidden state. This is vanishing gradients in optimization terms, and limited context in capability terms. LSTMs and GRUs ease the bottleneck with gating; transformers eliminate it.
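
To put a rough number on that decay, a back-of-the-envelope sketch: treat each step as scaling the old state's contribution by a single retention factor (a crude stand-in for the recurrent Jacobian). The 0.63 here is an assumed value chosen so the arithmetic lands near the figure quoted above:

    # Crude model: token 1's remaining share after t steps is about decay**t.
    decay = 0.63                  # assumed per-step retention factor
    for t in (1, 5, 10, 20):
        print(f"step {t:2d}: {decay ** t:.4%}")
    # step 20 prints roughly 0.0097% -- about the 0.01% mentioned above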

Try this

  1. Drop decay to 0.3. Information from even a step or two back is mostly gone.
  2. Bump it to 0.95. Information persists, but the hidden state has to encode both the new token and all the old context, and it gets crowded fast. (The sketch after this list puts rough numbers on both settings.)
  3. Compare with Attention Inspector. Same sentence, but every token attends to every other directly. No bottleneck.
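
A hedged way to replay steps 1 and 2 outside the demo: model the decay slider as a per-step retention factor and ask how much of each token survives to the final step. The step count and decay values mirror the list above; everything else is made up for illustration:

    # Share of token i that survives the remaining (n_steps - i) updates.
    n_steps = 20
    for decay in (0.3, 0.95):
        surviving = [decay ** (n_steps - i) for i in range(1, n_steps + 1)]
        print(f"decay={decay}: token 1 keeps {surviving[0]:.2e}, "
              f"token 19 keeps {surviving[-2]:.2f}")
    # decay=0.3 : early tokens are effectively gone after a few steps
    # decay=0.95: they persist, but every token competes for the same vector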

Anchored to 04-language-modeling/rnns-and-lstms and 04-language-modeling/why-transformers.