demo · animated + interactive

Why transformers won

Watch an RNN cell unroll across a sentence. See information from early tokens fade away as the hidden state has to keep compressing more and more context into the same fixed-size vector. This is the memory bottleneck that motivated self-attention.

The recurrence

An RNN's recipe is simple: h_t = tanh(W_h · h_{t-1} + W_x · x_t). Same weights, applied at every timestep. The hidden state is everything the model "remembers" — and it has to fit in a fixed-size vector.
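
A minimal NumPy sketch of that recurrence, for readers who want the loop spelled out. The sizes, random weights, and random "tokens" are illustrative stand-ins, not the demo's actual parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    hidden_size, input_size = 8, 4                    # illustrative sizes
    W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
    W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))

    def rnn_step(h_prev, x_t):
        # Same weights at every timestep; h_t is all the model carries forward.
        return np.tanh(W_h @ h_prev + W_x @ x_t)

    h = np.zeros(hidden_size)
    for x_t in rng.normal(size=(20, input_size)):     # a 20-"token" sentence
        h = rnn_step(h, x_t)                          # fixed-size memory, every step

Whatever the model needs from token 1 at step 20 has to survive nineteen overwrites of that one vector.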

The bottleneck

Information from early tokens decays exponentially through the recurrence. By step 20, the contribution of token 1 might be 0.01% of the hidden state. This is vanishing gradients in optimization terms, and limited context in capability terms. LSTMs and GRUs ease the bottleneck with gating; transformers eliminate it.
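
To put a rough number on that decay, a back-of-the-envelope sketch: treat each step as scaling the old state's contribution by a single retention factor (a crude stand-in for the recurrent Jacobian). The 0.63 here is an assumed value chosen so the arithmetic lands near the figure quoted above:

    # Crude model: token 1's remaining share after t steps is about decay**t.
    decay = 0.63                  # assumed per-step retention factor
    for t in (1, 5, 10, 20):
        print(f"step {t:2d}: {decay ** t:.4%}")
    # step 20 prints roughly 0.0097% -- about the 0.01% mentioned above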

Try this

  1. Drop decay to 0.3. Information from even a step or two back is mostly gone.
  2. Bump it to 0.95. Information persists, but the hidden state has to encode both the new token and all the old context, and it gets crowded fast. (The sketch after this list puts rough numbers on both settings.)
  3. Compare with Attention Inspector. Same sentence, but every token attends to every other directly. No bottleneck.
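
A hedged way to replay steps 1 and 2 outside the demo: model the decay slider as a per-step retention factor and ask how much of each token survives to the final step. The step count and decay values mirror the list above; everything else is made up for illustration:

    # Share of token i that survives the remaining (n_steps - i) updates.
    n_steps = 20
    for decay in (0.3, 0.95):
        surviving = [decay ** (n_steps - i) for i in range(1, n_steps + 1)]
        print(f"decay={decay}: token 1 keeps {surviving[0]:.2e}, "
              f"token 19 keeps {surviving[-2]:.2f}")
    # decay=0.3 : early tokens are effectively gone after a few steps
    # decay=0.95: they persist, but every token competes for the same vector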

Anchored to 04-language-modeling/rnns-and-lstms and 04-language-modeling/why-transformers.