RNNs and LSTMs as Language Models
Feedforward neural language models had a fixed context window. RNNs gave us, in principle, unlimited context — a hidden state that carries information forward indefinitely.
RNN language model
Unrolled in time:
h_t = tanh(W_h h_{t-1} + W_x e_{w_t} + b)
o_t = W_o h_t
P(w_{t+1} | w_1..w_t) = softmax(o_t)
For each time step:
- Look up the embedding e_{w_t} of the current word.
- Update the hidden state from the previous hidden state and the current input.
- Project to the vocabulary and apply softmax to get the next-token distribution (sketched in code below).
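A minimal PyTorch sketch of this model; the class and method names (RNNLM, step) are illustrative, not from any particular library:

import torch
import torch.nn as nn

class RNNLM(nn.Module):  # illustrative sketch, not a library API
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # e_{w_t} lookup table
        self.W_x = nn.Linear(emb_dim, hidden_dim)                 # input-to-hidden (carries bias b)
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # hidden-to-hidden
        self.W_o = nn.Linear(hidden_dim, vocab_size)              # projection to vocabulary

    def step(self, w_t, h_prev):
        e = self.embed(w_t)                               # look up embedding
        h_t = torch.tanh(self.W_h(h_prev) + self.W_x(e))  # h_t = tanh(W_h h_{t-1} + W_x e_{w_t} + b)
        o_t = self.W_o(h_t)                               # o_t = W_o h_t (logits)
        return h_t, o_t                                   # softmax(o_t) = P(w_{t+1} | w_1..w_t)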
Trained with teacher forcing: at each step, feed the true previous token (not the model’s prediction) as input.
Loss: cross-entropy summed over the sequence.
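A hedged sketch of the training loop, reusing the RNNLM class above; tokens is assumed to be a (batch, T+1) tensor of token ids:

def train_step(model, tokens, optimizer, hidden_dim):
    # Teacher forcing: inputs are tokens[:, :-1], targets are tokens[:, 1:],
    # so every step is conditioned on the TRUE previous token.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    h = torch.zeros(tokens.size(0), hidden_dim)   # initial hidden state
    loss = torch.zeros(())
    for t in range(inputs.size(1)):
        h, logits = model.step(inputs[:, t], h)
        loss = loss + nn.functional.cross_entropy(logits, targets[:, t], reduction="sum")
    optimizer.zero_grad()
    loss.backward()    # backpropagation through time over the whole sequence
    optimizer.step()
    return loss.item()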
What worked
- Variable-length context: in principle, the hidden state remembers the entire history.
- Compact: a fixed number of parameters regardless of sequence length.
- Better than n-gram and feedforward neural LMs for medium-context tasks.
What didn’t
- Vanishing gradients through time: backprop through 100 steps multiplies 100 small derivatives; a per-step factor of 0.9 leaves roughly 0.9^100 ≈ 2.7e-5 of the signal. Long-range dependencies are barely learnable (see the sketch after this list).
- Inability to look back precisely: the hidden state is a fixed-size summary; subtle long-range references get blurred.
- Sequential by construction: can’t parallelize over the time axis. Slow on modern GPUs.
- Hard to train: exploding gradients, instability, finicky learning rates.
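A tiny NumPy illustration of the gradient arithmetic behind the first and last points; the 0.9 and 1.1 factors are illustrative stand-ins for real per-step derivatives, not measured values:

import numpy as np

# Backprop through time multiplies one per-step derivative per timestep.
print(np.full(100, 0.9).prod())   # ~2.7e-5: the long-range signal vanishes
print(np.full(100, 1.1).prod())   # ~1.4e4: slightly-too-large factors explode instead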
LSTMs
Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997 — used at scale c. 2013–2017) introduced a cell state with multiplicative gating.
Architecture
f = σ(W_f [x_t, h_{t-1}]) # forget gate
i = σ(W_i [x_t, h_{t-1}]) # input gate
g = tanh(W_g [x_t, h_{t-1}]) # candidate values
c_t = f ⊙ c_{t-1} + i ⊙ g # cell state update
o = σ(W_o [x_t, h_{t-1}]) # output gate
h_t = o ⊙ tanh(c_t) # hidden state
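A NumPy sketch of one step, directly transcribing the equations above (biases are omitted there, so they are omitted here too; every weight matrix acts on the concatenated [x_t, h_{t-1}]):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_g, W_o):
    xh = np.concatenate([x_t, h_prev])   # [x_t, h_{t-1}]
    f = sigmoid(W_f @ xh)                # forget gate
    i = sigmoid(W_i @ xh)                # input gate
    g = np.tanh(W_g @ xh)                # candidate values
    c_t = f * c_prev + i * g             # cell state update (⊙ = elementwise *)
    o = sigmoid(W_o @ xh)                # output gate
    h_t = o * np.tanh(c_t)               # hidden state
    return h_t, c_t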
Why it works
The cell state’s update is a combination of:
- A scaled-down copy of the previous cell state (controlled by the forget gate)
- New information added (controlled by the input gate)
If the forget gate is near 1 and the input gate is near 0, information flows through unchanged across many time steps, and the gradients flow with it: along the direct cell-state path, ∂c_t/∂c_{t-1} = f elementwise, so a forget gate near 1 passes the gradient through nearly unattenuated instead of squashing it through a tanh at every step.
This was a major engineering win and made LSTMs the dominant sequence model from ~2014 to ~2017.
GRUs
Gated Recurrent Units — a simpler variant with two gates instead of three. Often comparable quality, fewer parameters, slightly faster.
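For reference, the standard GRU equations (Cho et al., 2014), in the same notation as the LSTM above:

z = σ(W_z [x_t, h_{t-1}])           # update gate
r = σ(W_r [x_t, h_{t-1}])           # reset gate
h̃ = tanh(W_h [x_t, r ⊙ h_{t-1}])    # candidate hidden state
h_t = (1 − z) ⊙ h_{t-1} + z ⊙ h̃     # interpolate between old and new

The update gate z plays the combined role of the LSTM's forget and input gates, and the hidden state doubles as the cell state.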
Bidirectional LSTMs
Run two LSTMs (forward, backward); concatenate their hidden states. Useful for sequence labeling, QA — anywhere you have the full input at training time. Not usable for autoregressive generation.
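In PyTorch this is a one-flag change; the sizes below are illustrative:

import torch.nn as nn

# Hidden states from the two directions are concatenated, so downstream
# layers see a vector of size 2 * hidden_size at each position.
bilstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True,
                 bidirectional=True)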
Seq2seq
For generation tasks (translation, summarization), you stack an encoder LSTM (reads input → produces context vector) and a decoder LSTM (starts from context, generates output).
INPUT → encoder → final hidden state → decoder → OUTPUT
The landmark sequence-to-sequence paper (Sutskever et al., 2014) used this design. It worked for sentences but struggled with paragraphs, because all input information had to squeeze through one final hidden state.
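A compressed sketch of the wiring, again in PyTorch with illustrative names and sizes:

import torch.nn as nn

class Seq2Seq(nn.Module):  # illustrative sketch
    def __init__(self, src_vocab, tgt_vocab, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt):
        # The encoder's final (h, c) is the single bottleneck the decoder sees.
        _, (h, c) = self.encoder(self.src_embed(src))
        dec_out, _ = self.decoder(self.tgt_embed(tgt), (h, c))  # teacher forcing
        return self.out(dec_out)   # next-token logits at every target position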
Attention (Bahdanau, 2015)
Instead of passing only the encoder’s final state, let the decoder attend to all encoder hidden states at each step:
α_{t,i} = softmax_i(score(decoder_state_t, encoder_state_i))   # normalized over encoder positions i
context_t = Σ_i α_{t,i} · encoder_state_i
The decoder, at each step, computes a weighted average of encoder states based on relevance.
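A minimal sketch of one decoder step's attention. Bahdanau's score function is a small learned network; a plain dot product stands in for it here purely to keep the example short:

import torch

def attend(dec_state, enc_states):
    # dec_state: (hidden,); enc_states: (src_len, hidden)
    scores = enc_states @ dec_state        # dot-product stand-in for score(·, ·)
    alpha = torch.softmax(scores, dim=0)   # α_{t,i}, normalized over encoder positions
    context = alpha @ enc_states           # Σ_i α_{t,i} · encoder_state_i
    return context, alpha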
This was the seed of the transformer. Within two years, the natural follow-up question ("what if we get rid of recurrence and use only attention?") produced "Attention Is All You Need" (2017).
Why we mostly moved past RNNs
- Sequential bottleneck. Training a transformer = matrix multiplications over the whole sequence. Training an RNN = one timestep at a time.
- Long-range dependencies. Attention reaches anywhere directly; RNNs hop step-by-step.
- Scaling. Transformers absorb scale (data, parameters, compute). RNNs don’t scale as gracefully.
- Stability and hyperparameter robustness. Transformers train more reliably.
RNNs in 2026
Not dead, but specialized:
- Mamba and state-space models — recurrent-style architectures with parallelizable training, competitive with transformers on some tasks.
- xLSTM — modernized LSTM with parallel training tricks.
- RWKV — RNN-style inference with transformer-style training. Fast inference, decent quality.
- Linear attention variants — somewhere between transformers and RNNs.
- Speech recognition — some pipelines still mix RNNs/CTC.
For most language tasks, transformers still win. But the recurrent thread is alive and worth watching.
Exercises
- Char-level RNN: train a 1-layer LSTM on Tiny Shakespeare. Generate text. Notice short-range coherence and long-range incoherence.
- Compare RNN to transformer: train a 2-layer transformer on the same task. Compare quality with similar parameter count.
- Truncated BPTT: explain why RNN training uses truncated backpropagation through time — and why transformers don’t need this.
- Inference latency: measure tokens/sec for an RNN vs a transformer at sequence length 1024.