Stage 04 — Language Modeling
A language model assigns probabilities to sequences of tokens. That’s it. Everything else — chat, code, agents — is a wrapper around this single primitive.
This stage walks the historical arc: count-based n-gram models → neural language models → recurrent models → why transformers won.
Prerequisites
- Stage 02 (cross-entropy, evaluation)
- Stage 03 (RNNs, backprop)
Learning ladder
- n-gram models — count, smooth, predict (see the bigram sketch after this list)
- Neural language models — Bengio 2003, embeddings as a side effect
- RNNs & LSTMs — recurrent language models
- Why transformers — what RNNs couldn’t do
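To ground the first rung, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing; the toy corpus is invented for illustration.

```python
# Minimal sketch: a bigram language model with add-one (Laplace) smoothing.
# The toy corpus below is invented for illustration.
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))
V = len(vocab)

# Count: unigram and bigram frequencies.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# Smooth + predict: P(next | prev) with add-one smoothing so unseen pairs
# still get nonzero probability.
def prob(next_word, prev_word):
    return (bigrams[(prev_word, next_word)] + 1) / (unigrams[prev_word] + V)

print(prob("cat", "the"))  # seen bigram: relatively high
print(prob("rug", "cat"))  # unseen bigram: small but nonzero
```

Add-one smoothing is the simplest choice; in practice, n-gram models use stronger schemes such as Kneser-Ney, but the count/smooth/predict shape stays the same.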
MVU
You can:
- Define a language model formally as P(token_t | tokens_{<t}).
- Explain perplexity and compute it on a small example (a worked sketch follows this list).
- Describe the vanishing gradient problem in RNNs and why it limits long-range dependencies (demonstrated in the second sketch after this list).
- Argue why parallelism over the sequence dimension was the unlock that enabled transformers to scale.
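A worked perplexity example in Python. The per-token probabilities are invented for illustration; the point is the formula, not the numbers.

```python
# Worked example: perplexity is exp(average negative log-likelihood per token).
# The per-token probabilities below are invented for illustration.
import math

# Model-assigned probability of each token in a 4-token test sequence,
# each conditioned on its prefix: P(token_t | tokens_{<t}).
token_probs = [0.2, 0.5, 0.1, 0.4]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(perplexity)  # about 3.98
```

A perplexity of k is what you would get from a uniform choice over k equally likely tokens, which is why it is often read as the model's "effective branching factor".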
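A quick way to see the vanishing gradient problem, assuming PyTorch: backprop a loss that depends only on the last time step of a vanilla tanh RNN and inspect the gradient reaching each input position. Early positions typically receive a vanishingly small signal.

```python
# Demonstration sketch (assuming PyTorch): gradients shrink as they flow
# backward through many recurrent steps.
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=8)      # vanilla tanh RNN
x = torch.randn(50, 1, 8, requires_grad=True)  # 50 time steps, batch of 1

out, _ = rnn(x)
out[-1].sum().backward()                       # loss uses only the final step

grad_per_step = x.grad.abs().mean(dim=(1, 2))
print(grad_per_step[0].item())   # gradient at step 0: typically tiny
print(grad_per_step[-1].item())  # gradient at the last step: much larger
```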
Exercise
Train a character-level RNN on a ~1 MB corpus (e.g. tiny Shakespeare). Generate text (a starter sketch is provided at the end of this section). Notice:
- It learns spelling and basic grammar.
- It forgets context past ~50 characters.
- It can’t reliably maintain a topic across a paragraph.
This failure is the motivation for everything that follows.
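A starter sketch for the exercise, assuming PyTorch. The file name tiny_shakespeare.txt, the hyperparameters, and the step count are placeholders to adapt, not prescribed values.

```python
# Starter sketch: character-level LSTM language model (assuming PyTorch).
# "tiny_shakespeare.txt" and all hyperparameters below are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = open("tiny_shakespeare.txt", encoding="utf-8").read()   # ~1 MB of text
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.head(h), state

model = CharRNN(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
seq_len, batch_size = 128, 32

for step in range(2000):
    # Sample random contiguous windows; target is the input shifted by one char.
    ix = torch.randint(0, len(data) - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([data[i:i + seq_len] for i in ix])
    y = torch.stack([data[i + 1:i + seq_len + 1] for i in ix])
    logits, _ = model(x)
    loss = F.cross_entropy(logits.reshape(-1, len(chars)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generate: feed the model its own samples one character at a time.
with torch.no_grad():
    idx = torch.tensor([[0]])   # start from an arbitrary character
    state, out = None, []
    for _ in range(500):
        logits, state = model(idx, state)
        probs = F.softmax(logits[0, -1], dim=-1)
        idx = torch.multinomial(probs, num_samples=1).view(1, 1)
        out.append(chars[idx.item()])
print("".join(out))
```

Generating samples at several points during training makes the failure modes above easy to spot: spelling and local grammar improve quickly, while topic and long-range consistency do not.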