Stage 04 — Language Modeling
A language model assigns probabilities to sequences of tokens. That’s it. Everything else — chat, code, agents — is a wrapper around this single primitive.
This stage walks the historical arc: count-based n-gram models → neural language models → recurrent models → why transformers won.
Prerequisites
- Stage 02 (cross-entropy, evaluation)
- Stage 03 (RNNs, backprop)
Learning ladder
- n-gram models — count, smooth, predict (see the bigram sketch after this list)
- Neural language models — Bengio 2003, embeddings as a side effect
- RNNs & LSTMs — recurrent language models
- Why transformers — what RNNs couldn’t do
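To ground the first rung, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing; the toy corpus is invented for illustration.

```python
# Minimal sketch: a bigram language model with add-one (Laplace) smoothing.
# The toy corpus below is invented for illustration.
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))
V = len(vocab)

# Count: unigram and bigram frequencies.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# Smooth + predict: P(next | prev) with add-one smoothing so unseen pairs
# still get nonzero probability.
def prob(next_word, prev_word):
    return (bigrams[(prev_word, next_word)] + 1) / (unigrams[prev_word] + V)

print(prob("cat", "the"))  # seen bigram: relatively high
print(prob("rug", "cat"))  # unseen bigram: small but nonzero
```

Add-one smoothing is the simplest choice; in practice, n-gram models use stronger schemes such as Kneser-Ney, but the count/smooth/predict shape stays the same.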
MVU
You can:
- Define a language model formally as P(token_t | tokens_{<t}).
- Explain perplexity and compute it on a small example (a worked sketch follows this list).
- Describe the vanishing gradient problem in RNNs and why it limits long-range dependencies (demonstrated in the second sketch after this list).
- Argue why parallelism over the sequence dimension was the unlock that enabled transformers to scale.
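A worked perplexity example in Python. The per-token probabilities are invented for illustration; the point is the formula, not the numbers.

```python
# Worked example: perplexity is exp(average negative log-likelihood per token).
# The per-token probabilities below are invented for illustration.
import math

# Model-assigned probability of each token in a 4-token test sequence,
# each conditioned on its prefix: P(token_t | tokens_{<t}).
token_probs = [0.2, 0.5, 0.1, 0.4]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(perplexity)  # about 3.98
```

A perplexity of k is what you would get from a uniform choice over k equally likely tokens, which is why it is often read as the model's "effective branching factor".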
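A quick way to see the vanishing gradient problem, assuming PyTorch: backprop a loss that depends only on the last time step of a vanilla tanh RNN and inspect the gradient reaching each input position. Early positions typically receive a vanishingly small signal.

```python
# Demonstration sketch (assuming PyTorch): gradients shrink as they flow
# backward through many recurrent steps.
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=8)      # vanilla tanh RNN
x = torch.randn(50, 1, 8, requires_grad=True)  # 50 time steps, batch of 1

out, _ = rnn(x)
out[-1].sum().backward()                       # loss uses only the final step

grad_per_step = x.grad.abs().mean(dim=(1, 2))
print(grad_per_step[0].item())   # gradient at step 0: typically tiny
print(grad_per_step[-1].item())  # gradient at the last step: much larger
```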
Exercise
Train a character-level RNN on a ~1 MB corpus (e.g. tiny Shakespeare). Generate text (a starter sketch is provided at the end of this section). Notice:
- It learns spelling and basic grammar.
- It forgets context past ~50 characters.
- It can’t reliably maintain a topic across a paragraph.
This failure is the motivation for everything that follows.
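A starter sketch for the exercise, assuming PyTorch. The file name tiny_shakespeare.txt, the hyperparameters, and the step count are placeholders to adapt, not prescribed values.

```python
# Starter sketch: character-level LSTM language model (assuming PyTorch).
# "tiny_shakespeare.txt" and all hyperparameters below are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = open("tiny_shakespeare.txt", encoding="utf-8").read()   # ~1 MB of text
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.head(h), state

model = CharRNN(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
seq_len, batch_size = 128, 32

for step in range(2000):
    # Sample random contiguous windows; target is the input shifted by one char.
    ix = torch.randint(0, len(data) - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([data[i:i + seq_len] for i in ix])
    y = torch.stack([data[i + 1:i + seq_len + 1] for i in ix])
    logits, _ = model(x)
    loss = F.cross_entropy(logits.reshape(-1, len(chars)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generate: feed the model its own samples one character at a time.
with torch.no_grad():
    idx = torch.tensor([[0]])   # start from an arbitrary character
    state, out = None, []
    for _ in range(500):
        logits, state = model(idx, state)
        probs = F.softmax(logits[0, -1], dim=-1)
        idx = torch.multinomial(probs, num_samples=1).view(1, 1)
        out.append(chars[idx.item()])
print("".join(out))
```

Generating samples at several points during training makes the failure modes above easy to spot: spelling and local grammar improve quickly, while topic and long-range consistency do not.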