Stage 04 — Language Modeling

A language model assigns probabilities to sequences of tokens. That’s it. Everything else — chat, code, agents — is a wrapper around this single primitive.
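
Concretely, the probability of a whole sequence factorizes by the chain rule into next-token conditionals. A toy illustration (the numbers below are made up, not from any real model):

```python
import math

# P(token_t | tokens_{<t}) for the sequence ["the", "cat", "sat"],
# i.e. P("the"), P("cat" | "the"), P("sat" | "the cat")
conditionals = [0.1, 0.03, 0.2]

# chain rule: P(sequence) is the product of the conditionals
seq_prob = math.prod(conditionals)
print(seq_prob)  # 0.0006
```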

This stage walks the historical arc: count-based n-gram models → neural language models → recurrent models → why transformers won.

Prerequisites

  • Stage 02 (cross-entropy, evaluation)
  • Stage 03 (RNNs, backprop)

Learning ladder

  1. n-gram models — count, smooth, predict (a bigram sketch follows this list)
  2. Neural language models — Bengio et al. 2003, embeddings as a side effect
  3. RNNs & LSTMs — recurrent language models
  4. Why transformers — what RNNs couldn’t do
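
For step 1, "count, smooth, predict" fits in a few lines. A minimal bigram sketch with add-one (Laplace) smoothing; the toy corpus is a stand-in for real training text:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))

# count: tally bigram occurrences
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# smooth + predict: add-one smoothing gives unseen pairs nonzero probability
def prob(nxt, prev):
    return (counts[prev][nxt] + 1) / (sum(counts[prev].values()) + len(vocab))

print(prob("cat", "the"))  # seen bigram: relatively high
print(prob("rug", "cat"))  # unseen bigram: still gets probability mass
```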

MVU

You can:

  • Define a language model formally, via the factorization P(tokens) = ∏_t P(token_t | tokens_{<t}).
  • Explain perplexity and compute it on a small example (see the first sketch after this list).
  • Describe the vanishing gradient problem in RNNs and why it limits long-range dependencies (see the second sketch after this list).
  • Argue why parallelism over the sequence dimension was the unlock that enabled transformers to scale.
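
Two of these are easiest to internalize by computing them. First, perplexity is just exp of the average negative log-likelihood; the per-token probabilities below are hand-picked for illustration:

```python
import math

# P(token_t | tokens_{<t}) assigned by some model to a 4-token sequence
token_probs = [0.2, 0.5, 0.1, 0.4]

nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)
print(perplexity)  # ~4.0: on average, as uncertain as a uniform 4-way choice
```

Second, vanishing gradients: backprop through time multiplies one Jacobian per step, so the gradient reaching early timesteps shrinks geometrically. A minimal demonstration, assuming PyTorch (the sizes and scales are arbitrary):

```python
import torch

torch.manual_seed(0)
T, d = 50, 64
W = 0.7 * torch.randn(d, d) / d**0.5  # scaled so repeated steps contract the gradient
x = torch.randn(T, d)

h0 = torch.zeros(d, requires_grad=True)
h = h0
for t in range(T):
    h = torch.tanh(h @ W + x[t])  # plain tanh RNN step

h.sum().backward()
print(h0.grad.norm())  # tiny: the loss barely "sees" the initial state
```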

Exercise

Train a character-level RNN on a ~1 MB corpus (e.g. tiny Shakespeare). Generate text (a starter sketch follows the list). Notice:

  • It learns spelling and basic grammar.
  • It forgets context past ~50 characters.
  • It can’t reliably maintain a topic across a paragraph.
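
A starter sketch for this exercise, assuming PyTorch; the file path, hyperparameters, and seed character are placeholders, and a real run wants a GPU and more steps:

```python
import torch
import torch.nn as nn

text = open("input.txt").read()  # placeholder: ~1 MB of plain text
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        h, state = self.rnn(self.embed(x), state)
        return self.head(h), state

model = CharRNN(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
seq_len, batch_size = 128, 32

for step in range(2000):  # train: predict the next character at every position
    ix = torch.randint(len(data) - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([data[i:i + seq_len] for i in ix])
    y = torch.stack([data[i + 1:i + seq_len + 1] for i in ix])  # inputs shifted by one
    logits, _ = model(x)
    loss = nn.functional.cross_entropy(logits.reshape(-1, len(chars)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# generate: sample one character at a time, carrying the hidden state
idx = torch.tensor([[stoi[text[0]]]])  # seed with the corpus's first character
state, out = None, []
for _ in range(500):
    logits, state = model(idx, state)
    probs = logits[0, -1].softmax(dim=-1)
    idx = torch.multinomial(probs, 1).unsqueeze(0)
    out.append(chars[idx.item()])
print("".join(out))
```

Even with longer training, the three failure modes above show up quickly in the samples.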

This failure is the motivation for everything that follows.

See also