
Language Modeling

The historical arc: n-grams → neural LMs → RNNs → transformers. The failure modes of RNNs are the motivation for everything in Stage 06. Train a character-level RNN once and feel why context decays past 50 chars.

4 articles · 14 min to read · 3 demos · 3 books
if you only do one thing

Three architectures, one problem. The shift from RNNs to transformers is the most important transition in the modern field; understand why it happened.

Articles in this stage

  1. 01 n-gram Models
  2. 02 Neural Language Models
  3. 03 RNNs and LSTMs as Language Models
  4. 04 Why Transformers Won

Stage 04 — Language Modeling

A language model assigns probabilities to sequences of tokens. That’s it. Everything else — chat, code, agents — is a wrapper around this single primitive.

This stage walks the historical arc: count-based n-gram models → neural language models → recurrent models → why transformers won.
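
Concretely, the probability of a whole sequence factorizes by the chain rule into next-token conditionals, and every model in this stage is just a different way of estimating those conditionals:

  P(token_1, ..., token_T) = ∏_{t=1..T} P(token_t | tokens_{<t})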

Prerequisites

  • Stage 02 (cross-entropy, evaluation)
  • Stage 03 (RNNs, backprop)

Learning ladder

  1. n-gram models — count, smooth, predict (sketched in code after this list)
  2. Neural language models — Bengio 2003, embeddings as a side effect
  3. RNNs & LSTMs — recurrent language models
  4. Why transformers — what RNNs couldn’t do
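
For the first rung, here is a minimal sketch of the count, smooth, predict recipe as a bigram model with add-one (Laplace) smoothing. The toy corpus and function name are illustrative assumptions, not a reference implementation.

```python
from collections import Counter

# Toy corpus; a real experiment would use far more text (illustrative assumption).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count: unigram and bigram frequencies.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_bigram(w_next, w_prev, k=1.0):
    """Smooth: add-k (Laplace) smoothed estimate of P(w_next | w_prev)."""
    return (bigrams[(w_prev, w_next)] + k) / (unigrams[w_prev] + k * len(vocab))

# Predict: rank candidate next words after "the".
print(sorted(vocab, key=lambda w: -p_bigram(w, "the"))[:3])
```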

MVU

You can:

  • Define a language model formally as P(token_t | tokens_{<t}).
  • Explain perplexity and compute it on a small example (worked through in code after this list).
  • Describe the vanishing gradient problem in RNNs and why it limits long-range dependencies.
  • Argue why parallelism over the sequence dimension was the unlock that enabled transformers to scale.
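
For the perplexity item, a small worked example, assuming made-up per-token probabilities for a four-token sequence: perplexity is the exponential of the average negative log-likelihood per token.

```python
import math

# Per-token probabilities a model assigned to a 4-token sequence
# (illustrative numbers, not from a real model).
probs = [0.2, 0.5, 0.1, 0.4]

# Average negative log-likelihood per token (in nats), then exponentiate.
nll = -sum(math.log(p) for p in probs) / len(probs)
perplexity = math.exp(nll)

print(f"avg NLL = {nll:.3f} nats, perplexity = {perplexity:.2f}")
# Sanity check: a uniform model over a vocabulary of size V has perplexity exactly V.
```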

Exercise

Train a character-level RNN on a 1MB corpus (e.g. tiny Shakespeare). Generate text. Notice:

  • It learns spelling and basic grammar.
  • It forgets context past ~50 characters.
  • It can’t reliably maintain a topic across a paragraph.

This failure is the motivation for everything that follows.
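
A minimal PyTorch sketch of the exercise, assuming a plain-text corpus at data/tinyshakespeare.txt; the path, model size, and training schedule are illustrative choices, not the canonical setup. Train it, sample from it, and check the three failure modes above against the output.

```python
import torch
import torch.nn as nn

# Corpus path and all hyperparameters below are illustrative assumptions.
text = open("data/tinyshakespeare.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class CharLM(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        out, state = self.rnn(self.embed(x), state)
        return self.head(out), state

model = CharLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
loss_fn = nn.CrossEntropyLoss()
seq_len, batch_size = 128, 32

for step in range(2000):
    # Random training windows: the target is the input shifted by one character.
    starts = torch.randint(0, len(data) - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([data[s:s + seq_len] for s in starts])
    y = torch.stack([data[s + 1:s + seq_len + 1] for s in starts])
    logits, _ = model(x)
    loss = loss_fn(logits.reshape(-1, len(chars)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 200 == 0:
        print(step, round(loss.item(), 3))

# Sample 400 characters, seeded with a newline (assumed present in the corpus),
# and watch coherence fade after a few dozen characters.
idx, state, out = torch.tensor([[stoi["\n"]]]), None, []
for _ in range(400):
    logits, state = model(idx, state)
    probs = torch.softmax(logits[0, -1], dim=-1)
    idx = torch.multinomial(probs, 1).unsqueeze(0)
    out.append(chars[idx.item()])
print("".join(out))
```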

Further reading

Books move slower than papers in this field — treat these as foundations, not replacements for the latest research. Real authors, real publishers, real editions. Free badges mark books with author-authorized full text online.

  1. ★ start here
    free

    Speech and Language Processing

    Daniel Jurafsky, James H. Martin

    Stanford, 3rd ed. draft, 2024

    The canonical NLP textbook. Draft chapters free online.

  2. Foundations of Statistical Natural Language Processing

    Christopher D. Manning, Hinrich Schütze

    MIT Press, 1999

    Older but the statistical-NLP intuition is timeless.

  3. free

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, Aaron Courville

    MIT Press, 2016

    The foundational reference. Free online.