
Predicting the next word, the old way

Before transformers, before RNNs, before neural language models at all — we predicted the next word by counting. Pick a corpus, pick an n, watch a 1980s-era language model in action.

How it works

For an n-gram model, the context is just the previous n-1 words, and P(next word | context) is count(context + next word) / count(context). Build a table of counts, do a lookup, done. No gradients, no embeddings, no neural anything.
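As a concrete illustration, here is a minimal sketch of that count-and-divide recipe in Python. The helper names are made up for this example; this is not the demo's actual code.

```python
from collections import Counter, defaultdict

def build_ngram_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i : i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def next_word_probs(table, context):
    """P(next | context) = count(context + next) / count(context); empty dict if unseen."""
    counts = table.get(tuple(context), Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Toy corpus: 13 tokens, so the whole table can be checked by hand.
tokens = "the cat sat on the mat and the cat sat on the rug".split()
bigrams = build_ngram_table(tokens, n=2)
print(next_word_probs(bigrams, ["the"]))  # {'cat': 0.5, 'mat': 0.25, 'rug': 0.25}
```

The lookup is the entire model: there is no training loop, only the table.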

Why it lost

  • Sparsity — most contexts never appear. n-grams have to back off or smooth (a minimal smoothing sketch follows this list), and quality degrades fast as n grows.
  • No generalization — "the cat sat on" and "a dog sat on" are unrelated. A neural model would share representations.
  • Memory — a 4-gram table over web-scale data runs to terabytes; transformers compress that into ~1 GB of weights.
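For reference, one classic answer to sparsity is add-one (Laplace) smoothing: pretend every (context, word) pair was seen once more than it actually was. A minimal sketch, continuing the counting example above (the function name is hypothetical):

```python
def laplace_prob(table, vocab, context, word):
    """Add-one smoothing: every in-vocabulary continuation gets a nonzero probability."""
    counts = table.get(tuple(context), Counter())
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

vocab = set(tokens)  # reuses tokens and bigrams from the sketch above
# "and" never follows "the" in the toy corpus, yet P = 1/11 rather than 0.
# A truly out-of-vocabulary word would also need an <unk> entry in vocab.
print(laplace_prob(bigrams, vocab, ["the"], "and"))
```

Back-off takes the other route: when the long context has zero count, fall back to a shorter context and rescale.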

Try this — predict before you click

  1. Switch to unigram (n=1) on Shakespeare. Generate. Predict: output collapses to "the the the the" or "and the and the" within 5 tokens — no context = no coherence. The most-frequent words crowd out everything (the generation sketch after this list makes this concrete).
  2. Switch to 4-gram on Shakespeare. Predict: for any seen prefix, the model regurgitates the source text verbatim, because nearly every 3-word context appeared only once during counting. High n on a small corpus = memorization, not generalization.
  3. Type the context "the dragon" on any corpus. Predict: the prediction list is empty — that bigram never occurred in training, so the count-based denominator is zero. The model has literally nothing to say. Modern neural LMs would smooth over this gracefully; n-grams just stop.
  4. Switch to bigram on the JS code corpus, type "function". Predict: the top continuations are "(" or a function name. The probabilities are the same count-ratios you'd get if you grepped the corpus by hand — no learning, just frequency.
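To make a couple of those predictions concrete, here is a tiny generation loop over the same kind of count table, continuing the earlier sketches. The names are hypothetical, and it assumes greedy (most-likely) decoding, which may not match how the demo actually samples.

```python
def generate_greedy(table, context, steps):
    """Repeatedly append the most frequent continuation; stop if the context is unseen."""
    out = list(context)
    ctx_len = len(context)
    for _ in range(steps):
        key = tuple(out[-ctx_len:]) if ctx_len else ()
        counts = table.get(key)
        if not counts:
            break  # zero count in the denominator: the model has nothing to say (step 3)
        out.append(counts.most_common(1)[0][0])
    return out

unigrams = build_ngram_table(tokens, n=1)      # with n=1 the context is the empty tuple
print(generate_greedy(unigrams, [], steps=5))  # ['the', 'the', 'the', 'the', 'the']
print(next_word_probs(bigrams, ["dragon"]))    # {} : unseen context, empty prediction list
```

The first print shows the unigram collapse from step 1 (under greedy decoding, the single most frequent token wins every time); the second reproduces the empty prediction list from step 3.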

Anchored to 04-language-modeling/n-gram-models and 04-language-modeling/why-transformers.