
Predicting the next word, the old way

Before transformers, before RNNs, before neural language models at all — we predicted the next word by counting. Pick a corpus, pick an n, watch a 1980s-era language model in action.

How it works

For an n-gram model, the context is just the previous n-1 words, and P(next word | context) is count(context + next word) / count(context). Build a table of counts, do a lookup, done. No gradients, no embeddings, no neural anything.
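As a concrete illustration, here is a minimal sketch of that count-and-divide recipe in Python. The helper names are made up for this example; this is not the demo's actual code.

```python
from collections import Counter, defaultdict

def build_ngram_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i : i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def next_word_probs(table, context):
    """P(next | context) = count(context + next) / count(context); empty dict if unseen."""
    counts = table.get(tuple(context), Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Toy corpus: 13 tokens, so the whole table can be checked by hand.
tokens = "the cat sat on the mat and the cat sat on the rug".split()
bigrams = build_ngram_table(tokens, n=2)
print(next_word_probs(bigrams, ["the"]))  # {'cat': 0.5, 'mat': 0.25, 'rug': 0.25}
```

The lookup is the entire model: there is no training loop, only the table.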

Why it lost

  • Sparsity — most contexts never appear. n-grams have to back off or smooth (a minimal smoothing sketch follows this list), and quality degrades fast as n grows.
  • No generalization — "the cat sat on" and "a dog sat on" are unrelated. A neural model would share representations.
  • Memory — a 4-gram table over web-scale data runs to terabytes; transformers compress that into ~1 GB of weights.
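For reference, one classic answer to sparsity is add-one (Laplace) smoothing: pretend every (context, word) pair was seen once more than it actually was. A minimal sketch, continuing the counting example above (the function name is hypothetical):

```python
def laplace_prob(table, vocab, context, word):
    """Add-one smoothing: every in-vocabulary continuation gets a nonzero probability."""
    counts = table.get(tuple(context), Counter())
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

vocab = set(tokens)  # reuses tokens and bigrams from the sketch above
# "and" never follows "the" in the toy corpus, yet P = 1/11 rather than 0.
# A truly out-of-vocabulary word would also need an <unk> entry in vocab.
print(laplace_prob(bigrams, vocab, ["the"], "and"))
```

Back-off takes the other route: when the long context has zero count, fall back to a shorter context and rescale.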

Try this — predict before you click

  1. Switch to unigram (n=1) on Shakespeare. Generate. Predict: output collapses to "the the the the" or "and the and the" within 5 tokens — no context = no coherence. The most-frequent words crowd out everything (the generation sketch after this list makes this concrete).
  2. Switch to 4-gram on Shakespeare. Predict: for any seen prefix, the model regurgitates the source text verbatim, because nearly every 3-word context appeared only once during counting. High n on a small corpus = memorization, not generalization.
  3. Type the context "the dragon" on any corpus. Predict: the prediction list is empty — that bigram never occurred in training, so the count-based denominator is zero. The model has literally nothing to say. Modern neural LMs would smooth over this gracefully; n-grams just stop.
  4. Switch to bigram on the JS code corpus, type "function". Predict: the top continuations are "(" or a function name. The probabilities are the same count-ratios you'd get if you grepped the corpus by hand — no learning, just frequency.
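To make a couple of those predictions concrete, here is a tiny generation loop over the same kind of count table, continuing the earlier sketches. The names are hypothetical, and it assumes greedy (most-likely) decoding, which may not match how the demo actually samples.

```python
def generate_greedy(table, context, steps):
    """Repeatedly append the most frequent continuation; stop if the context is unseen."""
    out = list(context)
    ctx_len = len(context)
    for _ in range(steps):
        key = tuple(out[-ctx_len:]) if ctx_len else ()
        counts = table.get(key)
        if not counts:
            break  # zero count in the denominator: the model has nothing to say (step 3)
        out.append(counts.most_common(1)[0][0])
    return out

unigrams = build_ngram_table(tokens, n=1)      # with n=1 the context is the empty tuple
print(generate_greedy(unigrams, [], steps=5))  # ['the', 'the', 'the', 'the', 'the']
print(next_word_probs(bigrams, ["dragon"]))    # {} : unseen context, empty prediction list
```

The first print shows the unigram collapse from step 1 (under greedy decoding, the single most frequent token wins every time); the second reproduces the empty prediction list from step 3.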

Anchored to 04-language-modeling/n-gram-models and 04-language-modeling/why-transformers.