n-gram Models
The first language models. Count-based, statistical, no neural networks. Replaced almost entirely now, but worth understanding because the probabilistic framing they introduced is still the core of modern LLMs.
The chain rule of probability
Any probability of a sequence factors:
P(w₁, w₂, ..., wₙ) = P(w₁) · P(w₂ | w₁) · P(w₃ | w₁, w₂) · ... · P(wₙ | w₁..wₙ₋₁)
A language model is a way to compute or estimate P(wₜ | w₁..wₜ₋₁) — the probability of the next word given everything before.
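For a three-token sequence, for instance:
P(the cat sat) = P(the) · P(cat | the) · P(sat | the, cat)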
The Markov assumption
Conditioning on the entire history is intractable. n-gram models assume only the previous n−1 tokens matter:
P(wₜ | w₁..wₜ₋₁) ≈ P(wₜ | wₜ₋ₙ₊₁..wₜ₋₁)
- Unigram: P(wₜ). No context at all.
- Bigram: P(wₜ | wₜ₋₁). Conditions on the previous word.
- Trigram: P(wₜ | wₜ₋₂, wₜ₋₁). Conditions on the previous two words.
- 5-gram: typical for production speech recognition c. 2010.
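A minimal sketch of what the assumption means operationally (the helper name is made up for illustration):

```python
def markov_context(history, n):
    """Under an n-gram Markov assumption, only the last n-1 tokens of the history matter."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

markov_context(["the", "cat", "sat", "on", "the"], 2)  # ('the',)
markov_context(["the", "cat", "sat", "on", "the"], 3)  # ('on', 'the')
```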
Estimation by counting
Maximum likelihood estimate from a corpus:
P(wₜ | wₜ₋₁) = count(wₜ₋₁, wₜ) / count(wₜ₋₁)
For a corpus consisting of “the cat sat” and “the cat ran”:
count(the, cat) = 2
count(the) = 2
P(cat | the) = 1.0
P(sat | cat) = 0.5
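A minimal sketch of that estimation on the two-sentence toy corpus (structure is illustrative; a real model would also handle sentence boundaries and unknown words):

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        bigram_counts[(prev, curr)] += 1
        context_counts[prev] += 1

def p_mle(curr, prev):
    """Maximum likelihood estimate: count(prev, curr) / count(prev)."""
    return bigram_counts[(prev, curr)] / context_counts[prev]

print(p_mle("cat", "the"))  # 1.0
print(p_mle("sat", "cat"))  # 0.5
```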
The sparsity problem
For a vocabulary of size V, there are V² possible bigrams, V³ trigrams, Vⁿ n-grams. Even on huge corpora, most n-grams are unseen. Their MLE probability is 0, which:
- Assigns probability 0 to any sequence containing them — a deal-breaker for evaluation (perplexity → ∞).
- Overstates what we know: an unseen n-gram reflects a lack of evidence in the training data, not actual impossibility.
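To get a feel for the scale, here is the arithmetic for an assumed vocabulary of 50,000 word types (the exact figure is illustrative):

```python
V = 50_000        # assumed vocabulary size, purely illustrative
print(V ** 2)     # 2,500,000,000 possible bigrams
print(V ** 3)     # 125,000,000,000,000 possible trigrams
print(V ** 5)     # ~3.1 x 10^23 possible 5-grams, far more than any corpus contains
```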
Smoothing
Redistribute probability mass to unseen n-grams.
Add-one (Laplace) smoothing
P(wₜ | wₜ₋₁) = (count(wₜ₋₁, wₜ) + 1) / (count(wₜ₋₁) + V)
Simple. Generally too aggressive.
Add-k smoothing
Same idea, but with a fractional pseudo-count k < 1 tuned on held-out data. Less blunt than add-one; see the sketch below.
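A minimal sketch covering both variants, built on the counts from the earlier example (names are illustrative; k = 1 gives add-one):

```python
def p_add_k(curr, prev, bigram_counts, context_counts, vocab_size, k=1.0):
    """Add-k estimate: (count(prev, curr) + k) / (count(prev) + k * V).
    k = 1 is Laplace (add-one); k < 1 is the gentler add-k variant."""
    return (bigram_counts.get((prev, curr), 0) + k) / (context_counts.get(prev, 0) + k * vocab_size)
```

With the toy counts above and a 4-word vocabulary, P(cat | the) drops from 1.0 to (2 + 1) / (2 + 4) = 0.5: half the probability mass now goes to continuations that were never observed, which is why add-one is usually too aggressive.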
Back-off and interpolation
If the trigram count is zero, fall back to bigram. If bigram is zero, fall back to unigram.
Linear interpolation:
P̂(wₜ | wₜ₋₂, wₜ₋₁) = λ₁ P(wₜ) + λ₂ P(wₜ | wₜ₋₁) + λ₃ P(wₜ | wₜ₋₂, wₜ₋₁)
The λ weights are tuned on held-out data.
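A minimal sketch of the interpolation, assuming p_uni, p_bi, and p_tri are the MLE estimators from counting; the λ values shown are placeholders, not tuned:

```python
def p_interp(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linearly interpolated trigram estimate:
    λ1·P(w) + λ2·P(w | prev1) + λ3·P(w | prev2, prev1),
    with non-negative λs summing to 1 (tuned on held-out data, e.g. by EM)."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, prev1) + l3 * p_tri(w, prev2, prev1)
```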
Kneser-Ney smoothing
The state-of-the-art for n-gram smoothing. Discounts seen counts by a fixed amount and redistributes the freed mass according to continuation probability: how many distinct contexts a word has appeared in, not how often it occurs overall. Modified Kneser-Ney was the gold standard pre-neural.
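A minimal sketch of the interpolated (not modified) variant for bigrams, just to make the continuation-count idea concrete; it assumes every queried context was seen at least once:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(corpus, d=0.75):
    """Interpolated Kneser-Ney for bigrams. Each seen bigram count is discounted by d,
    and the freed-up mass is redistributed according to how many distinct contexts a
    word appears in (its continuation count), not how frequent it is overall."""
    bigram_counts, context_counts = Counter(), Counter()
    followers, preceders = defaultdict(set), defaultdict(set)
    for sent in corpus:
        for prev, curr in zip(sent, sent[1:]):
            bigram_counts[(prev, curr)] += 1
            context_counts[prev] += 1
            followers[prev].add(curr)
            preceders[curr].add(prev)
    n_bigram_types = len(bigram_counts)

    def p(curr, prev):
        discounted = max(bigram_counts[(prev, curr)] - d, 0) / context_counts[prev]
        backoff_weight = d * len(followers[prev]) / context_counts[prev]
        p_continuation = len(preceders[curr]) / n_bigram_types
        return discounted + backoff_weight * p_continuation

    return p
```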
Perplexity
The standard intrinsic metric for a language model.
PPL(W) = P(w₁..wₙ)^(−1/n) = exp(−(1/n) Σₜ log P(wₜ | w₁..wₜ₋₁))
Interpretation: the model’s effective branching factor. A perplexity of 50 means “as uncertain as picking uniformly from 50 options at each step.”
Lower is better. State-of-the-art trigram models on Penn Treebank: ~140. State-of-the-art transformers (2024+): ~10–20 on the same data.
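A minimal sketch of the computation for a bigram model p(curr, prev), assuming a <s> start-of-sentence padding convention (illustrative; any zero probability breaks it, which is the sparsity problem again):

```python
import math

def perplexity(sentences, p, bos="<s>"):
    """exp of the average negative log-probability per predicted token."""
    total_log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        padded = [bos] + sent
        for prev, curr in zip(padded, padded[1:]):
            total_log_prob += math.log(p(curr, prev))  # raises on zero probabilities
            n_tokens += 1
    return math.exp(-total_log_prob / n_tokens)
```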
Why n-grams hit a wall
- Sparsity scales exponentially with n. You can’t go much beyond 5-grams without astronomical amounts of data.
- No generalization. “Dog ate my homework” and “cat ate my homework” are unrelated to the model.
- No semantics. Words are atomic symbols; “good” and “great” are as different as “good” and “rotation.”
This is exactly what neural language models fixed.
When n-grams still appear
- Speech recognition decoders: KenLM-style models still appear in low-latency ASR.
- Statistical machine translation: largely replaced by neural MT but lingers.
- Anomaly detection: e.g. detecting unusual command-line invocations.
- As a baseline in benchmarks.
The conceptual contribution
n-gram models gave us:
- The autoregressive framing — predict next token given previous tokens. Modern LLMs are still autoregressive language models.
- Perplexity as a metric.
- Smoothing as an idea — neural models implement it implicitly via embedding similarity.
Every modern LLM you’ll meet is, at the core, doing what n-gram models did: estimating P(wₜ | history). They just do it with billions of parameters and attention over long contexts instead of counting bigrams.
See also
- Neural language models — the next leap
- Information theory — perplexity formalized