n-gram Models

The first language models. Count-based, statistical, no neural networks. Replaced almost entirely now, but worth understanding because the probabilistic framing they introduced is still the core of modern LLMs.

The chain rule of probability

The probability of any sequence factors as:

P(w₁, w₂, ..., wₙ) = P(w₁) · P(w₂ | w₁) · P(w₃ | w₁, w₂) · ... · P(wₙ | w₁..wₙ₋₁)

A language model is a way to compute or estimate P(wₜ | w₁..wₜ₋₁) — the probability of the next word given everything before.
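
As a minimal sketch (in Python; cond_prob is a hypothetical stand-in for any estimator of P(wₜ | w₁..wₜ₋₁)), the chain rule turns a sequence's log-probability into a sum of next-word log-probabilities:

    import math

    def sequence_log_prob(words, cond_prob):
        """Chain rule: log P(w1..wn) = sum over t of log P(w_t | w_1..w_{t-1}).

        cond_prob(word, history) can be any estimator of the next-word
        probability given the full preceding history.
        """
        log_p = 0.0
        for t, word in enumerate(words):
            history = words[:t]            # everything before position t
            log_p += math.log(cond_prob(word, history))
        return log_p

Everything that follows in this section, and every modern LLM, is just a different choice of cond_prob.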

The Markov assumption

Conditioning on the entire history is intractable. n-gram models assume only the previous n−1 tokens matter:

P(wₜ | w₁..wₜ₋₁) ≈ P(wₜ | wₜ₋ₙ₊₁..wₜ₋₁)
  • Unigram: P(wₜ) — no context.
  • Bigram: P(wₜ | wₜ₋₁) — previous word.
  • Trigram: P(wₜ | wₜ₋₂, wₜ₋₁).
  • 5-gram: typical for production speech recognition c. 2010.
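
Concretely, the Markov assumption means an order-n model throws away all but the last n−1 words of the history before estimating anything. A minimal sketch:

    def markov_context(history, n):
        """The Markov assumption of an order-n model: keep only the last n-1 words."""
        if n <= 1:
            return []                      # unigram: no context at all
        return history[-(n - 1):]

    # For a trigram model (n = 3):
    # markov_context(["the", "cat", "sat", "on"], 3) -> ["sat", "on"]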

Estimation by counting

Maximum likelihood estimate from a corpus:

P(wₜ | wₜ₋₁) = count(wₜ₋₁, wₜ) / count(wₜ₋₁)

For a corpus consisting of "the cat sat" and "the cat ran":

  • count(the, cat) = 2
  • count(the) = 2
  • P(cat | the) = 1.0
  • P(sat | cat) = 0.5
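
A minimal sketch of bigram MLE by counting, reproducing the toy corpus above (variable names are my own):

    from collections import Counter

    corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        unigram_counts.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))

    def mle_bigram_prob(prev, word):
        """P(word | prev) = count(prev, word) / count(prev)."""
        if unigram_counts[prev] == 0:
            return 0.0                     # unseen context: undefined, treated as 0
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(mle_bigram_prob("the", "cat"))   # 1.0
    print(mle_bigram_prob("cat", "sat"))   # 0.5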

The sparsity problem

For a vocabulary of size V, there are V^2 possible bigrams, V^3 possible trigrams, and V^n possible n-grams in general. With V = 50,000 that is already 2.5 × 10⁹ bigrams and 1.25 × 10¹⁴ trigrams. Even on huge corpora, most n-grams are unseen. Their MLE probability is 0, which:

  1. Assigns probability 0 to any sequence containing them — a deal-breaker for evaluation (perplexity → ∞).
  2. Conflates "never seen in training" with "impossible": the zero reflects limited data, not actual impossibility.

Smoothing

Redistribute probability mass to unseen n-grams.

Add-one (Laplace) smoothing

P(wₜ | wₜ₋₁) = (count(wₜ₋₁, wₜ) + 1) / (count(wₜ₋₁) + V)

Simple. Generally too aggressive.

Add-k smoothing

Same idea, with a fractional count k < 1 tuned on held-out data. Less aggressive than add-one, and usually better.
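
A minimal sketch of add-k estimation, reusing the bigram and unigram counts from the counting example above (k = 1 recovers Laplace; V is the vocabulary size):

    def add_k_bigram_prob(prev, word, bigram_counts, unigram_counts, V, k=1.0):
        """(count(prev, word) + k) / (count(prev) + k * V); k = 1 is add-one (Laplace)."""
        return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

On the toy corpus above (V = 4), add-one moves P(cat | the) from 1.0 to (2 + 1)/(2 + 4) = 0.5, which is what "too aggressive" looks like in practice; a small k perturbs the MLE far less.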

Back-off and interpolation

If the trigram count is zero, fall back to the bigram estimate; if the bigram count is also zero, fall back to the unigram. (Proper back-off schemes such as Katz back-off rescale the lower-order estimates so the distribution still sums to 1.)

Linear interpolation:

P̂(wₜ | wₜ₋₂, wₜ₋₁) = λ₁ P(wₜ) + λ₂ P(wₜ | wₜ₋₁) + λ₃ P(wₜ | wₜ₋₂, wₜ₋₁)

The λ weights are non-negative, sum to 1, and are tuned on held-out data.
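
A minimal sketch of linear interpolation, assuming you already have unigram, bigram, and trigram estimators (the callables here are placeholders):

    def interpolated_trigram_prob(w, w_prev1, w_prev2, p_uni, p_bi, p_tri,
                                  lambdas=(0.1, 0.3, 0.6)):
        """lambda1 * P(w) + lambda2 * P(w | w_prev1) + lambda3 * P(w | w_prev2, w_prev1).

        p_uni, p_bi, p_tri are placeholder estimators (MLE, add-k, ...);
        the lambdas are non-negative, sum to 1, and are tuned on held-out data.
        """
        l1, l2, l3 = lambdas
        return (l1 * p_uni(w)
                + l2 * p_bi(w, w_prev1)
                + l3 * p_tri(w, w_prev2, w_prev1))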

Kneser-Ney smoothing

The state of the art for n-gram smoothing. It discounts every seen count by a fixed amount and redistributes the freed mass in proportion to how many distinct words each word has been seen to follow (its continuation probability), rather than its raw frequency. Modified Kneser-Ney was the gold standard pre-neural.
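
A minimal sketch of interpolated Kneser-Ney for bigrams, with a single absolute discount d (the modified variant uses several) and the count dictionaries from the counting example; the unigram count of the context stands in for the context total Σ c(prev, ·):

    from collections import defaultdict

    def make_kneser_ney_bigram(bigram_counts, unigram_counts, d=0.75):
        """Interpolated Kneser-Ney for bigrams with a single absolute discount d.

        P_KN(w | prev) = max(c(prev, w) - d, 0) / c(prev)
                         + lambda(prev) * P_continuation(w)
        """
        followers = defaultdict(set)   # distinct words seen after each context
        preceders = defaultdict(set)   # distinct contexts seen before each word
        for (prev, w) in bigram_counts:
            followers[prev].add(w)
            preceders[w].add(prev)
        total_bigram_types = len(bigram_counts)

        def prob(prev, w):
            # Continuation probability: how often does w complete a novel bigram type?
            p_cont = len(preceders[w]) / total_bigram_types
            if unigram_counts[prev] == 0:
                return p_cont              # unseen context: fall back entirely
            # Discount every seen count by d...
            discounted = max(bigram_counts[(prev, w)] - d, 0.0) / unigram_counts[prev]
            # ...and hand the freed mass to the continuation distribution.
            lam = d * len(followers[prev]) / unigram_counts[prev]
            return discounted + lam * p_cont

        return prob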

Perplexity

The standard intrinsic metric for a language model.

PPL(W) = P(w₁..wₙ)^(−1/n) = exp(− (1/n) Σₜ log P(wₜ | w₁..wₜ₋₁))

Interpretation: the model’s effective branching factor. A perplexity of 50 means “as uncertain as picking uniformly from 50 options at each step.”
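
A minimal sketch of the perplexity computation from per-token probabilities (the probability list is assumed to come from whatever model is being evaluated):

    import math

    def perplexity(probs):
        """exp(-(1/N) * sum of log p) over the per-token probabilities of a test set."""
        return math.exp(-sum(math.log(p) for p in probs) / len(probs))

    # Uniform uncertainty over 50 options at every step gives perplexity 50:
    print(perplexity([1 / 50] * 1000))     # 50.0 (up to floating-point error)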

Lower is better. State-of-the-art trigram models on Penn Treebank: ~140. State-of-the-art transformers (2024+): ~10–20 on the same data.

Why n-grams hit a wall

  1. Sparsity scales exponentially with n. You can’t go past 5-grams without astronomical data.
  2. No generalization. “Dog ate my homework” and “cat ate my homework” are unrelated to the model.
  3. No semantics. Words are atomic symbols; “good” and “great” are as different as “good” and “rotation.”

This is exactly what neural language models fixed.

When n-grams still appear

  • Speech recognition decoders: KenLM-style models still appear in low-latency ASR.
  • Statistical machine translation: largely replaced by neural MT but lingers.
  • Anomaly detection: e.g. detecting unusual command-line invocations.
  • As a baseline in benchmarks.

The conceptual contribution

n-gram models gave us:

  1. The autoregressive framing — predict next token given previous tokens. Modern LLMs are still autoregressive language models.
  2. Perplexity as a metric.
  3. Smoothing as an idea — neural models implement it implicitly via embedding similarity.

Every modern LLM you’ll meet is, at the core, doing what n-gram models did: estimating P(wₜ | history). They just do it with billions of parameters and attention over long contexts instead of counting bigrams.

See also