demo
The number every LM paper opens with.
Perplexity is the average branching factor: the number of next-token choices a language model is effectively hesitating between at each step. Type any text; score it against three n-gram models trained on different corpora; see why "lower is better" only holds within a single tokenizer family.
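A minimal sketch of that number, assuming per-token probabilities are already in hand (the values below are invented for illustration): perplexity is two raised to the mean per-token surprisal, i.e., the geometric mean of 1/p over the tokens.

```python
# A minimal sketch of the number itself, assuming we already have per-token
# probabilities from some model. The probability values are made up for
# illustration; they are not output from the demo's three n-gram models.
import math

def perplexity(token_probs):
    """2 ** (mean surprisal in bits) = the geometric-mean branching factor."""
    bits = [-math.log2(p) for p in token_probs]      # surprisal of each token
    return 2 ** (sum(bits) / len(bits))

print(perplexity([0.5, 0.5, 0.5, 0.5]))    # 2.0  -> hesitating between ~2 choices
print(perplexity([0.1, 0.1, 0.1, 0.1]))    # 10.0 -> ~10-way uncertainty per token
print(perplexity([0.5, 0.5, 0.5, 0.001]))  # ~9.5 -> one surprising token dominates
```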
Try this — predict before you click
- Score the same text against all three corpora at n=2. Predict: the corpus-matched model wins by 10–100× on perplexity. Wikipedia text under the Shakespeare model jumps to triple-digit perplexity because it's full of OOV bigrams.
- Pick the OOD sample (e.g., the JSON code) and score it under Shakespeare. Look at the per-token surprisal chips. Predict: the token-level highlights cluster on punctuation and code syntax ({}[]:) because Shakespeare's bigrams never saw them. The high bits-per-token tail dominates the average.
- Same text, slide n from 1 → 3. Predict: at n=1 (unigram), perplexity is high but stable across corpora — just word frequencies. At n=3, the corpus-matched model gets dramatically sharper while the off-corpus models actually get worse (more OOV trigrams). Higher n isn't free.
- Edit a single word in the text to a rare-but-real word (e.g., replace "the" with "perspicacious"). Predict: the OOV badge fires for that token, the chip turns red, and the running average jumps. Even one out-of-vocab token can blow up an n-gram model's perplexity (the sketch after this list walks through a toy version of this scoring).
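All four steps run on the same scoring loop. Below is a minimal sketch, not the demo's implementation: an order-n model with add-one (Laplace) smoothing over whitespace tokens, trained on two tiny stand-in corpora, reporting perplexity plus per-token surprisal in bits and an OOV flag. With corpora this small the gaps are muted and the exact numbers won't match the demo, but the mechanics are the same.

```python
# A toy version of the demo's scoring loop. The two corpora, whitespace
# tokenization, and add-one smoothing are all assumptions for illustration.
import math
from collections import Counter

def ngram_model(tokens, n):
    """Count order-n n-grams and their (n-1)-gram contexts over one corpus."""
    padded = ["<s>"] * (n - 1) + tokens
    ngrams = Counter(tuple(padded[i:i + n]) for i in range(len(tokens)))
    contexts = Counter(tuple(padded[i:i + n - 1]) for i in range(len(tokens)))
    return contexts, ngrams, set(tokens)

def score(text, model, n):
    """Per-token (token, surprisal in bits, is_OOV) rows under the model."""
    contexts, ngrams, vocab = model
    tokens = text.split()
    padded = ["<s>"] * (n - 1) + tokens
    v = len(vocab) + 1                                   # +1 for unseen words
    rows = []
    for i, tok in enumerate(tokens):
        ctx = tuple(padded[i:i + n - 1])
        # Add-one smoothing keeps unseen n-grams from getting probability zero.
        p = (ngrams[ctx + (tok,)] + 1) / (contexts[ctx] + v)
        rows.append((tok, -math.log2(p), tok not in vocab))
    return rows

def perplexity(rows):
    return 2 ** (sum(bits for _, bits, _ in rows) / len(rows))

shakespeare = "to be or not to be that is the question".split()
wikipedia = "the cat is a small domesticated carnivorous mammal".split()
text = "to be or not to be"

# Same text scored under both corpora at n = 1, 2, 3 (list items 1 and 3).
for n in (1, 2, 3):
    for name, corpus in (("shakespeare", shakespeare), ("wikipedia", wikipedia)):
        ppl = perplexity(score(text, ngram_model(corpus, n), n))
        print(f"n={n}  {name:12s} perplexity = {ppl:5.1f}")

# Per-token "chips" (list items 2 and 4): an off-corpus word shows up as an
# OOV spike that drags the running average up.
for tok, bits, oov in score("to be or not to mammal", ngram_model(shakespeare, 2), 2):
    print(f"{tok:8s} {bits:5.2f} bits{'  OOV' if oov else ''}")
```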
Anchored to 04-language-modeling/n-gram-models.