Information Theory for ML
Information theory gives us the language for “how much information” and “how surprising.” The standard classification loss comes straight from it. Every compression algorithm is bounded by it. And it explains, precisely, what a language model is trying to do.
Entropy
The entropy of a distribution P is the expected surprise:
H(P) = − Σ_x P(x) log P(x)
(Conventions: 0 · log 0 = 0. The base of the log determines units — base 2 = bits, base e = nats. ML usually uses nats.)
Intuitions:
- Surprise of an event with probability p is −log p. Rare event = big surprise. Certain event = no surprise.
- Entropy is the average surprise. A fair coin has entropy 1 bit; a 90/10 coin has ~0.47 bits.
- Maximum entropy for K outcomes is log K, achieved by the uniform distribution. Adding any structure decreases entropy.
import numpy as np

def entropy(p):
    p = np.asarray(p)
    return -np.sum(p * np.log(p + 1e-12))
entropy([0.5, 0.5]) # ~0.693 nats = 1 bit
entropy([0.9, 0.1]) # ~0.325 nats
entropy([1.0, 0.0]) # 0 (no uncertainty)
Why entropy matters in ML
A language model defines a distribution over the next token. The perplexity of the model on a held-out corpus is the exponential of its average per-token cross-entropy:
PPL = exp(H(P_data, P_model))
Lower perplexity = the model puts more probability on what actually came next. This is the most-used intrinsic metric for language modeling.
When you scale a language model from 1B to 70B parameters, perplexity drops smoothly along the scaling laws.
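As a minimal sketch of the relationship (the per-token probabilities below are made up), perplexity is just the exponentiated average negative log-probability the model assigned to the tokens that actually occurred:

import numpy as np

# Hypothetical probabilities the model gave to each actual next token
token_probs = np.array([0.3, 0.05, 0.6, 0.12, 0.4])

avg_nll = -np.mean(np.log(token_probs))   # average cross-entropy in nats
ppl = np.exp(avg_nll)                     # perplexity
# ppl ≈ 4.7: roughly as uncertain as a uniform choice over ~5 tokens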
Cross-entropy
The expected number of nats needed to encode samples from P using a code optimized for Q:
H(P, Q) = − Σ_x P(x) log Q(x)
Decomposes as:
H(P, Q) = H(P) + KL(P || Q)
So cross-entropy = entropy of true distribution + how wrong our model is.
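A quick numerical check of the decomposition, with made-up distributions:

import numpy as np

P = np.array([0.7, 0.2, 0.1])   # "true" distribution
Q = np.array([0.4, 0.4, 0.2])   # model distribution

H_P  = -np.sum(P * np.log(P))        # entropy H(P)
H_PQ = -np.sum(P * np.log(Q))        # cross-entropy H(P, Q)
KL   = np.sum(P * np.log(P / Q))     # KL(P || Q)
assert np.isclose(H_PQ, H_P + KL)    # H(P, Q) = H(P) + KL(P || Q)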
In supervised learning, P is one-hot (true label), Q is the model’s predicted distribution. Then:
H(P, Q) = − log Q(true_class)
That’s the cross-entropy loss. Negative log-probability of the true class.
import torch
import torch.nn.functional as F
logits = torch.tensor([2.0, 1.0, 0.1]) # raw model outputs
target = torch.tensor(0) # true class index
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
# Internally: softmax → −log(prob[target])
Pitfall: PyTorch’s `cross_entropy` takes logits, not probabilities. Don’t softmax before passing them in.
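Continuing the example above, here is what the mistake looks like (the loss values in the comments are approximate):

probs = F.softmax(logits, dim=-1)
wrong = F.cross_entropy(probs.unsqueeze(0), target.unsqueeze(0))   # softmax applied twice
right = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))  # correct: pass raw logits
# wrong ≈ 0.80 vs right ≈ 0.42: the extra softmax flattens the distribution,
# so the loss is silently too large and its gradients are distorted.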
KL divergence
Already introduced in probability-statistics.md:
KL(P || Q) = Σ_x P(x) log(P(x) / Q(x))
In information theory terms: the average extra bits/nats needed to encode samples from P if you optimized your code for Q instead.
Why it matters in modern ML:
- VAEs: loss = reconstruction + KL(approximate posterior || prior).
- Diffusion models: variational bound involves KL terms at each noise step.
- RLHF / DPO / PPO: KL penalty between the fine-tuned model and the reference model — keeps it from drifting too far.
- Knowledge distillation: minimize KL between student and teacher (see the sketch after this list).
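As a concrete instance of the last point, a minimal distillation sketch (the logits and the temperature T are made up; F.kl_div expects log-probabilities as its first argument and probabilities as its second, and with those inputs computes KL(teacher || student)):

import torch
import torch.nn.functional as F

teacher_logits = torch.randn(8, 100)   # hypothetical teacher outputs
student_logits = torch.randn(8, 100)   # hypothetical student outputs
T = 2.0                                # softening temperature (a common choice)

# KL(teacher || student) on temperature-softened distributions, averaged over
# the batch; the T*T factor keeps the gradient scale comparable across T.
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)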
Forward vs reverse KL
KL(P || Q) and KL(Q || P) behave very differently.
- Forward KL, KL(data || model): “mode-covering.” Q must put mass everywhere P does.
- Reverse KL, KL(model || data): “mode-seeking.” Q tends to lock onto a single mode of P.
This asymmetry shows up subtly in generative models. Reverse-KL training tends to produce sharper but less diverse outputs.
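A small numerical sketch of this behavior (everything below is made up for illustration): fit a single discretized Gaussian Q to a bimodal P by brute-force search, once under each direction of the KL.

import numpy as np

x = np.linspace(-6, 6, 601)

def normalize(p):
    return p / p.sum()

def gauss(mu, sigma):
    return normalize(np.exp(-0.5 * ((x - mu) / sigma) ** 2))

def kl(p, q):
    return np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))

# Bimodal "data" distribution: two well-separated modes at ±3
P = normalize(gauss(-3, 0.5) + gauss(3, 0.5))

# Brute-force search over single-Gaussian candidates for Q
candidates = [(mu, sigma) for mu in np.linspace(-4, 4, 81)
                          for sigma in np.linspace(0.3, 5, 48)]
best_fwd = min(candidates, key=lambda ms: kl(P, gauss(*ms)))  # forward: KL(P || Q)
best_rev = min(candidates, key=lambda ms: kl(gauss(*ms), P))  # reverse: KL(Q || P)

print(best_fwd)  # mean near 0, large sigma: Q spreads to cover both modes
print(best_rev)  # mean near ±3, small sigma: Q locks onto a single mode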
Mutual information
How much knowing Y tells you about X:
I(X; Y) = KL(P(X, Y) || P(X) · P(Y)) = H(X) − H(X | Y)
If X and Y are independent, I(X; Y) = 0. If Y determines X, I(X; Y) = H(X).
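A quick sketch with a made-up 2×2 joint distribution:

import numpy as np

# Joint P(X, Y): rows index X, columns index Y (made-up numbers)
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px = joint.sum(axis=1, keepdims=True)   # marginal P(X)
py = joint.sum(axis=0, keepdims=True)   # marginal P(Y)

mi = np.sum(joint * np.log(joint / (px * py)))
# ≈ 0.19 nats; it would be 0 if the table factored as P(X) · P(Y)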
Used in:
- Self-supervised learning (InfoNCE / contrastive objectives maximize a lower bound on the mutual information between augmented views)
- Information bottleneck theory of deep learning (controversial but interesting)
The information-theoretic view of training
When you train a classifier or language model with cross-entropy, you are:
- Defining a probabilistic model Q_θ(y | x)
- Receiving samples from the true joint P(x, y)
- Choosing θ to minimize the expected −log Q_θ(y | x)
This is maximum likelihood estimation (MLE) on the model parameters. It is also minimizing the cross-entropy between data and model. Same thing, two languages.
Implication: a well-trained classifier’s probabilities are calibrated estimates of P(y | x). (In practice, modern deep models are often miscalibrated — they’re overconfident. Calibration is a separate concern; see Stage 02.)
Compression and intelligence
A famous claim: good prediction is good compression. If you can predict the next token well, you can encode it cheaply (Huffman / arithmetic coding). Larger language models compress text more efficiently; on this view, that is what it means for them to “understand” text better in any meaningful sense.
This is the lens behind Marcus Hutter’s AIXI model, the Hutter Prize, and a thread of arguments about why LLMs work. You don’t need to buy the strong philosophical version to see the connection: the loss function is literally a compression metric.
Practical takeaways
- Cross-entropy is your loss. For classification and language modeling.
- KL is your regularizer. When you want a model close to a reference distribution.
- Perplexity is your metric. For language modeling specifically.
- Watch entropy. If model output entropy is collapsing, you have mode collapse / overconfidence. If it’s too high, the model isn’t committing (see the sketch below).
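A sketch of that kind of monitoring, assuming a batch of raw logits from some model:

import torch
import torch.nn.functional as F

def mean_output_entropy(logits):
    # logits: (batch, num_classes) raw model outputs
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

# Near 0: the model is (over)confident. Near log(num_classes): it isn't committing.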
Exercises
- Entropy curve. Plot H(p, 1−p) for p ∈ [0, 1]. Verify it peaks at p = 0.5.
- KL by hand. For P = [0.5, 0.5] and Q = [0.9, 0.1], compute KL(P || Q) and KL(Q || P). Confirm the asymmetry.
- Cross-entropy = NLL. Pick a model that outputs probabilities [0.3, 0.5, 0.2] for a true class of 1. Compute the cross-entropy. Confirm it equals −log(0.5).
- Perplexity sanity. A model assigns probability 1/V to every token (uniform). What’s its perplexity? Compare to a typical 7B LLM perplexity (~3–6 on Wikipedia).
See also
- Probability & statistics
- Stage 02 — Loss functions
- Stage 04 — Language models
- Stage 10 — RLHF — where the KL penalty lives