Information Theory for ML
Information theory gives us the language for “how much information” and “how surprising.” The standard classification loss comes straight from it. Every compression algorithm is bounded by it. And it explains, precisely, what a language model is trying to do.
Entropy
The entropy of a distribution P is the expected surprise:
H(P) = − Σ_x P(x) log P(x)
(Conventions: 0 · log 0 = 0. The base of the log determines units — base 2 = bits, base e = nats. ML usually uses nats.)
Intuitions:
- Surprise of an event with probability p is −log p. Rare event = big surprise. Certain event = no surprise.
- Entropy is the average surprise. A fair coin has entropy 1 bit; a 90/10 coin has ~0.47 bits.
- Maximum entropy for K outcomes is log K, achieved by the uniform distribution. Adding any structure decreases entropy.
import numpy as np

def entropy(p):
    p = np.asarray(p)
    return -np.sum(p * np.log(p + 1e-12))
entropy([0.5, 0.5]) # ~0.693 nats = 1 bit
entropy([0.9, 0.1]) # ~0.325 nats
entropy([1.0, 0.0]) # 0 (no uncertainty)
Why entropy matters in ML
A language model defines a distribution over the next token. The perplexity of the model on a held-out corpus is the exponential of its average per-token cross-entropy:
PPL = exp(H(P_data, P_model))
Lower perplexity = the model puts more probability on what actually came next. This is the most-used intrinsic metric for language modeling.
When you scale a language model from 1B to 70B parameters, perplexity drops smoothly along the scaling laws.
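As a minimal sketch of the relationship (the per-token probabilities below are made up), perplexity is just the exponentiated average negative log-probability the model assigned to the tokens that actually occurred:

import numpy as np

# Hypothetical probabilities the model gave to each actual next token
token_probs = np.array([0.3, 0.05, 0.6, 0.12, 0.4])

avg_nll = -np.mean(np.log(token_probs))   # average cross-entropy in nats
ppl = np.exp(avg_nll)                     # perplexity
# ppl ≈ 4.7: roughly as uncertain as a uniform choice over ~5 tokens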
Cross-entropy
The expected number of nats needed to encode samples from P using a code optimized for Q:
H(P, Q) = − Σ_x P(x) log Q(x)
Decomposes as:
H(P, Q) = H(P) + KL(P || Q)
So cross-entropy = entropy of true distribution + how wrong our model is.
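A quick numerical check of the decomposition, with made-up distributions:

import numpy as np

P = np.array([0.7, 0.2, 0.1])   # "true" distribution
Q = np.array([0.4, 0.4, 0.2])   # model distribution

H_P  = -np.sum(P * np.log(P))        # entropy H(P)
H_PQ = -np.sum(P * np.log(Q))        # cross-entropy H(P, Q)
KL   = np.sum(P * np.log(P / Q))     # KL(P || Q)
assert np.isclose(H_PQ, H_P + KL)    # H(P, Q) = H(P) + KL(P || Q)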
In supervised learning, P is one-hot (true label), Q is the model’s predicted distribution. Then:
H(P, Q) = − log Q(true_class)
That’s the cross-entropy loss. Negative log-probability of the true class.
import torch
import torch.nn.functional as F
logits = torch.tensor([2.0, 1.0, 0.1]) # raw model outputs
target = torch.tensor(0) # true class index
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
# Internally: softmax → −log(prob[target])
Pitfall: PyTorch’s `cross_entropy` takes logits, not probabilities. Don’t softmax before passing them in.
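Continuing the example above, here is what the mistake looks like (the loss values in the comments are approximate):

probs = F.softmax(logits, dim=-1)
wrong = F.cross_entropy(probs.unsqueeze(0), target.unsqueeze(0))   # softmax applied twice
right = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))  # correct: pass raw logits
# wrong ≈ 0.80 vs right ≈ 0.42: the extra softmax flattens the distribution,
# so the loss is silently too large and its gradients are distorted.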
KL divergence
Already introduced in probability-statistics.md:
KL(P || Q) = Σ_x P(x) log(P(x) / Q(x))
In information theory terms: the average extra bits/nats needed to encode samples from P if you optimized your code for Q instead.
Why it matters in modern ML:
- VAEs: loss = reconstruction + KL(approximate posterior || prior).
- Diffusion models: variational bound involves KL terms at each noise step.
- RLHF / DPO / PPO: KL penalty between the fine-tuned model and the reference model — keeps it from drifting too far.
- Knowledge distillation: minimize KL between student and teacher (see the sketch after this list).
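As a concrete instance of the last point, a minimal distillation sketch (the logits and the temperature T are made up; F.kl_div expects log-probabilities as its first argument and probabilities as its second, and with those inputs computes KL(teacher || student)):

import torch
import torch.nn.functional as F

teacher_logits = torch.randn(8, 100)   # hypothetical teacher outputs
student_logits = torch.randn(8, 100)   # hypothetical student outputs
T = 2.0                                # softening temperature (a common choice)

# KL(teacher || student) on temperature-softened distributions, averaged over
# the batch; the T*T factor keeps the gradient scale comparable across T.
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)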
Forward vs reverse KL
KL(P || Q) and KL(Q || P) behave very differently.
- Forward KL, KL(data || model): “mode-covering.” Q must put mass everywhere P does.
- Reverse KL, KL(model || data): “mode-seeking.” Q tends to lock onto a single mode of P.
This asymmetry shows up subtly in generative models. Reverse-KL training tends to produce sharper but less diverse outputs.
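A small numerical sketch of this behavior (everything below is made up for illustration): fit a single discretized Gaussian Q to a bimodal P by brute-force search, once under each direction of the KL.

import numpy as np

x = np.linspace(-6, 6, 601)

def normalize(p):
    return p / p.sum()

def gauss(mu, sigma):
    return normalize(np.exp(-0.5 * ((x - mu) / sigma) ** 2))

def kl(p, q):
    return np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))

# Bimodal "data" distribution: two well-separated modes at ±3
P = normalize(gauss(-3, 0.5) + gauss(3, 0.5))

# Brute-force search over single-Gaussian candidates for Q
candidates = [(mu, sigma) for mu in np.linspace(-4, 4, 81)
                          for sigma in np.linspace(0.3, 5, 48)]
best_fwd = min(candidates, key=lambda ms: kl(P, gauss(*ms)))  # forward: KL(P || Q)
best_rev = min(candidates, key=lambda ms: kl(gauss(*ms), P))  # reverse: KL(Q || P)

print(best_fwd)  # mean near 0, large sigma: Q spreads to cover both modes
print(best_rev)  # mean near ±3, small sigma: Q locks onto a single mode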
Mutual information
How much knowing Y tells you about X:
I(X; Y) = KL(P(X, Y) || P(X) · P(Y)) = H(X) − H(X | Y)
If X and Y are independent, I(X; Y) = 0. If Y determines X, I(X; Y) = H(X).
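A quick sketch with a made-up 2×2 joint distribution:

import numpy as np

# Joint P(X, Y): rows index X, columns index Y (made-up numbers)
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px = joint.sum(axis=1, keepdims=True)   # marginal P(X)
py = joint.sum(axis=0, keepdims=True)   # marginal P(Y)

mi = np.sum(joint * np.log(joint / (px * py)))
# ≈ 0.19 nats; it would be 0 if the table factored as P(X) · P(Y)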
Used in:
- Self-supervised learning (InfoNCE / contrastive objectives maximize a lower bound on the mutual information between augmented views)
- Information bottleneck theory of deep learning (controversial but interesting)
The information-theoretic view of training
When you train a classifier or language model with cross-entropy, you are:
- Defining a probabilistic model Q_θ(y | x)
- Receiving samples from the true joint P(x, y)
- Choosing θ to minimize the expected −log Q_θ(y | x)
This is maximum likelihood estimation (MLE) on the model parameters. It is also minimizing the cross-entropy between data and model. Same thing, two languages.
Implication: a well-trained classifier’s probabilities are calibrated estimates of P(y | x). (In practice, modern deep models are often miscalibrated — they’re overconfident. Calibration is a separate concern; see Stage 02.)
Compression and intelligence
A famous claim: good prediction is good compression. If you can predict the next token well, you can encode it cheaply (Huffman / arithmetic coding). Larger language models compress text more efficiently; on this view, that is what it means for them to “understand” text better in any meaningful sense.
This is the lens behind Marcus Hutter’s AIXI model, the Hutter Prize, and a thread of arguments about why LLMs work. You don’t need to buy the strong philosophical version to see the connection: the loss function is literally a compression metric.
Practical takeaways
- Cross-entropy is your loss. For classification and language modeling.
- KL is your regularizer. When you want a model close to a reference distribution.
- Perplexity is your metric. For language modeling specifically.
- Watch entropy. If model output entropy is collapsing, you have mode collapse / overconfidence. If it’s too high, the model isn’t committing (see the sketch below).
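A sketch of that kind of monitoring, assuming a batch of raw logits from some model:

import torch
import torch.nn.functional as F

def mean_output_entropy(logits):
    # logits: (batch, num_classes) raw model outputs
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

# Near 0: the model is (over)confident. Near log(num_classes): it isn't committing.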
Exercises
- Entropy curve. Plot H(p, 1−p) for p ∈ [0, 1]. Verify it peaks at p = 0.5.
- KL by hand. For P = [0.5, 0.5] and Q = [0.9, 0.1], compute KL(P || Q) and KL(Q || P). Confirm the asymmetry.
- Cross-entropy = NLL. Pick a model that outputs probabilities [0.3, 0.5, 0.2] for a true class of 1. Compute the cross-entropy. Confirm it equals −log(0.5).
- Perplexity sanity. A model assigns probability 1/V to every token (uniform). What’s its perplexity? Compare to a typical 7B LLM perplexity (~3–6 on Wikipedia).
See also
- Probability & statistics
- Stage 02 — Loss functions
- Stage 04 — Language models
- Stage 10 — RLHF — where the KL penalty lives