Cross-entropy: the loss every model minimizes
Two distributions. Drag the bars. Watch entropy, cross-entropy, and KL divergence update in real time. This is the math at the heart of nearly every neural-network loss function.
The intuition
- Entropy H(p) = −Σₓ p(x) log₂ p(x) — how many bits of "surprise" does this distribution contain on average?
- Cross-entropy H(p, q) = −Σₓ p(x) log₂ q(x) — what if you compress samples from p using a code optimized for q? You'll waste bits unless q matches p.
- KL(p ‖ q) = H(p, q) − H(p) — exactly how many bits you wasted. Always ≥ 0; zero only when p = q. (The sketch below computes all three.)
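To make the three quantities concrete, here is a minimal numpy sketch that computes them for two small discrete distributions. The distributions p and q are invented for illustration, and base-2 logs are used to match the "bits" framing above.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum p(x) * log2(p(x)). Terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) * log2(q(x)). Infinite if q = 0 where p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl(p, q):
    """KL(p || q) = H(p, q) - H(p). Always >= 0; zero only when p == q."""
    return cross_entropy(p, q) - entropy(p)

p = [0.7, 0.2, 0.1]   # hypothetical "true" distribution
q = [0.5, 0.3, 0.2]   # hypothetical model / code distribution

print(f"H(p)       = {entropy(p):.3f} bits")        # ≈ 1.157
print(f"H(p, q)    = {cross_entropy(p, q):.3f} bits")  # ≈ 1.280
print(f"KL(p || q) = {kl(p, q):.3f} bits")          # ≈ 0.123 bits wasted
```

Setting q equal to p drives the KL term to zero and the cross-entropy down to the entropy, which is exactly what the draggable bars show.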
Where this shows up
- Pre-training: minimize cross-entropy between the true next token (a one-hot p) and the model's predicted distribution q; with a one-hot p this is just the negative log-probability of the correct token. See the sketch after this list.
- RLHF / DPO: a KL term between the policy and the reference model keeps fine-tuning from drifting — an explicit penalty in RLHF, implicit in DPO's derivation.
- Distillation: KL between the teacher's and the student's output distributions (their softmaxed logits, usually softened with a temperature).
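A minimal numpy sketch of all three uses on a toy 4-token vocabulary. The logits, token id, and temperature are invented for illustration, and training losses conventionally use natural logs (nats) rather than the bits above.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# --- Pre-training: one-hot p makes H(p, q) = -log q[true_token] ---
logits = np.array([2.0, 0.5, -1.0, 0.3])   # hypothetical next-token logits
true_token = 0                              # hypothetical correct token id
q = softmax(logits)
loss = -np.log(q[true_token])               # cross-entropy collapses to NLL
print(f"pre-training loss: {loss:.3f} nats")

# --- RLHF-style penalty: KL(policy || reference), exact over the toy vocab ---
ref_logits = np.array([1.5, 0.7, -0.5, 0.2])  # hypothetical reference model
pi, ref = softmax(logits), softmax(ref_logits)
kl_penalty = np.sum(pi * (np.log(pi) - np.log(ref)))
print(f"KL(policy || ref): {kl_penalty:.4f} nats")

# --- Distillation: KL between temperature-softened teacher and student ---
T = 2.0                                     # hypothetical temperature
teacher = softmax(ref_logits / T)           # reusing ref_logits as the "teacher"
student = softmax(logits / T)
distill_kl = np.sum(teacher * (np.log(teacher) - np.log(student)))
print(f"distillation KL:   {distill_kl:.4f} nats")
```

In real RLHF the per-position KL is usually estimated from sampled tokens rather than summed over the whole vocabulary, but the quantity being estimated is the same one shown here.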
Anchored to 01-math-foundations/information-theory.