Cross-entropy: the loss every model minimizes
Two distributions. Drag the bars. Watch entropy, cross-entropy, and KL divergence update in real time. This is the math at the heart of nearly every neural-network loss function.
The intuition
- Entropy H(p) = −Σₓ p(x) log₂ p(x) — how many bits of "surprise" does this distribution contain on average?
- Cross-entropy H(p, q) = −Σₓ p(x) log₂ q(x) — what if you compress samples from p using a code optimized for q? You'll waste bits unless q matches p.
- KL(p ‖ q) = H(p, q) − H(p) — exactly how many bits you wasted. Always ≥ 0; zero only when p = q. (The sketch below computes all three.)
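To make the three quantities concrete, here is a minimal numpy sketch that computes them for two small discrete distributions. The distributions p and q are invented for illustration, and base-2 logs are used to match the "bits" framing above.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum p(x) * log2(p(x)). Terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) * log2(q(x)). Infinite if q = 0 where p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl(p, q):
    """KL(p || q) = H(p, q) - H(p). Always >= 0; zero only when p == q."""
    return cross_entropy(p, q) - entropy(p)

p = [0.7, 0.2, 0.1]   # hypothetical "true" distribution
q = [0.5, 0.3, 0.2]   # hypothetical model / code distribution

print(f"H(p)       = {entropy(p):.3f} bits")        # ≈ 1.157
print(f"H(p, q)    = {cross_entropy(p, q):.3f} bits")  # ≈ 1.280
print(f"KL(p || q) = {kl(p, q):.3f} bits")          # ≈ 0.123 bits wasted
```

Setting q equal to p drives the KL term to zero and the cross-entropy down to the entropy, which is exactly what the draggable bars show.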
Where this shows up
- Pre-training: minimize cross-entropy between the true next token (a one-hot p) and the model's predicted distribution q; with a one-hot p this is just the negative log-probability of the correct token. See the sketch after this list.
- RLHF / DPO: a KL term between the policy and the reference model keeps fine-tuning from drifting — an explicit penalty in RLHF, implicit in DPO's derivation.
- Distillation: KL between the teacher's and the student's output distributions (their softmaxed logits, usually softened with a temperature).
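A minimal numpy sketch of all three uses on a toy 4-token vocabulary. The logits, token id, and temperature are invented for illustration, and training losses conventionally use natural logs (nats) rather than the bits above.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# --- Pre-training: one-hot p makes H(p, q) = -log q[true_token] ---
logits = np.array([2.0, 0.5, -1.0, 0.3])   # hypothetical next-token logits
true_token = 0                              # hypothetical correct token id
q = softmax(logits)
loss = -np.log(q[true_token])               # cross-entropy collapses to NLL
print(f"pre-training loss: {loss:.3f} nats")

# --- RLHF-style penalty: KL(policy || reference), exact over the toy vocab ---
ref_logits = np.array([1.5, 0.7, -0.5, 0.2])  # hypothetical reference model
pi, ref = softmax(logits), softmax(ref_logits)
kl_penalty = np.sum(pi * (np.log(pi) - np.log(ref)))
print(f"KL(policy || ref): {kl_penalty:.4f} nats")

# --- Distillation: KL between temperature-softened teacher and student ---
T = 2.0                                     # hypothetical temperature
teacher = softmax(ref_logits / T)           # reusing ref_logits as the "teacher"
student = softmax(logits / T)
distill_kl = np.sum(teacher * (np.log(teacher) - np.log(student)))
print(f"distillation KL:   {distill_kl:.4f} nats")
```

In real RLHF the per-position KL is usually estimated from sampled tokens rather than summed over the whole vocabulary, but the quantity being estimated is the same one shown here.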
Anchored to 01-math-foundations/information-theory.