Loss Functions & Optimization
The loss function defines what “wrong” means. The optimizer determines how you fix it. Pick the wrong loss → solve the wrong problem. Pick the wrong optimizer → solve the right problem slowly or never.
Regression losses
Mean Squared Error (MSE)
MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
- Penalizes large errors quadratically — outliers dominate.
- Gradient is linear in the residual: ∂L/∂ŷ ∝ (ŷ − y). Clean to optimize.
- Underlying assumption: residuals are Gaussian (MLE under Gaussian noise).
Mean Absolute Error (MAE)
MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
- Robust to outliers.
- Gradient is ±1 — doesn’t shrink near the minimum, so SGD oscillates. Often paired with learning-rate decay.
Huber loss
ℓ(r) = (1/2) r²          if |r| ≤ δ
ℓ(r) = δ (|r| − δ/2)     otherwise
Quadratic near zero, linear far out. Best of both worlds. Use when you have outliers and care about smooth gradients.
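A minimal PyTorch sketch comparing the three regression losses on the same residuals; the toy tensors and `delta=1.0` are illustrative, and `F.huber_loss` is the built-in piecewise form:

```python
import torch
import torch.nn.functional as F

y_hat = torch.tensor([2.0, 0.5, 9.0])   # toy predictions; the last one misses badly
y     = torch.tensor([2.5, 0.0, 1.0])   # targets; the outlier residual is 8

print(F.mse_loss(y_hat, y))                 # outlier contributes 64 before averaging
print(F.l1_loss(y_hat, y))                  # outlier contributes 8
print(F.huber_loss(y_hat, y, delta=1.0))    # quadratic for |r| <= 1, linear beyond
```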
Quantile / pinball loss
For predicting quantiles instead of means. Useful for forecasting with uncertainty intervals.
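The pinball loss itself is not written out above; a minimal sketch, assuming the standard form for a quantile level `q` (`q=0.9` and the helper name are illustrative):

```python
import torch

def pinball_loss(y_hat, y, q=0.9):
    """Pinball (quantile) loss: minimized by the q-th conditional quantile, not the mean."""
    r = y - y_hat                                   # residual
    return torch.mean(torch.maximum(q * r, (q - 1) * r))
```

Training one head per quantile (e.g. 0.1, 0.5, 0.9) is a simple way to get an uncertainty interval around a forecast.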
Classification losses
Cross-entropy (log loss)
For one-hot labels: L = −log(ŷ_true_class).
For binary: L = −[y log ŷ + (1−y) log(1−ŷ)]. (Binary cross-entropy.)
This is the loss for almost every classifier and every language model.
```python
import torch.nn.functional as F

loss = F.cross_entropy(logits, targets)  # logits, not probs
```
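For the binary formula, the numerically stable route is to hand raw logits to the loss rather than sigmoid outputs; a minimal sketch with illustrative tensors:

```python
import torch
import torch.nn.functional as F

logits  = torch.tensor([1.2, -0.7, 3.1])    # raw scores, one per example
targets = torch.tensor([1.0, 0.0, 1.0])     # binary labels as floats
loss = F.binary_cross_entropy_with_logits(logits, targets)  # applies the sigmoid internally
```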
Hinge loss (SVM)
L = max(0, 1 − y · ŷ), with labels y ∈ {−1, +1} and ŷ the raw score
Used for SVMs. Gives a “margin” — predictions far on the right side get zero loss. Largely historical now; cross-entropy is the default.
Focal loss
Cross-entropy with a modulating factor that down-weights easy examples:
FL = −(1−p_t)^γ · log(p_t)
Originally for object detection where most boxes are negatives. Useful whenever you have severe class imbalance.
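A minimal sketch of the binary form of this formula, layered on BCE-with-logits; `gamma=2.0` is a commonly used value and the helper name is illustrative:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss: down-weight well-classified examples by (1 - p_t)**gamma.

    targets is a float tensor of 0.0 / 1.0 labels.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # = -log(p_t)
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)
    return ((1 - p_t) ** gamma * ce).mean()
```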
Embedding / contrastive losses
For learning representations.
Triplet loss
L = max(0, d(anchor, positive) − d(anchor, negative) + margin)
Pull positives close, push negatives apart by at least margin.
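A minimal sketch using the built-in PyTorch triplet loss; the random embeddings and `margin=1.0` are illustrative:

```python
import torch
import torch.nn.functional as F

anchor   = torch.randn(8, 128)    # batch of anchor embeddings
positive = torch.randn(8, 128)    # same item, different view
negative = torch.randn(8, 128)    # different item

loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)  # Euclidean distance by default
```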
InfoNCE / contrastive loss
A softmax over similarities:
L = −log( exp(sim(anchor, positive)/τ) / Σ_k exp(sim(anchor, k)/τ) )
Where the sum is over the positive plus a batch of negatives. Used by CLIP, SimCLR, sentence-transformers. The temperature τ is critical.
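A minimal in-batch sketch of this formula, assuming row i of `z1` pairs with row i of `z2` and every other row in the batch acts as a negative; `tau=0.07` and the function name are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.07):
    """In-batch InfoNCE: the positive pair sits on the diagonal of the similarity matrix."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau                  # cosine similarities scaled by temperature
    labels = torch.arange(z1.size(0))         # index of each row's positive
    return F.cross_entropy(logits, labels)
```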
Cosine embedding loss
Variants that score pairs with cosine similarity directly. A common default for sentence-embedding models.
Sequence losses
For language modeling, the loss is per-token cross-entropy, summed (or averaged) across the sequence:
L = (1/T) Σ_t −log P_θ(tokenₜ | tokens_{<t})
Modern LLM training uses this with a few twists (label smoothing, masked positions, etc.).
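A minimal sketch of the shifted next-token cross-entropy; padded or masked positions are skipped via `ignore_index`, and the helper name is illustrative:

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, tokens, ignore_index=-100):
    """Per-token cross-entropy: the prediction at position t is scored against token t+1."""
    shift_logits = logits[:, :-1, :]          # (batch, T-1, vocab): predictions for the next token
    shift_labels = tokens[:, 1:]              # (batch, T-1): the tokens actually observed next
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=ignore_index,            # positions marked with this id contribute no loss
    )
```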
Regularization terms
Added to the loss to prevent overfitting.
- L2 (weight decay): λ Σ wᵢ². Shrinks weights smoothly. Default for transformers.
- L1: λ Σ |wᵢ|. Encourages sparsity (some weights → 0). Used for feature selection.
- Elastic net: combination of L1 and L2.
In transformers, weight decay is decoupled from gradient updates — that’s what AdamW does. Use AdamW, not Adam, for any modern model.
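A minimal sketch of the distinction: penalties added to the loss by hand flow through the gradients, while AdamW's `weight_decay` shrinks weights directly in the update step; the model, data, and `lam` are illustrative stand-ins:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(64, 1)
x, y = torch.randn(32, 64), torch.randn(32, 1)
lam = 1e-4                                            # illustrative penalty strength

# Coupled: L1/L2 terms added to the loss, so they show up in the gradients.
l2 = sum((w ** 2).sum() for w in model.parameters())
l1 = sum(w.abs().sum() for w in model.parameters())
loss = F.mse_loss(model(x), y) + lam * l2             # swap in lam * l1 for sparsity

# Decoupled: AdamW applies weight decay in the update step, not through the loss.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```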
Optimizers
(Already covered in calculus-and-optimization.md. Quick ML-fundamentals cheat sheet:)
| Optimizer | When | Notes |
|---|---|---|
| SGD | Vision (CNNs), simple problems | Often best generalization; needs tuning |
| SGD + momentum | Same | β=0.9 typical |
| Adam | Default for most things | β₁=0.9, β₂=0.999 |
| AdamW | Transformers, LLMs | Decoupled weight decay |
| AdaFactor | Memory-constrained LLM training | Lower memory than Adam |
| Lion, Sophia (2023+) | Frontier LLM pretraining | Sometimes outperforms AdamW |
For 95% of work, AdamW with cosine LR schedule and warmup is the right default.
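A minimal sketch of that default using PyTorch's built-in schedulers, chaining linear warmup into cosine decay; the learning rate, step counts, and stand-in model are illustrative:

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(512, 512)                     # stand-in model
total_steps, warmup_steps = 10_000, 500               # illustrative step counts

opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
schedule = SequentialLR(
    opt,
    schedulers=[
        LinearLR(opt, start_factor=0.01, total_iters=warmup_steps),   # warm up to the peak LR
        CosineAnnealingLR(opt, T_max=total_steps - warmup_steps),     # cosine decay toward zero
    ],
    milestones=[warmup_steps],
)
# In the training loop, call opt.step() then schedule.step() once per optimizer step.
```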
The loss is a contract
Whatever you put in your loss is what the model will learn. This is more subtle than it sounds:
- A model trained on MSE for income prediction will under-predict billionaires (the MSE-optimal prediction is the conditional mean, and billionaire incomes are rare enough that the conditional mean stays far below them, so the model hedges toward typical values).
- A classifier trained on cross-entropy without class weights will ignore rare classes.
- A language model trained on next-token prediction is only good at predicting the next token. It is not “trying” to be helpful, honest, or correct — those come from RLHF on top.
If the loss doesn’t match what you want, you’re going to be disappointed.
Calibration
A classifier outputs probabilities. Are those probabilities trustworthy? A well-calibrated model that says “70% confident” is right 70% of the time on those inputs.
Cross-entropy training tends to produce uncalibrated, overconfident neural networks. Fixes:
- Temperature scaling: divide logits by a scalar `T` learned on a held-out set (see the sketch after this list).
- Label smoothing: target `0.9` for the true class instead of `1.0`. Also a regularizer.
- Mixup, CutMix: data augmentations that improve calibration as a side effect.
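A minimal sketch of the first two fixes; `fit_temperature` is a made-up helper name, and the optimizer and step count are illustrative (temperature scaling is often fit with L-BFGS instead):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Learn a single scalar T on held-out logits/labels; model weights stay frozen."""
    log_T = torch.zeros(1, requires_grad=True)        # optimize log T so T stays positive
    opt = torch.optim.Adam([log_T], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(val_logits / log_T.exp(), val_labels).backward()
        opt.step()
    return log_T.exp().item()                         # divide test-time logits by this T

# Label smoothing via the built-in argument: roughly 0.9 mass on the true class.
logits  = torch.randn(32, 10)                         # stand-in model outputs
targets = torch.randint(0, 10, (32,))
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
```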
Practical loss-picking guide
| Task | Loss |
|---|---|
| Regression with normal residuals | MSE |
| Regression with outliers | Huber or MAE |
| Multi-class classification | Cross-entropy |
| Binary classification | Binary cross-entropy (BCE) |
| Imbalanced classification | Cross-entropy with class weights, or focal loss |
| Multi-label classification | BCE per label |
| Embedding learning | Contrastive (InfoNCE) or triplet |
| Language modeling | Cross-entropy on next-token |
| Ranking | Pairwise (e.g. BPR) or listwise (e.g. listMLE) |
See also
- Calculus & optimization
- Information theory
- Stage 03 — Optimizers
- Stage 10 — RLHF/DPO/GRPO — losses for preference training