Loss Functions & Optimization
The loss function defines what “wrong” means. The optimizer determines how you fix it. Pick the wrong loss → solve the wrong problem. Pick the wrong optimizer → solve the right problem slowly or never.
Regression losses
Mean Squared Error (MSE)
MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
- Penalizes large errors quadratically — outliers dominate.
- Gradient is linear in the residual: ∂L/∂ŷ ∝ (ŷ − y). Clean to optimize.
- Underlying assumption: residuals are Gaussian (MLE under Gaussian noise).
Mean Absolute Error (MAE)
MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
- Robust to outliers.
- Gradient is ±1 — doesn’t shrink near the minimum, so SGD oscillates. Often paired with learning-rate decay.
Huber loss
ℓ(r) = (1/2) r²          if |r| ≤ δ
ℓ(r) = δ (|r| − δ/2)     otherwise
Quadratic near zero, linear far out. Best of both worlds. Use when you have outliers and care about smooth gradients.
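A minimal PyTorch sketch comparing the three regression losses on the same residuals; the toy tensors and `delta=1.0` are illustrative, and `F.huber_loss` is the built-in piecewise form:

```python
import torch
import torch.nn.functional as F

y_hat = torch.tensor([2.0, 0.5, 9.0])   # toy predictions; the last one misses badly
y     = torch.tensor([2.5, 0.0, 1.0])   # targets; the outlier residual is 8

print(F.mse_loss(y_hat, y))                 # outlier contributes 64 before averaging
print(F.l1_loss(y_hat, y))                  # outlier contributes 8
print(F.huber_loss(y_hat, y, delta=1.0))    # quadratic for |r| <= 1, linear beyond
```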
Quantile / pinball loss
For predicting quantiles instead of means. Useful for forecasting with uncertainty intervals.
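The pinball loss itself is not written out above; a minimal sketch, assuming the standard form for a quantile level `q` (`q=0.9` and the helper name are illustrative):

```python
import torch

def pinball_loss(y_hat, y, q=0.9):
    """Pinball (quantile) loss: minimized by the q-th conditional quantile, not the mean."""
    r = y - y_hat                                   # residual
    return torch.mean(torch.maximum(q * r, (q - 1) * r))
```

Training one head per quantile (e.g. 0.1, 0.5, 0.9) is a simple way to get an uncertainty interval around a forecast.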
Classification losses
Cross-entropy (log loss)
For one-hot labels: L = −log(ŷ_true_class).
For binary: L = −[y log ŷ + (1−y) log(1−ŷ)]. (Binary cross-entropy.)
This is the loss for almost every classifier and every language model.
```python
import torch.nn.functional as F

loss = F.cross_entropy(logits, targets)  # logits, not probs
```
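For the binary formula, the numerically stable route is to hand raw logits to the loss rather than sigmoid outputs; a minimal sketch with illustrative tensors:

```python
import torch
import torch.nn.functional as F

logits  = torch.tensor([1.2, -0.7, 3.1])    # raw scores, one per example
targets = torch.tensor([1.0, 0.0, 1.0])     # binary labels as floats
loss = F.binary_cross_entropy_with_logits(logits, targets)  # applies the sigmoid internally
```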
Hinge loss (SVM)
L = max(0, 1 − y · ŷ), with labels y ∈ {−1, +1} and ŷ the raw score
Used for SVMs. Gives a “margin” — predictions far on the right side get zero loss. Largely historical now; cross-entropy is the default.
Focal loss
Cross-entropy with a modulating factor that down-weights easy examples:
FL = −(1−p_t)^γ · log(p_t)
Originally for object detection where most boxes are negatives. Useful whenever you have severe class imbalance.
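A minimal sketch of the binary form of this formula, layered on BCE-with-logits; `gamma=2.0` is a commonly used value and the helper name is illustrative:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss: down-weight well-classified examples by (1 - p_t)**gamma.

    targets is a float tensor of 0.0 / 1.0 labels.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # = -log(p_t)
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)
    return ((1 - p_t) ** gamma * ce).mean()
```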
Embedding / contrastive losses
For learning representations.
Triplet loss
L = max(0, d(anchor, positive) − d(anchor, negative) + margin)
Pull positives close, push negatives apart by at least margin.
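A minimal sketch using the built-in PyTorch triplet loss; the random embeddings and `margin=1.0` are illustrative:

```python
import torch
import torch.nn.functional as F

anchor   = torch.randn(8, 128)    # batch of anchor embeddings
positive = torch.randn(8, 128)    # same item, different view
negative = torch.randn(8, 128)    # different item

loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)  # Euclidean distance by default
```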
InfoNCE / contrastive loss
A softmax over similarities:
L = −log( exp(sim(anchor, positive)/τ) / Σ_k exp(sim(anchor, k)/τ) )
Where the sum is over the positive plus a batch of negatives. Used by CLIP, SimCLR, sentence-transformers. The temperature τ is critical.
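A minimal in-batch sketch of this formula, assuming row i of `z1` pairs with row i of `z2` and every other row in the batch acts as a negative; `tau=0.07` and the function name are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.07):
    """In-batch InfoNCE: the positive pair sits on the diagonal of the similarity matrix."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau                  # cosine similarities scaled by temperature
    labels = torch.arange(z1.size(0))         # index of each row's positive
    return F.cross_entropy(logits, labels)
```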
Cosine embedding loss
Variants that score pairs with cosine similarity directly. A common default for sentence-embedding models.
Sequence losses
For language modeling, the loss is per-token cross-entropy, summed (or averaged) across the sequence:
L = (1/T) Σ_t −log P_θ(tokenₜ | tokens_{<t})
Modern LLM training uses this with a few twists (label smoothing, masked positions, etc.).
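A minimal sketch of the shifted next-token cross-entropy; padded or masked positions are skipped via `ignore_index`, and the helper name is illustrative:

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, tokens, ignore_index=-100):
    """Per-token cross-entropy: the prediction at position t is scored against token t+1."""
    shift_logits = logits[:, :-1, :]          # (batch, T-1, vocab): predictions for the next token
    shift_labels = tokens[:, 1:]              # (batch, T-1): the tokens actually observed next
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=ignore_index,            # positions marked with this id contribute no loss
    )
```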
Regularization terms
Added to the loss to prevent overfitting.
- L2 (weight decay): λ Σ wᵢ². Shrinks weights smoothly. Default for transformers.
- L1: λ Σ |wᵢ|. Encourages sparsity (some weights → 0). Used for feature selection.
- Elastic net: combination of L1 and L2.
In transformers, weight decay is decoupled from gradient updates — that’s what AdamW does. Use AdamW, not Adam, for any modern model.
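A minimal sketch of the distinction: penalties added to the loss by hand flow through the gradients, while AdamW's `weight_decay` shrinks weights directly in the update step; the model, data, and `lam` are illustrative stand-ins:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(64, 1)
x, y = torch.randn(32, 64), torch.randn(32, 1)
lam = 1e-4                                            # illustrative penalty strength

# Coupled: L1/L2 terms added to the loss, so they show up in the gradients.
l2 = sum((w ** 2).sum() for w in model.parameters())
l1 = sum(w.abs().sum() for w in model.parameters())
loss = F.mse_loss(model(x), y) + lam * l2             # swap in lam * l1 for sparsity

# Decoupled: AdamW applies weight decay in the update step, not through the loss.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```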
Optimizers
(Already covered in calculus-and-optimization.md. Quick ML-fundamentals cheat sheet:)
| Optimizer | When | Notes |
|---|---|---|
| SGD | Vision (CNNs), simple problems | Often best generalization; needs tuning |
| SGD + momentum | Same | β=0.9 typical |
| Adam | Default for most things | β₁=0.9, β₂=0.999 |
| AdamW | Transformers, LLMs | Decoupled weight decay |
| AdaFactor | Memory-constrained LLM training | Lower memory than Adam |
| Lion, Sophia (2023+) | Frontier LLM pretraining | Sometimes outperforms AdamW |
For 95% of work, AdamW with cosine LR schedule and warmup is the right default.
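A minimal sketch of that default using PyTorch's built-in schedulers, chaining linear warmup into cosine decay; the learning rate, step counts, and stand-in model are illustrative:

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(512, 512)                     # stand-in model
total_steps, warmup_steps = 10_000, 500               # illustrative step counts

opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
schedule = SequentialLR(
    opt,
    schedulers=[
        LinearLR(opt, start_factor=0.01, total_iters=warmup_steps),   # warm up to the peak LR
        CosineAnnealingLR(opt, T_max=total_steps - warmup_steps),     # cosine decay toward zero
    ],
    milestones=[warmup_steps],
)
# In the training loop, call opt.step() then schedule.step() once per optimizer step.
```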
The loss is a contract
Whatever you put in your loss is what the model will learn. This is more subtle than it sounds:
- A model trained on MSE for income prediction will under-predict billionaires (the MSE-optimal prediction is the conditional mean, and billionaire incomes are rare enough that the conditional mean stays far below them, so the model hedges toward typical values).
- A classifier trained on cross-entropy without class weights will ignore rare classes.
- A language model trained on next-token prediction is only good at predicting the next token. It is not “trying” to be helpful, honest, or correct — those come from RLHF on top.
If the loss doesn’t match what you want, you’re going to be disappointed.
Calibration
A classifier outputs probabilities. Are those probabilities trustworthy? A well-calibrated model that says “70% confident” is right 70% of the time on those inputs.
Cross-entropy training tends to produce uncalibrated, overconfident neural networks. Fixes:
- Temperature scaling: divide logits by a scalar `T` learned on a held-out set (see the sketch after this list).
- Label smoothing: target `0.9` for the true class instead of `1.0`. Also a regularizer.
- Mixup, CutMix: data augmentations that improve calibration as a side effect.
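A minimal sketch of the first two fixes; `fit_temperature` is a made-up helper name, and the optimizer and step count are illustrative (temperature scaling is often fit with L-BFGS instead):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Learn a single scalar T on held-out logits/labels; model weights stay frozen."""
    log_T = torch.zeros(1, requires_grad=True)        # optimize log T so T stays positive
    opt = torch.optim.Adam([log_T], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(val_logits / log_T.exp(), val_labels).backward()
        opt.step()
    return log_T.exp().item()                         # divide test-time logits by this T

# Label smoothing via the built-in argument: roughly 0.9 mass on the true class.
logits  = torch.randn(32, 10)                         # stand-in model outputs
targets = torch.randint(0, 10, (32,))
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
```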
Practical loss-picking guide
| Task | Loss |
|---|---|
| Regression with normal residuals | MSE |
| Regression with outliers | Huber or MAE |
| Multi-class classification | Cross-entropy |
| Binary classification | Binary cross-entropy (BCE) |
| Imbalanced classification | Cross-entropy with class weights, or focal loss |
| Multi-label classification | BCE per label |
| Embedding learning | Contrastive (InfoNCE) or triplet |
| Language modeling | Cross-entropy on next-token |
| Ranking | Pairwise (e.g. BPR) or listwise (e.g. listMLE) |
See also
- Calculus & optimization
- Information theory
- Stage 03 — Optimizers
- Stage 10 — RLHF/DPO/GRPO — losses for preference training