Loss Functions & Optimization

The loss function defines what “wrong” means. The optimizer determines how you fix it. Pick the wrong loss → solve the wrong problem. Pick the wrong optimizer → solve the right problem slowly or never.

Regression losses

Mean Squared Error (MSE)

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
  • Penalizes large errors quadratically — outliers dominate.
  • Gradient is linear in the residual: ∂ℓ/∂ŷ = 2(ŷ − y) per example (exactly ŷ − y if you keep a ½ in front). Clean to optimize.
  • Underlying assumption: residuals are Gaussian (MLE under Gaussian noise).

Mean Absolute Error (MAE)

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
  • Robust to outliers.
  • Gradient is ±1 — doesn’t shrink near the minimum, so SGD oscillates. Often paired with learning-rate decay.
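
Both are one-liners in PyTorch (F.l1_loss is the MAE); pred and target here are placeholder tensors:

import torch.nn.functional as F
mse = F.mse_loss(pred, target)
mae = F.l1_loss(pred, target)   # L1 loss = MAE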

Huber loss

ℓ(r) = (1/2) r²       if |r| ≤ δ
       δ (|r| − δ/2)  otherwise

Quadratic near zero, linear far out. Best of both worlds. Use when you have outliers and care about smooth gradients.
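
A minimal sketch of the piecewise definition (PyTorch also ships this as F.huber_loss; δ = 1.0 here is an arbitrary choice):

import torch

def huber(pred, target, delta=1.0):
    r = pred - target
    quad = 0.5 * r**2                       # quadratic near zero
    lin = delta * (r.abs() - 0.5 * delta)   # linear in the tails
    return torch.where(r.abs() <= delta, quad, lin).mean()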

Quantile / pinball loss

For predicting quantiles instead of means. Useful for forecasting with uncertainty intervals.
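
A sketch of the pinball loss for a single quantile q (q = 0.9 would give an upper forecast band; the helper name is illustrative):

import torch

def pinball(pred, target, q=0.9):
    e = target - pred
    # under-prediction costs q per unit of error, over-prediction costs (1 - q)
    return torch.maximum(q * e, (q - 1) * e).mean()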

Classification losses

Cross-entropy (log loss)

For one-hot labels: L = −log(ŷ_true_class).

For binary: L = −[y log ŷ + (1−y) log(1−ŷ)]. (Binary cross-entropy.)

This is the loss for almost every classifier and every language model.

import torch.nn.functional as F
loss = F.cross_entropy(logits, targets)   # logits, not probs
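
For the binary case, work from logits as well; F.binary_cross_entropy_with_logits folds the sigmoid into the loss for numerical stability:

loss = F.binary_cross_entropy_with_logits(logits, targets.float())   # targets in {0, 1}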

Hinge loss (SVM)

L = max(0, 1 − y · ŷ),   with labels y ∈ {−1, +1} and ŷ the raw (unsquashed) score

Used for SVMs. Enforces a margin: confidently correct predictions (y · ŷ ≥ 1) incur zero loss. Largely historical now; cross-entropy is the default.

Focal loss

Cross-entropy with a modulating factor that down-weights easy examples:

FL = −(1−p_t)^γ · log(p_t),   where p_t is the predicted probability of the true class

Originally for object detection where most boxes are negatives. Useful whenever you have severe class imbalance.
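
A minimal multi-class sketch, recovering p_t from the per-example cross-entropy (γ = 2 is the commonly used value; the helper itself is illustrative, not from a particular library):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    ce = F.cross_entropy(logits, targets, reduction="none")   # -log p_t per example
    p_t = torch.exp(-ce)                                      # back out p_t
    return ((1 - p_t) ** gamma * ce).mean()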

Embedding / contrastive losses

For learning representations.

Triplet loss

L = max(0, d(anchor, positive) − d(anchor, negative) + margin)

Pull positives close, push negatives apart by at least margin.
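
PyTorch ships this directly; anchor_emb, pos_emb, and neg_emb are placeholder batches of embeddings, and the margin value is arbitrary:

import torch.nn.functional as F
loss = F.triplet_margin_loss(anchor_emb, pos_emb, neg_emb, margin=0.2)   # Euclidean distance by default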

InfoNCE / contrastive loss

A softmax over similarities:

L = −log( exp(sim(anchor, positive)/τ) / Σ_k exp(sim(anchor, k)/τ) )

Where the sum is over the positive plus a batch of negatives. Used by CLIP, SimCLR, sentence-transformers. The temperature τ is critical.
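
A sketch with in-batch negatives, where row i of each matrix is a matched (anchor, positive) pair, in the CLIP/SimCLR style; τ = 0.07 is just a common starting point:

import torch
import torch.nn.functional as F

def info_nce(anchors, positives, tau=0.07):
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / tau                          # cosine similarities, scaled by temperature
    labels = torch.arange(len(a), device=a.device)  # the positive for row i is column i
    return F.cross_entropy(logits, labels)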

Cosine embedding loss

Variants that use cosine similarity directly. Default for sentence-embedding models.

Sequence losses

For language modeling, the loss is per-token cross-entropy, summed (or averaged) across the sequence:

L = (1/T) Σ_t −log P_θ(tokenₜ | tokens_{<t})

Modern LLM training uses this with a few twists (label smoothing, masked positions, etc.).
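
In practice this is ordinary cross-entropy with the targets shifted by one position; a sketch, assuming logits of shape [batch, seq, vocab] and pad_id as whatever id your tokenizer pads with:

import torch.nn.functional as F

shift_logits = logits[:, :-1, :]   # position t predicts token t+1
shift_labels = tokens[:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
    ignore_index=pad_id,           # don't score padding positions
)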

Regularization terms

Added to the loss to prevent overfitting.

  • L2 (weight decay): λ Σ wᵢ². Shrinks weights smoothly. Default for transformers.
  • L1: λ Σ |wᵢ|. Encourages sparsity (some weights → 0). Used for feature selection (sketch after this list).
  • Elastic net: combination of L1 and L2.
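
An L1 penalty is typically added to the loss by hand (lam, model, and task_loss are placeholders); L2 is more often handled through the optimizer's weight_decay, as below:

l1 = sum(p.abs().sum() for p in model.parameters())
loss = task_loss + lam * l1   # lam is the regularization strength λ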

In transformers, weight decay is decoupled from gradient updates — that’s what AdamW does. Use AdamW, not Adam, for any modern model.
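
A common setup, as a sketch: decay on weight matrices, none on biases or LayerNorm parameters (the 0.1 value and the ndim-based split are typical for transformer training, not universal):

import torch

decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if p.ndim < 2 else decay).append(p)   # biases and norm params have ndim 1

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4, betas=(0.9, 0.999),
)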

Optimizers

(Already covered in calculus-and-optimization.md. Quick ML-fundamentals cheat sheet:)

Optimizer              When                              Notes
SGD                    Vision (CNNs), simple problems    Often best generalization; needs tuning
SGD + momentum         Same                              β=0.9 typical
Adam                   Default for most things           β₁=0.9, β₂=0.999
AdamW                  Transformers, LLMs                Decoupled weight decay
AdaFactor              Memory-constrained LLM training   Lower memory than Adam
Lion, Sophia (2023+)   Frontier LLM pretraining          Sometimes outperforms AdamW

For 95% of work, AdamW with cosine LR schedule and warmup is the right default.
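
A minimal warmup-then-cosine schedule via LambdaLR, with placeholder step counts (libraries such as transformers provide get_cosine_schedule_with_warmup, which does the same thing):

import math
import torch

warmup_steps, total_steps = 1_000, 100_000   # placeholders

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)               # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))      # cosine decay toward 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)   # optimizer: e.g. the AdamW above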

The loss is a contract

Whatever you put in your loss is what the model will learn. This is more subtle than it sounds:

  • A model trained on MSE for income prediction will under-predict billionaires (the MSE-optimal prediction is the conditional mean, and extreme incomes are rare, so the model hedges toward the bulk of the data).
  • A classifier trained on cross-entropy without class weights will ignore rare classes.
  • A language model trained on next-token prediction is only good at predicting the next token. It is not “trying” to be helpful, honest, or correct — those come from RLHF on top.

If the loss doesn’t match what you want, you’re going to be disappointed.

Calibration

A classifier outputs probabilities. Are those probabilities trustworthy? A well-calibrated model that says “70% confident” is right 70% of the time on those inputs.

Cross-entropy training tends to produce uncalibrated, overconfident neural networks. Fixes:

  • Temperature scaling: divide logits by a learned scalar T fit on a held-out set (see the sketch after this list).
  • Label smoothing: target 0.9 for the true class instead of 1.0. Also a regularizer.
  • Mixup, CutMix: data augmentations that improve calibration as a side effect.
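
A sketch of temperature scaling: fit a single scalar T on held-out logits by minimizing NLL, then divide logits by T at inference (val_logits and val_labels are placeholders for the held-out set):

import torch
import torch.nn.functional as F

T = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.LBFGS([T], lr=0.01, max_iter=50)

def closure():
    opt.zero_grad()
    loss = F.cross_entropy(val_logits / T, val_labels)   # NLL on held-out logits/labels
    loss.backward()
    return loss

opt.step(closure)   # now use softmax(logits / T) at inference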

Practical loss-picking guide

Task                               Loss
Regression with normal residuals   MSE
Regression with outliers           Huber or MAE
Multi-class classification         Cross-entropy
Binary classification              Binary cross-entropy (BCE)
Imbalanced classification          Cross-entropy with class weights, or focal loss
Multi-label classification         BCE per label
Embedding learning                 Contrastive (InfoNCE) or triplet
Language modeling                  Cross-entropy on next-token
Ranking                            Pairwise (e.g. BPR) or listwise (e.g. listMLE)

See also