Supervised Learning
You have inputs x paired with labels y. You want a function f(x) ≈ y. That’s supervised learning.
The setup
- Training set: {(x₁, y₁), ..., (xₙ, yₙ)}, drawn i.i.d. from a true distribution P(x, y).
- Hypothesis class: the family of functions you’ll search over (e.g. linear functions, MLPs).
- Loss: a measure of how wrong f(x) is vs. y.
- Objective: minimize the expected loss E_(x,y)~P [ℓ(f(x), y)]. We approximate the expectation with the training set.
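A tiny numerical sketch of that approximation (the linear model, weight, and data below are made up for illustration): the empirical risk is just the average loss over the training set.

```python
import numpy as np

# Hypothetical model f(x) = w * x with squared loss.
def f(x, w=2.0):
    return w * x

def sq_loss(pred, y):
    return (pred - y) ** 2

# A small "training set" standing in for samples from P(x, y).
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.1, 3.9, 6.2])

# Empirical risk: the training-set average of the loss,
# our stand-in for E_(x,y)~P [loss(f(x), y)].
empirical_risk = np.mean(sq_loss(f(xs), ys))
```

Training then means searching the hypothesis class (here, the weight w) for the function that minimizes this average.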
Two flavors
Regression
y is continuous. Predict a number.
- House price prediction
- Revenue forecasting
- Temperature estimation
Common losses: mean squared error (MSE), mean absolute error (MAE), Huber.
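All three can be written out in a few lines of NumPy (a sketch; delta=1.0 is just a common default for the Huber threshold):

```python
import numpy as np

def mse(y_true, y_pred):
    # Squared error: penalizes large misses heavily.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Absolute error: more robust to outliers than MSE.
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for small errors, linear for large ones:
    # smooth near zero like MSE, robust in the tails like MAE.
    err = np.abs(y_true - y_pred)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quad, lin))
```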
Classification
y is discrete. Predict a class.
- Spam vs. ham (binary)
- ImageNet category (1000-way)
- Toxicity detection (multi-label, can have multiple positive classes)
Common loss: cross-entropy.
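For a single example, cross-entropy is the negative log of the probability the model assigns to the true class. A minimal sketch:

```python
import numpy as np

def cross_entropy(probs, true_class):
    # probs: predicted class probabilities (should sum to 1).
    # Loss is -log p(true class): near 0 when the model is
    # confident and correct, large when it puts little mass
    # on the true class.
    return -np.log(probs[true_class])

probs = np.array([0.7, 0.2, 0.1])  # a 3-way classifier's output
loss = cross_entropy(probs, true_class=0)  # -log(0.7) ~ 0.357
```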
The train/val/test split
You never evaluate on training data — the model has seen it.
- Training set (60–80%): the model fits to this.
- Validation set (10–20%): used to tune hyperparameters, pick best epoch, compare model variants.
- Test set (10–20%): touched once, at the end, to get an unbiased estimate.
Pitfall — data leakage: if information about the test set sneaks into training (duplicate examples, leaked features, normalizer fit on full data), your test number is a lie. A common cause of “the model worked great in dev but fails in prod.”
For time-series: split by time, not at random. The model must never train on data from after the period it is evaluated on.
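The split and the leakage pitfall can both be sketched with scikit-learn: carve out 60/20/20, then fit the normalizer on the training portion only (data here is random, names illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# 60/20/20: carve off 40%, then halve it into val and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Fit the normalizer on training data ONLY, then apply it
# everywhere. Fitting it on the full data leaks test-set
# statistics into training -- the pitfall above.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```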
Cross-validation
For small datasets, a single split is noisy. k-fold cross-validation trains k models on k different splits and averages.
from sklearn.model_selection import cross_val_score

# model: any sklearn-compatible estimator; X, y: the full dataset
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
5-fold and 10-fold are typical. Don’t use this on huge data — it’s expensive and unnecessary.
Generalization
A model that fits training data perfectly but fails on new data has overfit. The test set quantifies this gap.
The bias–variance tradeoff (next stages) is the main lens. In short:
- Bias: model too simple → systematic error (underfit)
- Variance: model too complex → memorizes noise (overfit)
The sweet spot depends on your data and is found via the validation set.
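The tradeoff is easy to see by sweeping model complexity. A sketch on synthetic data, fitting polynomials of increasing degree and comparing train vs. validation error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic 1-D data: a noisy sine wave, split into train/validation.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 60)
X_tr, X_va, y_tr, y_va = X[::2], X[1::2], y[::2], y[1::2]

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    val_mse = mean_squared_error(y_va, model.predict(X_va))
    results[degree] = (train_mse, val_mse)
    # Degree 1 underfits (high bias): both errors stay high.
    # Degree 15 overfits (high variance): train error shrinks
    # while validation error does not follow it down.
    print(f"degree={degree:2d}  train={train_mse:.3f}  val={val_mse:.3f}")
```

The sweet spot is the degree where validation error bottoms out, exactly the model-selection role the validation set plays.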
Features matter more than models
Two practical truths:
- A great feature with a bad model often beats a bad feature with a great model.
- Most ML wins in industry come from getting more/better data, not from a fancier algorithm.
Modern deep learning collapses this somewhat — neural nets learn features end-to-end. But for tabular ML, feature engineering remains primary.
Modern variations
- Self-supervised (Stage 04+): no labels needed; the model creates its own targets (e.g. predict the next token).
- Semi-supervised: a little labeled data + a lot of unlabeled.
- Weakly supervised: noisy or imprecise labels.
- Few-shot / zero-shot: large pretrained models (LLMs, CLIP) classify with almost no labeled data.
The shift from classical to modern AI is largely a shift away from “collect 1M labeled examples” toward “use a pretrained foundation model and adapt it.”
Common pitfalls
- Class imbalance. 99% of emails aren’t spam — predict “ham” always and you’re 99% accurate. Use precision/recall/F1, not accuracy.
- Distribution shift. Train on 2024 data, deploy in 2026, performance drops. Monitor in production.
- Label noise. Annotators disagree. Build a small “gold” eval set you trust.
- Overconfidence in held-out scores. On a noisy benchmark where annotators agree only 95% of the time, roughly 95% accuracy is the ceiling; a 92% test score has about 3 points of headroom, not 8.
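The class-imbalance trap from the first bullet is easy to reproduce (a toy sketch with scikit-learn metrics):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 1000 emails: 10 spam (label 1), 990 ham (label 0).
y_true = np.array([1] * 10 + [0] * 990)
# A degenerate "model" that always predicts ham.
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)            # 0.99: looks great
rec = recall_score(y_true, y_pred)              # 0.0: catches zero spam
f1 = f1_score(y_true, y_pred, zero_division=0)  # 0.0
```

Accuracy rewards the majority class; recall and F1 expose that the model never finds a single spam email.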