Supervised Learning
You have inputs x paired with labels y. You want a function f(x) ≈ y. That’s supervised learning.
The setup
- Training set: {(x₁, y₁), ..., (xₙ, yₙ)}, drawn i.i.d. from a true distribution P(x, y).
- Hypothesis class: the family of functions you’ll search over (e.g. linear functions, MLPs).
- Loss: a measure of how wrong f(x) is vs. y.
- Objective: minimize the expected loss E_(x,y)~P [ℓ(f(x), y)]. We approximate the expectation with the training set.
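A tiny numerical sketch of that approximation (the linear model, weight, and data below are made up for illustration): the empirical risk is just the average loss over the training set.

```python
import numpy as np

# Hypothetical model f(x) = w * x with squared loss.
def f(x, w=2.0):
    return w * x

def sq_loss(pred, y):
    return (pred - y) ** 2

# A small "training set" standing in for samples from P(x, y).
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.1, 3.9, 6.2])

# Empirical risk: the training-set average of the loss,
# our stand-in for E_(x,y)~P [loss(f(x), y)].
empirical_risk = np.mean(sq_loss(f(xs), ys))
```

Training then means searching the hypothesis class (here, the weight w) for the function that minimizes this average.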
Two flavors
Regression
y is continuous. Predict a number.
- House price prediction
- Revenue forecasting
- Temperature estimation
Common losses: mean squared error (MSE), mean absolute error (MAE), Huber.
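All three can be written out in a few lines of NumPy (a sketch; delta=1.0 is just a common default for the Huber threshold):

```python
import numpy as np

def mse(y_true, y_pred):
    # Squared error: penalizes large misses heavily.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Absolute error: more robust to outliers than MSE.
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for small errors, linear for large ones:
    # smooth near zero like MSE, robust in the tails like MAE.
    err = np.abs(y_true - y_pred)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quad, lin))
```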
Classification
y is discrete. Predict a class.
- Spam vs. ham (binary)
- ImageNet category (1000-way)
- Toxicity detection (multi-label, can have multiple positive classes)
Common loss: cross-entropy.
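For a single example, cross-entropy is the negative log of the probability the model assigns to the true class. A minimal sketch:

```python
import numpy as np

def cross_entropy(probs, true_class):
    # probs: predicted class probabilities (should sum to 1).
    # Loss is -log p(true class): near 0 when the model is
    # confident and correct, large when it puts little mass
    # on the true class.
    return -np.log(probs[true_class])

probs = np.array([0.7, 0.2, 0.1])  # a 3-way classifier's output
loss = cross_entropy(probs, true_class=0)  # -log(0.7) ~ 0.357
```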
The train/val/test split
You never evaluate on training data — the model has seen it.
- Training set (60–80%): the model fits to this.
- Validation set (10–20%): used to tune hyperparameters, pick best epoch, compare model variants.
- Test set (10–20%): touched once, at the end, to get an unbiased estimate.
Pitfall — data leakage: if information about the test set sneaks into training (duplicate examples, leaked features, normalizer fit on full data), your test number is a lie. A common cause of “the model worked great in dev but fails in prod.”
For time-series: split by time, not at random. The model must never train on data from after the period it is evaluated on.
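The split and the leakage pitfall can both be sketched with scikit-learn: carve out 60/20/20, then fit the normalizer on the training portion only (data here is random, names illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# 60/20/20: carve off 40%, then halve it into val and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Fit the normalizer on training data ONLY, then apply it
# everywhere. Fitting it on the full data leaks test-set
# statistics into training -- the pitfall above.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```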
Cross-validation
For small datasets, a single split is noisy. k-fold cross-validation trains k models on k different splits and averages.
from sklearn.model_selection import cross_val_score

# model: any sklearn-compatible estimator; X, y: the full dataset
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
5-fold and 10-fold are typical. Don’t use this on huge data — it’s expensive and unnecessary.
Generalization
A model that fits training data perfectly but fails on new data has overfit. The test set quantifies this gap.
The bias–variance tradeoff (next stages) is the main lens. In short:
- Bias: model too simple → systematic error (underfit)
- Variance: model too complex → memorizes noise (overfit)
The sweet spot depends on your data and is found via the validation set.
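The tradeoff is easy to see by sweeping model complexity. A sketch on synthetic data, fitting polynomials of increasing degree and comparing train vs. validation error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic 1-D data: a noisy sine wave, split into train/validation.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 60)
X_tr, X_va, y_tr, y_va = X[::2], X[1::2], y[::2], y[1::2]

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    val_mse = mean_squared_error(y_va, model.predict(X_va))
    results[degree] = (train_mse, val_mse)
    # Degree 1 underfits (high bias): both errors stay high.
    # Degree 15 overfits (high variance): train error shrinks
    # while validation error does not follow it down.
    print(f"degree={degree:2d}  train={train_mse:.3f}  val={val_mse:.3f}")
```

The sweet spot is the degree where validation error bottoms out, exactly the model-selection role the validation set plays.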
Features matter more than models
Two practical truths:
- A great feature with a bad model often beats a bad feature with a great model.
- Most ML wins in industry come from getting more/better data, not from a fancier algorithm.
Modern deep learning collapses this somewhat — neural nets learn features end-to-end. But for tabular ML, feature engineering remains primary.
Modern variations
- Self-supervised (Stage 04+): no labels needed; the model creates its own targets (e.g. predict the next token).
- Semi-supervised: a little labeled data + a lot of unlabeled.
- Weakly supervised: noisy or imprecise labels.
- Few-shot / zero-shot: large pretrained models (LLMs, CLIP) classify with almost no labeled data.
The shift from classical to modern AI is largely a shift away from “collect 1M labeled examples” toward “use a pretrained foundation model and adapt it.”
Common pitfalls
- Class imbalance. 99% of emails aren’t spam — predict “ham” always and you’re 99% accurate. Use precision/recall/F1, not accuracy.
- Distribution shift. Train on 2024 data, deploy in 2026, performance drops. Monitor in production.
- Label noise. Annotators disagree. Build a small “gold” eval set you trust.
- Overconfidence in held-out scores. On a noisy benchmark where annotators agree only 95% of the time, roughly 95% accuracy is the ceiling; a 92% test score has about 3 points of headroom, not 8.
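The class-imbalance trap from the first bullet is easy to reproduce (a toy sketch with scikit-learn metrics):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 1000 emails: 10 spam (label 1), 990 ham (label 0).
y_true = np.array([1] * 10 + [0] * 990)
# A degenerate "model" that always predicts ham.
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)            # 0.99: looks great
rec = recall_score(y_true, y_pred)              # 0.0: catches zero spam
f1 = f1_score(y_true, y_pred, zero_division=0)  # 0.0
```

Accuracy rewards the majority class; recall and F1 expose that the model never finds a single spam email.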