Classical ML Algorithms

You can’t appreciate deep learning without seeing what it replaced. And on tabular data, you should usually try these first.

Linear regression

The simplest predictor: ŷ = w · x + b. Fit w, b to minimize MSE.

  • Closed-form solution: w = (XᵀX)⁻¹ Xᵀy (the normal equations).
  • Or solve via gradient descent (scales to bigger data).
  • Assumes linear relationship + Gaussian residuals.
  • Add interactions (x₁ · x₂) or polynomial features for non-linearity.

When to use: as a baseline. If linear regression with reasonable features can’t beat a trivial baseline (predicting the mean), your features are bad.
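
A minimal NumPy sketch of the closed-form fit, using toy data invented purely for illustration:

import numpy as np

# toy data: y = 3*x1 - 2*x2 + 1 + noise (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)
# fold the bias b into w by appending a column of ones
Xb = np.hstack([X, np.ones((len(X), 1))])
# normal equations w = (XᵀX)⁻¹ Xᵀy, solved with lstsq for numerical stability
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(w)  # ≈ [3, -2, 1]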

Logistic regression

Binary classification: ŷ = sigmoid(w · x + b). Fit by minimizing binary cross-entropy.

  • Output is a calibrated probability.
  • No closed-form; solve via gradient descent or Newton’s method.
  • Multinomial extension via softmax = a one-layer neural network.

A logistic regression with engineered features was the workhorse of industry ML through ~2014.
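
A minimal scikit-learn sketch, assuming X_train/y_train hold a binary-labeled tabular dataset and X_val a held-out split, as in the later snippets:

from sklearn.linear_model import LogisticRegression

# fit minimizes binary cross-entropy (with L2 regularization by default)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# predict_proba returns P(class | x) rather than just a hard label
p = clf.predict_proba(X_val)[:, 1]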

k-Nearest Neighbors (kNN)

No training. To predict, find the k closest training points and vote (classification) or average (regression).

  • Simple, intuitive, surprisingly competitive on small data.
  • Slow at inference (O(n) per query, unless you use indexes).
  • Curse of dimensionality: in high-dim space, “nearest” becomes meaningless.

Modern echo: vector search in RAG (Stage 09) is approximate kNN over embeddings.
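
A minimal sketch with scikit-learn, assuming the same X_train/y_train/X_val:

from sklearn.neighbors import KNeighborsClassifier

# "fit" just stores the training data (and builds a tree index when that helps)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# each prediction is a majority vote among the 5 nearest training points
preds = knn.predict(X_val)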

Decision trees

Recursively split the data on the feature/threshold that maximizes information gain (or minimizes Gini impurity / variance).

  • Capture non-linearities and interactions automatically.
  • Interpretable (each leaf = a rule path).
  • High variance — small data changes → very different trees.

Single trees are weak; ensembles of trees are powerful.
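
A sketch of a single shallow tree and its rule paths, assuming the same X_train/y_train and a feature_names list (a name assumed here for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

# limiting depth tames the variance a little; the default split criterion is Gini impurity
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
# every leaf corresponds to a readable if/then rule path
print(export_text(tree, feature_names=feature_names))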

Random forests

Train many trees, each on a bootstrap sample with a random feature subset per split. Average their predictions.

  • Reduces variance dramatically.
  • Robust default — usually competitive on tabular data with little tuning.
  • Feature importances built in.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)
print(rf.feature_importances_)  # impurity-based feature importances, built in

Gradient-boosted trees

Build trees sequentially, each fitting the residuals of the ensemble so far.

Fₘ(x) = Fₘ₋₁(x) + ν · hₘ(x)

where hₘ is a tree fit to the negative gradient of the loss with respect to Fₘ₋₁ (for squared error, just the residuals), and ν is the learning rate.

Implementations:

  • XGBoost: the original of the modern implementations; regularized and fast.
  • LightGBM: fastest, leaf-wise growth.
  • CatBoost: best with categorical features, ordered boosting.

On tabular data, gradient-boosted trees beat deep learning most of the time as of 2026. Try them first.

import lightgbm as lgb
# max_depth=-1 means no depth limit; complexity is controlled by num_leaves and early stopping
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, max_depth=-1)
# stop adding trees once the validation loss hasn't improved for 50 rounds
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], callbacks=[lgb.early_stopping(50)])

Support Vector Machines (SVMs)

Find the hyperplane with the maximum margin between classes. With the kernel trick, it can learn non-linear boundaries.

  • Powerful on small to medium data.
  • Don’t scale well to millions of examples (kernel computations are O(n²)).
  • Largely supplanted by trees and neural nets, but still taught and used in some niches.
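
A minimal sketch with scikit-learn's kernelized SVC, assuming the same X_train/y_train/X_val; SVMs are sensitive to feature scale, so in practice put a StandardScaler in front, as in the pipeline example below:

from sklearn.svm import SVC

# RBF kernel gives a non-linear boundary; C trades margin width against margin violations
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
preds = svm.predict(X_val)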

Naive Bayes

Apply Bayes’ rule with a strong “naive” assumption: features are conditionally independent given the class.

P(class | features) ∝ P(class) · Π P(featureᵢ | class)

  • Fast, simple, surprisingly effective on text classification.
  • Used to be the default spam classifier.
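
A minimal text-classification sketch with multinomial Naive Bayes; train_texts and train_labels are hypothetical names for a list of raw documents and their labels:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# bag-of-words counts, then class-conditional independence over those counts
nb = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", MultinomialNB()),
])
nb.fit(train_texts, train_labels)  # train_texts: list of str, train_labels: one class per document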

Linear/Quadratic Discriminant Analysis (LDA/QDA)

Generative classifiers assuming Gaussian features per class. LDA assumes shared covariance; QDA fits separate covariances.

  • Strong baselines when assumptions hold.
  • LDA is also a dimensionality-reduction technique (project onto axes that maximize class separation).
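
A sketch of both classifiers, plus LDA as a projection, assuming the same X_train/y_train:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)     # shared covariance across classes
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)  # one covariance per class
# LDA as dimensionality reduction: n_components can be at most n_classes - 1,
# so this line assumes at least three classes
X_proj = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_train, y_train)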

Clustering algorithms

Recap from unsupervised-learning.md:

  • k-means, DBSCAN/HDBSCAN, hierarchical clustering, GMM.

Dimensionality reduction

  • PCA (linear), t-SNE (visualization), UMAP (modern default), autoencoders (neural).
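
A minimal PCA sketch with scikit-learn, assuming X is a numeric feature matrix:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)              # keep the two highest-variance directions
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # fraction of variance each component retains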

Practical advice

  1. Start with a baseline: linear/logistic regression with engineered features. Beat it before reaching for anything fancy.
  2. For tabular: gradient-boosted trees (LightGBM, XGBoost, CatBoost). Often the right answer.
  3. For unstructured data (text, image, audio): use foundation model embeddings + a simple classifier on top, or fine-tune the foundation model directly.
  4. Hyperparameter search: random search beats grid search; Bayesian optimization (Optuna) beats random search (sketch after the pipeline example below).
  5. Pipelines: scikit-learn Pipeline keeps preprocessing + model together. Use them.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
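
For item 4, a minimal Optuna sketch tuning the LightGBM classifier from earlier; the search space is an illustrative assumption, not a recommendation:

import lightgbm as lgb
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
    }
    model = lgb.LGBMClassifier(**params)
    # 3-fold cross-validated accuracy is the quantity Optuna maximizes
    return cross_val_score(model, X_train, y_train, cv=3, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)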

When NOT to use classical ML

  • Unstructured data (text, images, audio): the foundation model era starts here.
  • When you need to compose components (search → reason → act): agents (Stage 11).
  • When you need natural language interfaces: LLMs.
  • When you have huge data and the structure is local (sequences, images): deep learning.

But for “I have a CSV and I want a number,” reach for LightGBM before anything else.

See also