Classical ML Algorithms

You can’t appreciate deep learning without seeing what it replaced. And on tabular data, you should usually try these first.

Linear regression

The simplest predictor: ŷ = w · x + b. Fit w, b to minimize MSE.

  • Closed-form solution: w = (XᵀX)⁻¹ Xᵀy (the normal equations).
  • Or solve via gradient descent (scales to bigger data).
  • Assumes linear relationship + Gaussian residuals.
  • Add interactions (x₁ · x₂) or polynomial features for non-linearity.

When to use: as a baseline. If linear regression with reasonable features can’t beat a trivial baseline (predicting the mean), your features are bad.
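
A minimal NumPy sketch of the closed-form fit, using toy data invented purely for illustration:

import numpy as np

# toy data: y = 3*x1 - 2*x2 + 1 + noise (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)
# fold the bias b into w by appending a column of ones
Xb = np.hstack([X, np.ones((len(X), 1))])
# normal equations w = (XᵀX)⁻¹ Xᵀy, solved with lstsq for numerical stability
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(w)  # ≈ [3, -2, 1]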

Logistic regression

Binary classification: ŷ = sigmoid(w · x + b). Fit by minimizing binary cross-entropy.

  • Output is a calibrated probability.
  • No closed-form; solve via gradient descent or Newton’s method.
  • Multinomial extension via softmax = a one-layer neural network.

A logistic regression with engineered features was the workhorse of industry ML through ~2014.
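
A minimal scikit-learn sketch, assuming X_train/y_train hold a binary-labeled tabular dataset and X_val a held-out split, as in the later snippets:

from sklearn.linear_model import LogisticRegression

# fit minimizes binary cross-entropy (with L2 regularization by default)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# predict_proba returns P(class | x) rather than just a hard label
p = clf.predict_proba(X_val)[:, 1]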

k-Nearest Neighbors (kNN)

No training. To predict, find the k closest training points and vote (classification) or average (regression).

  • Simple, intuitive, surprisingly competitive on small data.
  • Slow at inference (O(n) per query, unless you use indexes).
  • Curse of dimensionality: in high-dim space, “nearest” becomes meaningless.

Modern echo: vector search in RAG (Stage 09) is approximate kNN over embeddings.
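
A minimal sketch with scikit-learn, assuming the same X_train/y_train/X_val:

from sklearn.neighbors import KNeighborsClassifier

# "fit" just stores the training data (and builds a tree index when that helps)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# each prediction is a majority vote among the 5 nearest training points
preds = knn.predict(X_val)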

Decision trees

Recursively split the data on the feature/threshold that maximizes information gain (or minimizes Gini impurity / variance).

  • Capture non-linearities and interactions automatically.
  • Interpretable (each leaf = a rule path).
  • High variance — small data changes → very different trees.

Single trees are weak; ensembles of trees are powerful.
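
A sketch of a single shallow tree and its rule paths, assuming the same X_train/y_train and a feature_names list (a name assumed here for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

# limiting depth tames the variance a little; the default split criterion is Gini impurity
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
# every leaf corresponds to a readable if/then rule path
print(export_text(tree, feature_names=feature_names))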

Random forests

Train many trees, each on a bootstrap sample with a random feature subset per split. Average their predictions.

  • Reduces variance dramatically.
  • Robust default — usually competitive on tabular data with little tuning.
  • Feature importances built in.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)
print(rf.feature_importances_)  # impurity-based feature importances, built in

Gradient-boosted trees

Build trees sequentially, each fitting the residuals of the ensemble so far.

Fₘ(x) = Fₘ₋₁(x) + ν · hₘ(x)

where hₘ is a tree fit to the negative gradient of the loss with respect to Fₘ₋₁ (for squared error, just the residuals), and ν is the learning rate.

Implementations:

  • XGBoost: the original of the modern implementations; regularized and fast.
  • LightGBM: fastest, leaf-wise growth.
  • CatBoost: best with categorical features, ordered boosting.

On tabular data, gradient-boosted trees beat deep learning most of the time as of 2026. Try them first.

import lightgbm as lgb
# max_depth=-1 means no depth limit; complexity is controlled by num_leaves and early stopping
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, max_depth=-1)
# stop adding trees once the validation loss hasn't improved for 50 rounds
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], callbacks=[lgb.early_stopping(50)])

Support Vector Machines (SVMs)

Find the hyperplane with the maximum margin between classes. With the kernel trick, it can learn non-linear boundaries.

  • Powerful on small to medium data.
  • Don’t scale well to millions of examples (kernel computations are O(n²)).
  • Largely supplanted by trees and neural nets, but still taught and used in some niches.
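
A minimal sketch with scikit-learn's kernelized SVC, assuming the same X_train/y_train/X_val; SVMs are sensitive to feature scale, so in practice put a StandardScaler in front, as in the pipeline example below:

from sklearn.svm import SVC

# RBF kernel gives a non-linear boundary; C trades margin width against margin violations
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
preds = svm.predict(X_val)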

Naive Bayes

Apply Bayes’ rule with a strong “naive” assumption: features are conditionally independent given the class.

P(class | features) ∝ P(class) · Π P(featureᵢ | class)

  • Fast, simple, surprisingly effective on text classification.
  • Used to be the default spam classifier.
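
A minimal text-classification sketch with multinomial Naive Bayes; train_texts and train_labels are hypothetical names for a list of raw documents and their labels:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# bag-of-words counts, then class-conditional independence over those counts
nb = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", MultinomialNB()),
])
nb.fit(train_texts, train_labels)  # train_texts: list of str, train_labels: one class per document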

Linear/Quadratic Discriminant Analysis (LDA/QDA)

Generative classifiers assuming Gaussian features per class. LDA assumes shared covariance; QDA fits separate covariances.

  • Strong baselines when assumptions hold.
  • LDA is also a dimensionality-reduction technique (project onto axes that maximize class separation).
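
A sketch of both classifiers, plus LDA as a projection, assuming the same X_train/y_train:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)     # shared covariance across classes
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)  # one covariance per class
# LDA as dimensionality reduction: n_components can be at most n_classes - 1,
# so this line assumes at least three classes
X_proj = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_train, y_train)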

Clustering algorithms

Recap from unsupervised-learning.md:

  • k-means, DBSCAN/HDBSCAN, hierarchical clustering, GMM.

Dimensionality reduction

  • PCA (linear), t-SNE (visualization), UMAP (modern default), autoencoders (neural).
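
A minimal PCA sketch with scikit-learn, assuming X is a numeric feature matrix:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)              # keep the two highest-variance directions
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # fraction of variance each component retains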

Practical advice

  1. Start with a baseline: linear/logistic regression with engineered features. Beat it before reaching for anything fancy.
  2. For tabular: gradient-boosted trees (LightGBM, XGBoost, CatBoost). Often the right answer.
  3. For unstructured data (text, image, audio): use foundation model embeddings + a simple classifier on top, or fine-tune the foundation model directly.
  4. Hyperparameter search: random search beats grid search; Bayesian optimization (Optuna) beats random search (sketch after the pipeline example below).
  5. Pipelines: scikit-learn Pipeline keeps preprocessing + model together. Use them.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
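
For item 4, a minimal Optuna sketch tuning the LightGBM classifier from earlier; the search space is an illustrative assumption, not a recommendation:

import lightgbm as lgb
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
    }
    model = lgb.LGBMClassifier(**params)
    # 3-fold cross-validated accuracy is the quantity Optuna maximizes
    return cross_val_score(model, X_train, y_train, cv=3, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)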

When NOT to use classical ML

  • Unstructured data (text, images, audio): the foundation model era starts here.
  • When you need to compose components (search → reason → act): agents (Stage 11).
  • When you need natural language interfaces: LLMs.
  • When you have huge data and the structure is local (sequences, images): deep learning.

But for “I have a CSV and I want a number,” reach for LightGBM before anything else.

See also