Classical ML Algorithms
You can’t appreciate deep learning without seeing what it replaced. And on tabular data, you should usually try these first.
Linear regression
The simplest predictor: ŷ = w · x + b. Fit w, b to minimize MSE.
- Closed-form solution: w = (XᵀX)⁻¹ Xᵀy (the normal equations).
- Or solve via gradient descent (scales to bigger data).
- Assumes linear relationship + Gaussian residuals.
- Add interactions (x₁ · x₂) or polynomial features for non-linearity.
When to use: a baseline. If linear regression with reasonable features doesn’t beat random, your features are bad.
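A minimal NumPy sketch of the closed-form fit (the helper name `fit_linear` is ours; it solves the same least-squares problem as the normal equations, but via `lstsq` rather than inverting XᵀX, which is numerically safer):

```python
import numpy as np

def fit_linear(X, y):
    # Fold the bias b into w by appending a column of ones.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    # Least-squares solve: equivalent to the normal equations,
    # without explicitly forming (XᵀX)⁻¹.
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w[:-1], w[-1]  # weights, bias
```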
Logistic regression
Binary classification: ŷ = sigmoid(w · x + b). Fit by minimizing binary cross-entropy.
- Output is a calibrated probability.
- No closed-form; solve via gradient descent or Newton’s method.
- Multinomial extension via softmax = a one-layer neural network.
A logistic regression with engineered features was the workhorse of industry ML through ~2014.
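A sketch of the gradient-descent fit (our helper `fit_logistic`; full-batch, labels y ∈ {0, 1}):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        g = p - y  # gradient of binary cross-entropy w.r.t. the logits
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b
```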
k-Nearest Neighbors (kNN)
No training. To predict, find the k closest training points and vote (classification) or average (regression).
- Simple, intuitive, surprisingly competitive on small data.
- Slow at inference (O(n) per query, unless you use indexes).
- Curse of dimensionality: in high-dim space, “nearest” becomes meaningless.
Modern echo: vector search in RAG (Stage 09) is approximate kNN over embeddings.
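A scikit-learn sketch (exact search; at scale you would swap in an approximate-nearest-neighbor index; `X_test` is assumed alongside the `X_train`/`y_train` used elsewhere in this section):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
preds = knn.predict(X_test)  # each prediction: majority vote of the 5 nearest neighbors
```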
Decision trees
Recursively split the data on the feature/threshold that maximizes information gain (or minimizes Gini impurity / variance).
- Capture non-linearities and interactions automatically.
- Interpretable (each leaf = a rule path).
- High variance — small data changes → very different trees.
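A quick scikit-learn sketch; `export_text` prints each learned rule path, which is where the interpretability comes from:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(export_text(tree))  # nested if/else rules, one leaf per path
```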
Single trees are weak; ensembles of trees are powerful.
Random forests
Train many trees, each on a bootstrap sample with a random feature subset per split. Average their predictions.
- Reduces variance dramatically.
- Robust default — usually competitive on tabular data with little tuning.
- Feature importances built in.
```python
from sklearn.ensemble import RandomForestClassifier

# 500 trees, trained in parallel; the other defaults are sensible.
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1).fit(X_train, y_train)
```
Gradient-boosted trees
Build trees sequentially, each fitting the residuals of the ensemble so far.
F_m(x) = F_{m-1}(x) + ν · h_m(x)
where h_m is a tree fit to the negative gradient of the loss w.r.t. F_{m-1} (the pseudo-residuals), and ν is the learning rate.
Implementations:
- XGBoost: rigorous, fast, the original.
- LightGBM: fastest, leaf-wise growth.
- CatBoost: best with categorical features, ordered boosting.
On tabular data, gradient-boosted trees beat deep learning most of the time as of 2026. Try them first.
```python
import lightgbm as lgb

model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, max_depth=-1)
# Stop adding trees once the validation score hasn't improved for 50 rounds.
model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(50)])
```
Support Vector Machines (SVMs)
Find the hyperplane with maximum margin between classes. With the kernel trick, SVMs can learn non-linear boundaries.
- Powerful on small to medium data.
- Don’t scale well to millions of examples (kernel computations are O(n²)).
- Largely supplanted by trees and neural nets, but still taught and used in some niches.
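A sketch with scikit-learn's `SVC` (the RBF kernel is the usual non-linear default):

```python
from sklearn.svm import SVC

# C trades margin width against training errors; gamma sets the RBF kernel's reach.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
```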
Naive Bayes
Apply Bayes’ rule with a strong “naive” assumption: features are conditionally independent given the class.
P(class | features) ∝ P(class) · Π P(featureᵢ | class)
- Fast, simple, surprisingly effective on text classification.
- Used to be the default spam classifier.
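The classic text setup, sketched (the names `texts_train`/`labels_train` are placeholders for raw strings and their labels):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Bag-of-words counts + multinomial NB: the old-school spam-filter recipe.
clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts_train, labels_train)
```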
Linear/Quadratic Discriminant Analysis (LDA/QDA)
Generative classifiers assuming Gaussian features per class. LDA assumes shared covariance; QDA fits separate covariances.
- Strong baselines when assumptions hold.
- LDA is also a dimensionality-reduction technique (project onto axes that maximize class separation).
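A sketch of the dimensionality-reduction use (assumes at least three classes, since LDA projects onto at most n_classes − 1 axes):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)
X_2d = lda.transform(X_train)  # supervised 2-D projection maximizing class separation
```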
Clustering algorithms
Recap from unsupervised-learning.md:
- k-means, DBSCAN/HDBSCAN, hierarchical clustering, GMM.
Dimensionality reduction
- PCA (linear), t-SNE (visualization), UMAP (modern default), autoencoders (neural).
Practical advice
- Start with a baseline: linear/logistic regression with engineered features. Beat it before reaching for anything fancy.
- For tabular: gradient-boosted trees (LightGBM, XGBoost, CatBoost). Often the right answer.
- For unstructured data (text, image, audio): use foundation model embeddings + a simple classifier on top, or fine-tune the foundation model directly.
- Hyperparameter search: random search beats grid search; Bayesian optimization (Optuna) beats random search. See the sketch after the pipeline example below.
- Pipelines: scikit-learn `Pipeline` keeps preprocessing + model together. Use them.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),  # fit on train only; reapplied automatically at predict time
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
```
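And the Optuna sketch promised above (the search space and ranges are illustrative, not a recommendation):

```python
import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
    }
    # Mean 3-fold cross-validation accuracy is the value being maximized.
    return cross_val_score(lgb.LGBMClassifier(**params), X_train, y_train, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```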
When NOT to use classical ML
- Unstructured data (text, images, audio): the foundation model era starts here.
- When you need to compose components (search → reason → act): agents (Stage 11).
- When you need natural language interfaces: LLMs.
- When you have huge data and the structure is local (sequences, images): deep learning.
But for “I have a CSV and I want a number,” reach for LightGBM before anything else.