Stage 01 — Math Foundations

“ML is mostly linear algebra wearing a trench coat.”

You don’t need to be a mathematician. You need fluency in four things:

  1. Linear algebra — the language of every operation in ML
  2. Probability & statistics — what models are uncertain about and why
  3. Calculus & optimization — how learning happens
  4. Information theory — what “loss” actually measures

This stage covers each at the depth needed to read papers, debug models, and stop nodding politely when someone says “covariance matrix.”

Prerequisites

  • High-school algebra
  • A Python environment with NumPy

Learning ladder

Read in this order:

  1. Linear algebra — vectors, matrices, dot products, eigendecomposition
  2. Probability & statistics — distributions, expectation, MLE, Bayes
  3. Calculus & optimization — derivatives, gradients, gradient descent
  4. Information theory — entropy, cross-entropy, KL divergence

The order matters: probability uses linear algebra; calculus is the bridge to optimization; information theory ties it all to loss functions.

Minimum viable understanding (MVU)

Before moving to Stage 02 you should be able to:

  • Compute a dot product, matrix-vector product, and matrix-matrix product by hand on small examples.
  • Explain why softmax outputs are non-negative and sum to 1.
  • Write the gradient descent update rule from memory: θ ← θ − η · ∇L(θ).
  • Define entropy in plain English and compute it for a 2-outcome distribution: H = −p·log₂ p − (1−p)·log₂ (1−p).
  • Explain what “negative log-likelihood = cross-entropy” means. (The sketch below checks this and the softmax properties in code.)
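
A quick NumPy check of the softmax and cross-entropy bullets — a minimal sketch, not the only way to write it:

    import numpy as np

    def softmax(z):
        # Subtracting the max is the standard stability trick; softmax is
        # shift-invariant, so it does not change the output.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    # Non-negative (exp of anything is positive) and sums to 1
    # (we divide by the total), up to float rounding.
    p = softmax(np.array([1.0, 2.0, 3.0]))
    print(p, p.sum())  # ≈ [0.090 0.245 0.665], 1.0

    # With a one-hot label y, cross-entropy −Σᵢ yᵢ·log(pᵢ) collapses to
    # −log(p_true): the negative log-likelihood of the true class.
    y = np.array([0.0, 0.0, 1.0])  # true class is index 2
    cross_entropy = -np.sum(y * np.log(p))
    nll = -np.log(p[2])
    print(np.isclose(cross_entropy, nll))  # True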

Exercises

  1. Dot product by hand. For a = [1,2,3] and b = [4,5,6], compute a·b. Then compute it in NumPy and confirm the results match.
  2. Softmax intuition. Implement softmax on [1,2,3] and [101,102,103]. Same output? Why or why not?
  3. Gradient descent on a parabola. Minimize f(x) = (x−3)² starting at x=0 with η=0.1. Plot x and f(x) over 50 steps. (A starter scaffold follows this list.)
  4. Entropy by hand. Compute the entropy of a fair coin and a 90/10 coin. Which is higher? (A code cross-check appears after the scaffold below.)
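
For exercise 3, one way to scaffold the loop — a sketch; it assumes matplotlib for the plot, but printing the values works just as well:

    import matplotlib.pyplot as plt

    # f(x) = (x − 3)² has its minimum at x = 3; its derivative is 2(x − 3).
    def f(x):
        return (x - 3) ** 2

    def grad(x):
        return 2 * (x - 3)

    eta = 0.1            # learning rate η
    x = 0.0              # starting point
    xs, fs = [x], [f(x)]

    for _ in range(50):
        x = x - eta * grad(x)  # the update rule: θ ← θ − η · ∇L(θ)
        xs.append(x)
        fs.append(f(x))

    # x should approach 3 and f(x) should approach 0.
    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.plot(xs)
    ax1.set_title("x per step")
    ax2.plot(fs)
    ax2.set_title("f(x) per step")
    plt.show()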
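
And a cross-check for exercise 4, once you have done it by hand — log base 2 gives bits:

    import numpy as np

    def entropy2(p):
        # Entropy of the 2-outcome distribution (p, 1 − p);
        # 0·log 0 counts as 0 by convention.
        return -sum(x * np.log2(x) for x in (p, 1 - p) if x > 0)

    print(entropy2(0.5))  # fair coin: 1.0 bit — the maximum for two outcomes
    print(entropy2(0.9))  # 90/10 coin: ≈ 0.469 bits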

Common pitfalls

  • Skipping it because “I’ll learn it as I go.” You won’t. You’ll cargo-cult papers and never quite trust your own debugging.
  • Going too deep too soon. You don’t need measure theory or category theory. Resist the rabbit hole.
  • Not coding the math. Math you’ve only read fades. Math you’ve implemented sticks.

Where this stage feeds

  • Stage 02 (ML fundamentals) uses linear algebra and calculus end-to-end.
  • Stage 03 (neural networks) is calculus + linear algebra at scale.
  • Stage 06 (transformers) lives or dies on how well you understand matrix products and softmax.

See also