Stage 01 — Math Foundations
“ML is mostly linear algebra wearing a trench coat.”
You don’t need to be a mathematician. You need fluency in four things:
- Linear algebra — the language of every operation in ML
- Probability & statistics — what models are uncertain about and why
- Calculus & optimization — how learning happens
- Information theory — what “loss” actually measures
This stage covers each at the depth needed to read papers, debug models, and stop nodding politely when someone says “covariance matrix.”
Prerequisites
- High-school algebra
- A Python environment with NumPy
Learning ladder
Read in this order:
- Linear algebra — vectors, matrices, dot products, eigendecomposition
- Probability & statistics — distributions, expectation, MLE, Bayes
- Calculus & optimization — derivatives, gradients, gradient descent
- Information theory — entropy, cross-entropy, KL divergence
The order matters: probability uses linear algebra; calculus is the bridge to optimization; information theory ties it all to loss functions.
Minimum viable understanding (MVU)
Before moving to Stage 02 you should be able to:
- Compute a dot product, matrix-vector product, and matrix-matrix product by hand on small examples.
- Explain why softmax outputs are non-negative and sum to 1.
- Write the gradient descent update rule from memory: θ ← θ − η · ∇L(θ).
- Define entropy in plain English and compute it for a 2-outcome distribution.
- Explain what “negative log-likelihood = cross-entropy” means (see the sketch after this list).
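If you want to check these items in code, here is a minimal NumPy sketch. The `softmax` and `entropy` helpers are illustrative, not from any library:

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick; it doesn't
    # change the result because softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def entropy(p):
    # Shannon entropy in bits: H(p) = -sum_i p_i * log2(p_i).
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

# Dot product and matrix-vector product, checked against hand computation.
a, b = np.array([1, 2, 3]), np.array([4, 5, 6])
print(a @ b)  # 1*4 + 2*5 + 3*6 = 32
A = np.array([[1, 2], [3, 4]])
print(A @ np.array([5, 6]))  # [1*5 + 2*6, 3*5 + 4*6] = [17, 39]

# Softmax outputs are non-negative (exp of anything is > 0) and sum to 1
# (we divide by the sum).
s = softmax(np.array([1.0, 2.0, 3.0]))
print(s, s.sum())

# Entropy of a 2-outcome distribution: a fair coin is maximally uncertain.
print(entropy([0.5, 0.5]))  # 1.0 bit
print(entropy([0.9, 0.1]))  # ~0.47 bits

# "Negative log-likelihood = cross-entropy": with a one-hot target, the
# cross-entropy sum collapses to -log(probability assigned to the true class).
target = np.array([0.0, 0.0, 1.0])  # true class is index 2
print(-np.sum(target * np.log(s)))  # cross-entropy H(target, s)
print(-np.log(s[2]))                # NLL of the true class: same number
```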
Exercises
- Dot product by hand. For `a = [1,2,3]` and `b = [4,5,6]`, compute `a·b`. Then compute it in NumPy. Match.
- Softmax intuition. Implement softmax on `[1,2,3]` and `[101,102,103]`. Same output? Why or why not?
- Gradient descent on a parabola. Minimize `f(x) = (x−3)²` starting at `x=0` with `η=0.1`. Plot `x` and `f(x)` over 50 steps (a starter sketch follows this list).
- Entropy by hand. Compute the entropy of a fair coin and a 90/10 coin. Which is higher?
Common pitfalls
- Skipping it because “I’ll learn it as I go.” You won’t. You’ll cargo-cult papers and never quite trust your own debugging.
- Going too deep too soon. You don’t need measure theory or category theory. Resist the rabbit hole.
- Not coding the math. Math you’ve only read fades. Math you’ve implemented sticks.
Where this stage feeds
- Stage 02 (ML fundamentals) uses linear algebra and calculus end-to-end.
- Stage 03 (neural networks) is calculus + linear algebra at scale.
- Stage 06 (transformers) lives or dies on how well you understand matrix products and softmax.