Stage 01 — Math Foundations

“ML is mostly linear algebra wearing a trench coat.”

You don’t need to be a mathematician. You need fluency in four things:

  1. Linear algebra — the language of every operation in ML
  2. Probability & statistics — what models are uncertain about and why
  3. Calculus & optimization — how learning happens
  4. Information theory — what “loss” actually measures

This stage covers each at the depth needed to read papers, debug models, and stop nodding politely when someone says “covariance matrix.”

Prerequisites

  • High-school algebra
  • A Python environment with NumPy

Learning ladder

Read in this order:

  1. Linear algebra — vectors, matrices, dot products, eigendecomposition
  2. Probability & statistics — distributions, expectation, MLE, Bayes
  3. Calculus & optimization — derivatives, gradients, gradient descent
  4. Information theory — entropy, cross-entropy, KL divergence

The order matters: probability uses linear algebra; calculus is the bridge to optimization; information theory ties it all to loss functions.

Minimum viable understanding (MVU)

Before moving to Stage 02 you should be able to:

  • Compute a dot product, matrix-vector product, and matrix-matrix product by hand on small examples.
  • Explain why softmax outputs are non-negative and sum to 1.
  • Write the gradient descent update rule from memory: θ ← θ − η · ∇L(θ).
  • Define entropy in plain English and compute it for a 2-outcome distribution: H = −p·log₂ p − (1−p)·log₂ (1−p).
  • Explain what “negative log-likelihood = cross-entropy” means. (The sketch below checks this and the softmax properties in code.)
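
A quick NumPy check of the softmax and cross-entropy bullets — a minimal sketch, not the only way to write it:

    import numpy as np

    def softmax(z):
        # Subtracting the max is the standard stability trick; softmax is
        # shift-invariant, so it does not change the output.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    # Non-negative (exp of anything is positive) and sums to 1
    # (we divide by the total), up to float rounding.
    p = softmax(np.array([1.0, 2.0, 3.0]))
    print(p, p.sum())  # ≈ [0.090 0.245 0.665], 1.0

    # With a one-hot label y, cross-entropy −Σᵢ yᵢ·log(pᵢ) collapses to
    # −log(p_true): the negative log-likelihood of the true class.
    y = np.array([0.0, 0.0, 1.0])  # true class is index 2
    cross_entropy = -np.sum(y * np.log(p))
    nll = -np.log(p[2])
    print(np.isclose(cross_entropy, nll))  # True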

Exercises

  1. Dot product by hand. For a = [1,2,3] and b = [4,5,6], compute a·b. Then compute it in NumPy and confirm the results match.
  2. Softmax intuition. Implement softmax on [1,2,3] and [101,102,103]. Same output? Why or why not?
  3. Gradient descent on a parabola. Minimize f(x) = (x−3)² starting at x=0 with η=0.1. Plot x and f(x) over 50 steps. (A starter scaffold follows this list.)
  4. Entropy by hand. Compute the entropy of a fair coin and a 90/10 coin. Which is higher? (A code cross-check appears after the scaffold below.)
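
For exercise 3, one way to scaffold the loop — a sketch; it assumes matplotlib for the plot, but printing the values works just as well:

    import matplotlib.pyplot as plt

    # f(x) = (x − 3)² has its minimum at x = 3; its derivative is 2(x − 3).
    def f(x):
        return (x - 3) ** 2

    def grad(x):
        return 2 * (x - 3)

    eta = 0.1            # learning rate η
    x = 0.0              # starting point
    xs, fs = [x], [f(x)]

    for _ in range(50):
        x = x - eta * grad(x)  # the update rule: θ ← θ − η · ∇L(θ)
        xs.append(x)
        fs.append(f(x))

    # x should approach 3 and f(x) should approach 0.
    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.plot(xs)
    ax1.set_title("x per step")
    ax2.plot(fs)
    ax2.set_title("f(x) per step")
    plt.show()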
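
And a cross-check for exercise 4, once you have done it by hand — log base 2 gives bits:

    import numpy as np

    def entropy2(p):
        # Entropy of the 2-outcome distribution (p, 1 − p);
        # 0·log 0 counts as 0 by convention.
        return -sum(x * np.log2(x) for x in (p, 1 - p) if x > 0)

    print(entropy2(0.5))  # fair coin: 1.0 bit — the maximum for two outcomes
    print(entropy2(0.9))  # 90/10 coin: ≈ 0.469 bits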

Common pitfalls

  • Skipping it because “I’ll learn it as I go.” You won’t. You’ll cargo-cult papers and never quite trust your own debugging.
  • Going too deep too soon. You don’t need measure theory or category theory. Resist the rabbit hole.
  • Not coding the math. Math you’ve only read fades. Math you’ve implemented sticks.

Where this stage feeds

  • Stage 02 (ML fundamentals) uses linear algebra and calculus end-to-end.
  • Stage 03 (neural networks) is calculus + linear algebra at scale.
  • Stage 06 (transformers) lives or dies on how well you understand matrix products and softmax.

See also