
Math Foundations

Math fluency, not mastery. Linear algebra is the language; calculus drives learning; probability quantifies uncertainty; information theory measures loss. Skip it and you'll cargo-cult papers forever; spend a week and softmax stops being a black box.

If you only do one thing

Linear algebra is the language every operation in ML speaks. Read it once, then drag two vectors in the demo until dot products feel inevitable.
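
If the interactive demo isn't at hand, a minimal NumPy stand-in (the vector values are arbitrary examples):

    import numpy as np

    a = np.array([2.0, 1.0])
    b = np.array([1.0, 3.0])

    dot = a @ b  # 2*1 + 1*3 = 5.0
    cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
    print(dot, np.degrees(np.arccos(cos_theta)))  # dot product, and the angle between a and b

Re-run with b nudged toward or away from a: the dot product peaks when they align, hits 0 at 90°, and goes negative beyond that.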

Articles in this stage

  1. 01 Calculus & Optimization
  2. 02 Information Theory for ML
  3. 03 Linear Algebra for ML
  4. 04 Probability & Statistics for ML

(See the learning ladder below for the recommended reading order.)

Stage 01 — Math Foundations

“ML is mostly linear algebra wearing a trench coat.”

You don’t need to be a mathematician. You need fluency in four things:

  1. Linear algebra — the language of every operation in ML
  2. Probability & statistics — what models are uncertain about and why
  3. Calculus & optimization — how learning happens
  4. Information theory — what “loss” actually measures

This stage covers each at the depth needed to read papers, debug models, and stop nodding politely when someone says “covariance matrix.”

Prerequisites

  • High-school algebra
  • A Python environment with NumPy

Learning ladder

Read in this order:

  1. Linear algebra — vectors, matrices, dot products, eigendecomposition
  2. Probability & statistics — distributions, expectation, MLE, Bayes
  3. Calculus & optimization — derivatives, gradients, gradient descent
  4. Information theory — entropy, cross-entropy, KL divergence

The order matters: probability uses linear algebra; calculus is the bridge to optimization; information theory ties it all to loss functions.

Minimum viable understanding (MVU)

Before moving to Stage 02 you should be able to:

  • Compute a dot product, matrix-vector product, and matrix-matrix product by hand on small examples.
  • Explain why softmax outputs are non-negative and sum to 1.
  • Write the gradient descent update rule from memory: θ ← θ − η · ∇L(θ).
  • Define entropy in plain English and compute it for a 2-outcome distribution.
  • Explain what “negative log-likelihood = cross-entropy” means (the sketch after this list checks it numerically).
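
A minimal NumPy self-check for the softmax, update-rule, and cross-entropy items above (the logits and one-hot label are made-up examples):

    import numpy as np

    # Softmax: exp() makes every entry positive; dividing by the sum makes them sum to 1.
    def softmax(z):
        z = z - z.max()  # shift for numerical stability; softmax is shift-invariant (see exercise 2)
        e = np.exp(z)
        return e / e.sum()

    p = softmax(np.array([1.0, 2.0, 3.0]))
    assert (p > 0).all() and np.isclose(p.sum(), 1.0)

    # Gradient descent update: theta <- theta - eta * grad L(theta),
    # here on L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
    theta, eta = 0.0, 0.1
    theta = theta - eta * 2 * (theta - 3)

    # "Negative log-likelihood = cross-entropy": with a one-hot label y,
    # H(y, p) = -sum_i y_i log p_i collapses to -log p[true class], which is the NLL.
    y = np.array([0.0, 0.0, 1.0])  # true class is index 2
    assert np.isclose(-(y * np.log(p)).sum(), -np.log(p[2]))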

Exercises

  1. Dot product by hand. For a = [1,2,3] and b = [4,5,6], compute a·b. Then compute it in NumPy and check that the results match.
  2. Softmax intuition. Implement softmax on [1,2,3] and [101,102,103]. Same output? Why or why not?
  3. Gradient descent on a parabola. Minimize f(x) = (x−3)² starting at x=0 with η=0.1. Plot x and f(x) over 50 steps (see the sketch after this list).
  4. Entropy by hand. Compute the entropy of a fair coin and a 90/10 coin. Which is higher?
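
A reference sketch for exercises 3 and 4, assuming matplotlib for the plot (the step count and learning rate come straight from exercise 3):

    import numpy as np
    import matplotlib.pyplot as plt

    # Exercise 3: minimize f(x) = (x - 3)^2 from x = 0 with eta = 0.1.
    x, eta = 0.0, 0.1
    xs = [x]
    for _ in range(50):
        x = x - eta * 2 * (x - 3)  # f'(x) = 2(x - 3)
        xs.append(x)
    xs = np.array(xs)
    plt.plot(xs, label="x")
    plt.plot((xs - 3) ** 2, label="f(x)")
    plt.legend()
    plt.show()  # x climbs toward 3; f(x) decays toward 0

    # Exercise 4: entropy in bits, H(p) = -sum_i p_i * log2(p_i).
    def entropy(p):
        p = np.asarray(p)
        return -(p * np.log2(p)).sum()

    print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit, the maximum for two outcomes
    print(entropy([0.9, 0.1]))  # 90/10 coin: ~0.47 bits, so the fair coin is higher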

Common pitfalls

  • Skipping it because “I’ll learn it as I go.” You won’t. You’ll cargo-cult papers and never quite trust your own debugging.
  • Going too deep too soon. You don’t need measure theory or category theory. Resist the rabbit hole.
  • Not coding the math. Math you’ve only read fades. Math you’ve implemented sticks.

Where this stage feeds

  • Stage 02 (ML fundamentals) uses linear algebra and calculus end-to-end.
  • Stage 03 (neural networks) is calculus + linear algebra at scale.
  • Stage 06 (transformers) lives or dies on how well you understand matrix products and softmax.


Further reading

Books move slower than papers in this field — treat these as foundations, not replacements for the latest research. Real authors, real publishers, real editions. A (free) marker means the author-authorized full text is available online.

  1. ★ start here: Mathematics for Machine Learning (free)

    Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong
    Cambridge University Press, 2020
    The bible for this stage. Free PDF online.

  2. An Introduction to Statistical Learning (with applications in Python) (free)

    Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor
    Springer, Python ed., 2023
    The most accessible bridge from stats to ML. Free PDF online.

  3. Pattern Recognition and Machine Learning (free)

    Christopher M. Bishop
    Springer, 2006
    The deep theoretical grounding on linear models, kernels, graphical models.

  4. Probabilistic Machine Learning: An Introduction (free)

    Kevin P. Murphy
    MIT Press, 2022
    The modern, broader successor to Bishop. Free PDF online.