Probability & Statistics for ML

ML models are fundamentally about uncertainty: given some data, what’s the most likely value of something we haven’t seen yet? Probability is how we reason about that.

The vocabulary

A random variable takes values according to some probability distribution. Capital letters by convention: X, Y. Lowercase for specific values: x, y.

A distribution assigns probability mass (discrete) or density (continuous) to outcomes.

A probability is a number in [0, 1]. The probabilities of all possible outcomes sum/integrate to 1.

Discrete distributions

Bernoulli

A single yes/no event with probability p of success.

  • P(X = 1) = p, P(X = 0) = 1 − p
  • Used for: binary classification outputs, individual coin flips, “did the user click?”

Categorical

The multi-class generalization. P(X = k) = pₖ for k = 1..K, with Σ pₖ = 1.

  • Used for: multi-class classification (after softmax), token sampling in language models, dice rolls.

The output of softmax is a categorical distribution over K classes. Every time a language model picks a token, it samples from a categorical distribution over the vocabulary.
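
To make that concrete, here's a minimal NumPy sketch: turn some made-up logits into a categorical distribution with softmax, then sample from it the way a decoder picks a token.

    import numpy as np

    rng = np.random.default_rng(0)

    logits = np.array([2.0, 0.5, -1.0])        # made-up scores for 3 classes
    probs = np.exp(logits - logits.max())      # stable softmax: subtract the max first
    probs /= probs.sum()                       # a categorical distribution: sums to 1

    token = rng.choice(len(probs), p=probs)    # draw one class index, like sampling a token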

Binomial

n independent Bernoulli trials with the same p. Models “how many successes in n tries?”

Poisson

Counts of rare events in a fixed interval. Used for arrival rates, defect counts, request rates.

Continuous distributions

Uniform

Equally likely over an interval [a, b]. Used in initialization, rejection sampling, simple priors.

Normal (Gaussian)

The bell curve. Parameterized by mean μ and variance σ².

p(x) = (1 / √(2πσ²)) · exp(−(x−μ)² / (2σ²))
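
A quick sanity check of the formula in NumPy: evaluate the density on a grid and confirm it integrates to 1 (the choice of μ and σ is arbitrary).

    import numpy as np

    mu, sigma = 1.0, 2.0                        # arbitrary parameters
    x = np.linspace(-20.0, 22.0, 200_001)

    # the density formula above, written out
    pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

    print((pdf * (x[1] - x[0])).sum())          # ≈ 1.0: a density integrates to 1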

Why it shows up everywhere:

  • Central Limit Theorem: sums of many independent variables with finite variance tend toward normal, regardless of the underlying distribution.
  • Closed-form math for many operations.
  • Maximum-entropy distribution for a given mean and variance.
  • Used in weight initialization (e.g. Xavier, He), VAEs, diffusion models.

Multivariate normal

A normal distribution over a vector. Parameterized by mean vector μ and covariance matrix Σ.

The covariance matrix encodes how features co-vary. Diagonal Σ = independent features. Full Σ = features can correlate.
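
A small sketch of the "full Σ = features can correlate" point (the 0.8 covariance is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)

    mu = np.array([0.0, 0.0])
    cov = np.array([[1.0, 0.8],                 # off-diagonal 0.8: the two features co-vary
                    [0.8, 1.0]])

    samples = rng.multivariate_normal(mu, cov, size=100_000)
    print(np.cov(samples.T))                    # empirical covariance ≈ the Σ we put in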

Joint, marginal, conditional

Three ways of slicing the same joint distribution.

  • Joint P(X, Y): probability of both.
  • Marginal P(X) = Σ_y P(X, Y=y): probability of X with Y summed out.
  • Conditional P(X | Y) = P(X, Y) / P(Y): probability of X given that Y happened.
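
The three definitions on a made-up 2×2 joint table:

    import numpy as np

    # Joint P(X, Y) over two binary variables (made-up numbers, sums to 1)
    joint = np.array([[0.3, 0.1],               # rows: X = 0, 1
                      [0.2, 0.4]])              # cols: Y = 0, 1

    p_x = joint.sum(axis=1)                     # marginal P(X): sum Y out -> [0.4, 0.6]
    p_y = joint.sum(axis=0)                     # marginal P(Y)            -> [0.5, 0.5]
    p_x_given_y1 = joint[:, 1] / p_y[1]         # conditional P(X | Y=1)   -> [0.2, 0.8]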

A language model is a chain of conditional distributions: each token is sampled from P(token_t | tokens_{<t}). The whole field of generative AI is built on conditional distributions.

Bayes’ rule

P(A | B) = P(B | A) · P(A) / P(B)

Read it as: “posterior = likelihood × prior / evidence.” This is how we update beliefs.
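
Plugging in some made-up spam-filter numbers to watch the update happen:

    # prior and likelihoods are invented for illustration
    p_spam = 0.2                                # P(spam)
    p_word_given_spam = 0.5                     # P("free" | spam)
    p_word_given_ham = 0.05                     # P("free" | not spam)

    evidence = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
    posterior = p_word_given_spam * p_spam / evidence
    print(posterior)                            # ≈ 0.71: one word moves the belief a lot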

In ML:

  • Naive Bayes classifier — direct application
  • Bayesian neural networks — distributions over weights
  • MAP estimation — find the maximum a posteriori parameters
  • Diffusion models — the reverse (denoising) step is derived by applying Bayes’ rule to the forward noising process

Expectation, variance, covariance

Expectation E[X] is the mean. For a discrete X: E[X] = Σ x · P(X=x). For continuous: E[X] = ∫ x p(x) dx.

Variance Var(X) = E[(X − E[X])²]. How spread out the values are.

Covariance Cov(X, Y) = E[(X − E[X])(Y − E[Y])]. How two variables move together.

Correlation = covariance normalized by standard deviations. Ranges in [-1, 1].
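
The definitions map directly onto NumPy's estimators; a quick sketch on synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    y = 2 * x + rng.normal(size=10_000)         # y depends on x, so they should co-vary

    print(x.mean(), x.var())                    # sample estimates of E[X] and Var(X)
    print(np.cov(x, y)[0, 1])                   # Cov(X, Y) ≈ 2
    print(np.corrcoef(x, y)[0, 1])              # correlation, in [-1, 1]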

These appear:

  • Loss functions are expectations: L(θ) = E_(x,y)~D [ ℓ(f_θ(x), y) ]
  • Gradient estimators have variance — that’s why batch size matters
  • PCA finds directions of maximum variance

Maximum likelihood estimation (MLE)

Pick parameters that make the observed data most probable.

θ_MLE = argmax_θ P(data | θ) = argmax_θ Π_i P(x_i | θ)

We usually take the log (the log is monotonic, so the argmax is unchanged, and a sum is numerically more stable than a product):

θ_MLE = argmax_θ Σ_i log P(x_i | θ)

Most ML training is MLE. Cross-entropy loss = negative log-likelihood under a categorical model. Squared loss = negative log-likelihood under a Gaussian model. When you minimize loss, you’re maximizing likelihood.
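
A tiny sketch of MLE in action: for a unit-variance Gaussian (an arbitrary choice here), the log-likelihood over candidate means peaks at the sample mean.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=3.0, scale=1.0, size=1_000)    # pretend we don't know the mean

    # log-likelihood of a unit-variance Gaussian, up to a constant, on a grid of candidate means
    mus = np.linspace(0.0, 6.0, 601)
    log_lik = np.array([-0.5 * np.sum((data - mu) ** 2) for mu in mus])

    print(mus[log_lik.argmax()], data.mean())            # the argmax lands on the sample mean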

Maximum a posteriori (MAP)

Like MLE but with a prior:

θ_MAP = argmax_θ P(data | θ) · P(θ)

A prior on weights = regularization. Gaussian prior on weights → L2 regularization. Laplace prior → L1.
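
Continuing the Gaussian-mean sketch above, here's how a prior shows up as a penalty (the zero-mean, unit-variance prior is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=3.0, scale=1.0, size=10)       # tiny dataset, so the prior matters

    mus = np.linspace(-2.0, 6.0, 801)
    log_lik = np.array([-0.5 * np.sum((data - mu) ** 2) for mu in mus])
    log_prior = -0.5 * mus ** 2                          # Gaussian prior at 0 = an L2 penalty on mu

    print(mus[log_lik.argmax()],                         # MLE: the sample mean
          mus[(log_lik + log_prior).argmax()])           # MAP: pulled toward the prior's mean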

Sampling

Drawing values from a distribution.

  • Inverse-CDF sampling — works for any 1D distribution if you have the CDF (sketch after this list).
  • Rejection sampling — propose, accept with probability proportional to target/proposal.
  • MCMC (Markov chain Monte Carlo) — for high-dim distributions where direct sampling fails.
  • Reparameterization trick — sample from a simple distribution, transform deterministically. Used in VAEs and diffusion training.
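
Of these, inverse-CDF sampling is the easiest to show; a minimal sketch for an exponential distribution:

    import numpy as np

    rng = np.random.default_rng(0)

    # Exponential(rate): CDF F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -log(1 - u) / rate
    rate = 2.0
    u = rng.uniform(size=100_000)                        # uniform samples in [0, 1)
    samples = -np.log(1 - u) / rate                      # push them through the inverse CDF

    print(samples.mean())                                # ≈ 1 / rate = 0.5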

In language models, decoding = sampling from the categorical distribution at each step, with knobs:

  • Temperature: divides logits by T before softmax. T < 1 sharpens (greedier); T > 1 flattens (more random).
  • Top-k: keep only the k highest-probability tokens, renormalize, sample.
  • Top-p (nucleus): keep the smallest set of tokens whose cumulative probability exceeds p, renormalize, sample.
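
A minimal sketch of all three knobs on made-up logits (50 "tokens" stand in for a real vocabulary; implementations often mask logits instead, but renormalizing probabilities like this is equivalent):

    import numpy as np

    rng = np.random.default_rng(0)
    logits = rng.normal(size=50)                 # made-up logits over a toy 50-token vocabulary

    def softmax(z):
        z = z - z.max()                          # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    probs = softmax(logits / 0.7)                # temperature T = 0.7: sharper, greedier

    # top-k: keep the k highest-probability tokens, renormalize
    k = 10
    keep_k = np.argsort(probs)[-k:]
    p_topk = np.zeros_like(probs)
    p_topk[keep_k] = probs[keep_k]
    p_topk /= p_topk.sum()

    # top-p (nucleus): smallest set whose cumulative probability exceeds p = 0.9
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep_p = order[: np.searchsorted(cum, 0.9) + 1]
    p_nucleus = np.zeros_like(probs)
    p_nucleus[keep_p] = probs[keep_p]
    p_nucleus /= p_nucleus.sum()

    token = rng.choice(len(probs), p=p_nucleus)  # sample the next "token"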

The KL divergence

Measures how different two distributions are.

KL(P || Q) = Σ_x P(x) log(P(x) / Q(x))

Properties:

  • Always ≥ 0
  • KL = 0 iff P = Q
  • Asymmetric: KL(P||Q) ≠ KL(Q||P)

Used in:

  • VAEs (loss = reconstruction + KL to prior)
  • RLHF / DPO / PPO (KL penalty keeps the fine-tuned model close to the reference)
  • Knowledge distillation (student matches teacher distribution via KL)
  • Variational inference

Pitfall: KL(P || Q) is infinite wherever Q(x) = 0 but P(x) > 0. In practice, smooth or clip Q away from zero, or use a stabilized variant.
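
A stabilized version along those lines (the clipping epsilon is an arbitrary choice):

    import numpy as np

    def kl(p, q, eps=1e-12):
        p = np.asarray(p, dtype=float)
        q = np.clip(np.asarray(q, dtype=float), eps, None)   # keep q away from exact zero
        mask = p > 0                                          # terms with p(x) = 0 contribute 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = np.array([0.5, 0.5, 0.0])
    q = np.array([0.9, 0.1, 0.0])
    print(kl(p, q), kl(q, p))                                 # nonzero, and not equal: asymmetric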

Cross-entropy

The loss every classifier uses.

H(P, Q) = − Σ_x P(x) log Q(x) = H(P) + KL(P || Q)

For one-hot labels (P puts all mass on the true class), cross-entropy collapses to −log Q(true_class). That’s nn.CrossEntropyLoss in PyTorch.
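
A quick check in PyTorch that the built-in loss really is −log Q(true_class) computed from logits (the logits here are made up; F.cross_entropy is the functional form of nn.CrossEntropyLoss):

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([[1.0, 2.0, 0.5, -1.0]])   # one example, 4 classes, made-up scores
    target = torch.tensor([2])                        # true class index

    manual = -F.log_softmax(logits, dim=-1)[0, 2]     # -log Q(true_class)
    builtin = F.cross_entropy(logits, target)         # what nn.CrossEntropyLoss computes
    print(manual.item(), builtin.item())              # same number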

Statistics: estimation and confidence

When working with finite data:

  • Sample mean estimates true mean. Variance of the estimate shrinks as 1/n.
  • Standard error = σ/√n. The “noise” in your mean estimate.
  • Confidence intervals: a 95% CI is “if I repeated this experiment many times, 95% of computed intervals would contain the true value.”
  • Hypothesis testing: not how ML evals usually work, but you’ll see p-values in research papers.
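
The standard-error and confidence-interval bullets are easy to check on a made-up eval (a minimal sketch, using the normal approximation):

    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.binomial(1, 0.7, size=500).astype(float)   # pretend eval: 500 pass/fail results

    mean = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(scores))          # standard error of the mean
    print(mean, (mean - 1.96 * se, mean + 1.96 * se))       # rough 95% CI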

Practical pitfalls

  • Underflow. Multiplying many probabilities → 0. Always work in log space (see the sketch after this list).
  • Assuming independence that isn’t there. “i.i.d.” is an assumption that breaks in time-series data, social data, and most real-world data.
  • Confusing P(A|B) with P(B|A). Doctors do this. So do engineers. Bayes’ rule keeps you honest.
  • Sampling with vs. without replacement. SGD analyses often assume sampling with replacement, but in practice mini-batches are typically drawn without replacement within an epoch (shuffle, then slice); usually fine, but know which one your math assumes.
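
The underflow pitfall, in two lines (the probabilities are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    probs = rng.uniform(0.01, 0.2, size=2_000)   # many smallish probabilities

    print(np.prod(probs))                        # underflows to 0.0
    print(np.log(probs).sum())                   # the same quantity, safely, in log space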

Exercises

  1. Compute a posterior. A test for a disease has 99% sensitivity and 95% specificity. The disease has 1% base rate. You test positive. What’s P(disease | positive)?
  2. Implement softmax. Write softmax in NumPy. Then write the stable version (subtract max from logits). Verify they give the same answer for normal inputs and that only the stable one survives [1000, 1001, 1002].
  3. Cross-entropy by hand. True label is class 2 (one-hot [0, 0, 1, 0]). Predicted probs are [0.1, 0.2, 0.6, 0.1]. Compute cross-entropy. Now if predicted is [0.0, 0.0, 1.0, 0.0]. Now if [0.25, 0.25, 0.25, 0.25]. Order them.
  4. KL and entropy. Write KL(P || Q) in NumPy. Verify KL(P||P) = 0 and KL is asymmetric on a small example.

See also