Stage 01 — Math Foundations: Solutions

Worked solutions for Stage 1 exercises.

Dependencies: numpy, matplotlib, torch (used only for the backprop check).

Linear algebra

Verify dot product = projection

For a = [3, 4] and b = [1, 0], compute a · b. Then compute |a| |b| cos θ where θ is the angle between them.

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

# Algebraic: dot product
dot = a @ b
print("a·b =", dot)              # 3.0

# Geometric: |a| |b| cos θ
mag_a = np.linalg.norm(a)        # √(9+16) = 5
mag_b = np.linalg.norm(b)        # 1
cos_theta = (a @ b) / (mag_a * mag_b)   # 0.6
geometric = mag_a * mag_b * cos_theta
print("|a||b|cosθ =", geometric)  # 3.0

Both equal 3. The geometric reading: b is a unit vector along x, so a·b is the x-component of a, which is 3.
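
To see the projection explicitly (a small check reusing a and b from above): the scalar projection of a onto b is (a·b)/|b|, which equals a·b here because b is a unit vector.

# scalar projection of a onto b
proj_len = (a @ b) / np.linalg.norm(b)
print("scalar projection =", proj_len)        # 3.0 — same as a·b since |b| = 1

# projection vector: that length times the unit vector along b
proj_vec = proj_len * b / np.linalg.norm(b)
print("projection vector =", proj_vec)        # [3. 0.] — the x-component of a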

Implement matmul

Write matmul(A, B) from scratch in pure Python. Check against A @ B.

import numpy as np

def matmul(A, B):
    rows_a, cols_a = len(A), len(A[0])
    rows_b, cols_b = len(B), len(B[0])
    assert cols_a == rows_b, "inner dims must match"

    out = [[0.0] * cols_b for _ in range(rows_a)]
    for i in range(rows_a):
        for j in range(cols_b):
            s = 0.0
            for k in range(cols_a):
                s += A[i][k] * B[k][j]
            out[i][j] = s
    return out

A = [[1, 2, 3], [4, 5, 6]]      # 2x3
B = [[7, 8], [9, 10], [11, 12]]  # 3x2

mine = matmul(A, B)
ref = (np.array(A) @ np.array(B)).tolist()

print(mine)   # [[58.0, 64.0], [139.0, 154.0]]
print(ref)    # [[58, 64], [139, 154]]
assert np.allclose(mine, ref)
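
As an extra sanity check beyond the exercise (reusing matmul from above), the same comparison can be run on random matrices:

rng = np.random.default_rng(0)
A2 = rng.normal(size=(4, 7)).tolist()
B2 = rng.normal(size=(7, 3)).tolist()
assert np.allclose(matmul(A2, B2), np.array(A2) @ np.array(B2))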

Cosine vs Euclidean

Generate three vectors. Rank pairs by cosine similarity and by Euclidean distance. Are the rankings the same? Why not?

import numpy as np

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])    # same direction as a, longer
c = np.array([0.0, 1.0])    # orthogonal to a

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euc(u, v):
    return np.linalg.norm(u - v)

pairs = [("a,b", a, b), ("a,c", a, c), ("b,c", b, c)]

print("Cosine similarity:")
for name, u, v in sorted(pairs, key=lambda p: -cos(p[1], p[2])):
    print(f"  {name}: {cos(u, v):.3f}")

print("Euclidean distance:")
for name, u, v in sorted(pairs, key=lambda p: euc(p[1], p[2])):
    print(f"  {name}: {euc(u, v):.3f}")

Output:

Cosine similarity:
  a,b: 1.000        # same direction
  a,c: 0.000        # orthogonal
  b,c: 0.000

Euclidean distance:
  a,b: 1.000        # different magnitudes
  a,c: 1.414
  b,c: 2.236

Why different: cosine ignores magnitude. a and b point the same way (cos = 1) yet are 1 unit apart in Euclidean space, while a,c and b,c tie on cosine (both orthogonal) but differ in Euclidean distance because b is longer. For embeddings where magnitudes are arbitrary, cosine is usually the right call.
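
One way to see the connection (a sketch reusing pairs, cos, and euc from above): after normalizing to unit length, squared Euclidean distance equals 2 − 2·cos θ, so the two measures rank identically on unit vectors.

def normalize(v):
    return v / np.linalg.norm(v)

for name, u, v in pairs:
    d2 = euc(normalize(u), normalize(v)) ** 2
    print(f"  {name}: squared unit-vector distance {d2:.3f} vs 2 - 2cosθ = {2 - 2 * cos(u, v):.3f}")
# a,b: 0.000 vs 0.000    a,c: 2.000 vs 2.000    b,c: 2.000 vs 2.000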

Probability & statistics

Bayes — disease test

Test: 99% sensitivity, 95% specificity. Disease prevalence: 1%. You test positive. What’s P(disease | +)?

# P(+|D) = 0.99   sensitivity
# P(-|~D) = 0.95  specificity → P(+|~D) = 0.05
# P(D) = 0.01

p_pos_given_d = 0.99
p_pos_given_not_d = 0.05
p_d = 0.01
p_not_d = 1 - p_d

# Bayes:
# P(D|+) = P(+|D) P(D) / P(+)
# P(+) = P(+|D)P(D) + P(+|~D)P(~D)

p_pos = p_pos_given_d * p_d + p_pos_given_not_d * p_not_d
p_d_given_pos = p_pos_given_d * p_d / p_pos

print(f"P(disease | positive) = {p_d_given_pos:.4f}")
# 0.1667 — roughly 17%

Despite a “99% accurate” test, you have a 17% chance of having the disease. Base rates dominate. This is why doctors should think probabilistically and why screening healthy populations needs careful interpretation.
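
To see how strongly the base rate drives the answer, a small sweep over prevalence (same test numbers as above; the specific prevalences are illustrative):

for prev in [0.001, 0.01, 0.1, 0.5]:
    p_pos = p_pos_given_d * prev + p_pos_given_not_d * (1 - prev)
    print(f"prevalence {prev:>5}: P(D|+) = {p_pos_given_d * prev / p_pos:.3f}")
# 0.001 → 0.019,  0.01 → 0.167,  0.1 → 0.688,  0.5 → 0.952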

Stable softmax

Implement softmax. Then write the stable version. Verify they give the same answer for normal inputs and that only the stable one survives [1000, 1001, 1002].

import numpy as np

def softmax_naive(x):
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    x = np.asarray(x, dtype=np.float64)
    x = x - x.max()              # shift; doesn't change result
    e = np.exp(x)
    return e / e.sum()

# Normal input — both work
print(softmax_naive([1, 2, 3]))   # [0.0900, 0.2447, 0.6652]
print(softmax_stable([1, 2, 3]))  # same

# Big numbers
print(softmax_naive([1000, 1001, 1002]))   # nan nan nan
print(softmax_stable([1000, 1001, 1002]))  # [0.0900, 0.2447, 0.6652]

Why subtracting the max works: softmax(x)_i = e^{x_i} / Σ e^{x_j} = e^{x_i - C} / Σ e^{x_j - C} for any constant C. Pick C = max(x) so the largest exponent is 0 (no overflow).
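
A quick numerical check of that identity (reusing softmax_naive; any constant C works, not just the max, as long as nothing overflows):

v = np.array([1.0, 2.0, 3.0])
for C in [0.0, 3.0, -100.0]:
    print(softmax_naive(v - C))   # identical output for every C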

Cross-entropy by hand

True label = class 2 (one-hot [0, 0, 1, 0]). Predictions: [0.1, 0.2, 0.6, 0.1], then [0.0, 0.0, 1.0, 0.0], then [0.25, 0.25, 0.25, 0.25]. Order them.

import numpy as np

def ce(p_pred, true_class):
    # p_pred: probability vector. true_class: index.
    return -np.log(p_pred[true_class] + 1e-12)

print(ce([0.1, 0.2, 0.6, 0.1], 2))   # 0.5108
print(ce([0.0, 0.0, 1.0, 0.0], 2))   # ~0  (perfect)
print(ce([0.25, 0.25, 0.25, 0.25], 2))  # 1.3863

# Order (best to worst): perfect (0) < 0.6-confident (0.51) < uniform (1.39)

The perfect predictor gets ~0 loss. The uniform predictor gets -log(1/4) = log 4 ≈ 1.39. A confident-correct prediction sits in between.
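
Equivalently, with an explicit one-hot target the same loss is a dot product with the log-probabilities (a minimal sketch alongside ce above):

def ce_onehot(p_pred, onehot):
    p_pred = np.asarray(p_pred, dtype=float)
    onehot = np.asarray(onehot, dtype=float)
    # only the true class's log-prob survives the sum
    return -np.sum(onehot * np.log(p_pred + 1e-12))

print(ce_onehot([0.1, 0.2, 0.6, 0.1], [0, 0, 1, 0]))   # 0.5108 — same as ce(..., 2)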

Calculus & optimization

Gradient descent on a parabola

Minimize f(x) = (x − 3)² starting at x = 0 with η = 0.1. Plot x and f(x) over 50 steps.

import numpy as np
import matplotlib.pyplot as plt

def f(x): return (x - 3) ** 2
def df(x): return 2 * (x - 3)

x = 0.0
lr = 0.1

xs, losses = [x], [f(x)]
for _ in range(50):
    x = x - lr * df(x)
    xs.append(x)
    losses.append(f(x))

print(f"final x = {x:.4f} (target 3)")
print(f"final loss = {f(x):.6f}")

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
axes[0].plot(xs); axes[0].set_title("x over time"); axes[0].axhline(3, ls="--", c="grey")
axes[1].plot(losses); axes[1].set_title("loss over time")
plt.tight_layout(); plt.savefig("gd.png")

After 50 steps, x ≈ 2.99996 — each step shrinks the error (x − 3) by a factor of 0.8, so the gradient gets smaller as we approach the minimum and convergence slows. With lr = 0.5 this parabola converges in a single step; with lr = 1.05 it diverges. Try both.
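
A minimal sketch of that "try both" suggestion, reusing df from above:

for lr_try in [0.5, 1.05]:
    x_try = 0.0
    for _ in range(10):
        x_try = x_try - lr_try * df(x_try)
    print(f"lr = {lr_try}: x after 10 steps = {x_try:.4f}")
# lr = 0.5 lands on 3.0 in one step (the exact Newton step for this quadratic);
# lr = 1.05 overshoots by a growing amount each step and moves away from 3.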

Manual backprop on a 2-layer net

For y = w₂ · σ(w₁ · x + b₁) + b₂ (σ = sigmoid), derive ∂L/∂w₁ for L = (y − target)².

By the chain rule:

L = (y − target)²

∂L/∂y = 2(y − target)
∂y/∂h₁ = w₂                    where h₁ = σ(w₁ x + b₁)
∂h₁/∂z₁ = σ(z₁)·(1 − σ(z₁))    where z₁ = w₁ x + b₁
∂z₁/∂w₁ = x

∂L/∂w₁ = 2(y − target) · w₂ · σ(z₁)·(1 − σ(z₁)) · x

Verify against PyTorch:

import torch

torch.manual_seed(0)
x = torch.tensor(2.0)
target = torch.tensor(1.5)
w1 = torch.tensor(0.7, requires_grad=True)
b1 = torch.tensor(0.1, requires_grad=True)
w2 = torch.tensor(-0.3, requires_grad=True)
b2 = torch.tensor(0.2, requires_grad=True)

z1 = w1 * x + b1
h1 = torch.sigmoid(z1)
y = w2 * h1 + b2
loss = (y - target) ** 2
loss.backward()

# manual
sig = torch.sigmoid(z1).item()
manual = 2 * (y.item() - target.item()) * w2.item() * sig * (1 - sig) * x.item()

print(f"PyTorch: {w1.grad.item():.6f}")
print(f"Manual:  {manual:.6f}")
# Both ≈ 0.2766
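
An additional check outside the exercise: a central finite-difference estimate of ∂L/∂w₁, with the same constants hard-coded, should agree with both numbers above.

import math

def loss_at(w1_val):
    z1 = w1_val * 2.0 + 0.1            # x = 2, b1 = 0.1
    h1 = 1.0 / (1.0 + math.exp(-z1))   # sigmoid
    y = -0.3 * h1 + 0.2                # w2 = -0.3, b2 = 0.2
    return (y - 1.5) ** 2              # target = 1.5

eps = 1e-5
fd = (loss_at(0.7 + eps) - loss_at(0.7 - eps)) / (2 * eps)
print(f"Finite diff: {fd:.6f}")   # matches the analytic and autograd values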

Information theory

Entropy curve

Plot H(p, 1−p) for p ∈ [0, 1]. Verify it peaks at 0.5.

import numpy as np
import matplotlib.pyplot as plt

def H(p):
    p = np.asarray(p)
    out = np.zeros_like(p, dtype=float)
    mask = (p > 0) & (p < 1)
    out[mask] = -p[mask] * np.log(p[mask]) - (1-p[mask]) * np.log(1-p[mask])
    return out

ps = np.linspace(0, 1, 200)
hs = H(ps)

print(f"max entropy: {hs.max():.4f} at p = {ps[hs.argmax()]:.2f}")
# max ≈ ln 2 ≈ 0.693 at p = 0.5

plt.plot(ps, hs); plt.xlabel("p"); plt.ylabel("H(p, 1-p)"); plt.savefig("entropy.png")

Peak at p = 0.5, value ln 2 ≈ 0.693 nats (exactly 1 bit). A fair coin is the maximum-entropy 2-outcome distribution.
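
In bits (log base 2), the same curve peaks at exactly 1; a quick conversion reusing H:

def H_bits(p):
    return H(p) / np.log(2)   # nats → bits

print(f"H in bits at p = 0.5: {H_bits(np.array([0.5]))[0]:.4f}")   # 1.0000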

KL divergence

Compute KL(P||Q) and KL(Q||P) for P = [0.5, 0.5] and Q = [0.9, 0.1]. Confirm asymmetry.

import numpy as np

def kl(p, q):
    p = np.asarray(p); q = np.asarray(q)
    return np.sum(p * np.log(p / q + 1e-12))

P = np.array([0.5, 0.5])
Q = np.array([0.9, 0.1])

print(f"KL(P||Q) = {kl(P, Q):.4f}")   # 0.5108
print(f"KL(Q||P) = {kl(Q, P):.4f}")   # 0.3680

Asymmetric. KL(P||Q) is larger here because the expectation is taken under P: P puts half its mass on the outcome Q considers unlikely (0.1), and that mismatch is penalized heavily.
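
The asymmetry gets extreme when Q puts almost no mass where P does. An illustration with the same kl helper (Q2 is a made-up distribution, not from the exercise):

Q2 = np.array([0.999, 0.001])
print(f"KL(P||Q2) = {kl(P, Q2):.4f}")   # ~2.76 — half of P's mass lands where Q2 has almost none
print(f"KL(Q2||P) = {kl(Q2, P):.4f}")   # ~0.69 — much milder in the other direction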

Perplexity sanity

A model assigns 1/V to every token (uniform). What’s its perplexity?

# PPL = exp(- (1/N) Σ log P(w_t))
# Uniform: P(w_t) = 1/V for all t
# log P = -log V
# Average: -log V
# PPL = exp(log V) = V

# So uniform model has PPL = vocabulary size.

A uniform 50k-vocab model has PPL = 50,000. A trained 7B LLM on Wikipedia has PPL ~5–15 — its effective branching factor at each step is tiny compared to the vocabulary. That gap is “compression” / “knowledge.”
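
A quick numeric check on a toy setup (the vocabulary sizes and token count here are arbitrary examples):

import numpy as np

N = 100                                       # number of tokens scored
for V in [4, 1000, 50_000]:
    log_probs = np.full(N, np.log(1.0 / V))   # uniform model: every token gets 1/V
    ppl = np.exp(-log_probs.mean())
    print(f"V = {V:>6}: perplexity = {ppl:,.1f}")
# V = 4 → 4.0,  V = 1000 → 1,000.0,  V = 50000 → 50,000.0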

See also