# Stage 01 — Math Foundations: Solutions

Worked solutions for Stage 1 exercises.

Dependencies: numpy, matplotlib (plus torch for the backprop check).
## Linear algebra

### Verify dot product = projection
For `a = [3, 4]` and `b = [1, 0]`, compute `a · b`. Then compute `|a| |b| cos θ`, where θ is the angle between them.
```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

# Algebraic: dot product
dot = a @ b
print("a·b =", dot)  # 3.0

# Geometric: |a| |b| cos θ
mag_a = np.linalg.norm(a)              # √(9+16) = 5
mag_b = np.linalg.norm(b)              # 1
cos_theta = (a @ b) / (mag_a * mag_b)  # 0.6
geometric = mag_a * mag_b * cos_theta
print("|a||b|cosθ =", geometric)  # 3.0
```
Both equal 3. The geometric reading: b is a unit vector along x, so a·b is the x-component of a, which is 3.
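The identity is not special to an axis-aligned b. A quick sketch with a hypothetical unit vector at 30°, measuring the angle between the vectors independently via `arctan2` rather than deriving it from the dot product:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([np.cos(np.deg2rad(30)), np.sin(np.deg2rad(30))])  # unit vector at 30°

# Angle between a and b, measured without the dot product
theta = np.arctan2(a[1], a[0]) - np.deg2rad(30)

algebraic = a @ b
geometric = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)
print(algebraic, geometric)  # both ≈ 4.598
assert np.isclose(algebraic, geometric)
```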
### Implement matmul
Write `matmul(A, B)` from scratch in pure Python. Check against `A @ B`.
```python
import numpy as np

def matmul(A, B):
    rows_a, cols_a = len(A), len(A[0])
    rows_b, cols_b = len(B), len(B[0])
    assert cols_a == rows_b, "inner dims must match"
    out = [[0.0] * cols_b for _ in range(rows_a)]
    for i in range(rows_a):
        for j in range(cols_b):
            s = 0.0
            for k in range(cols_a):
                s += A[i][k] * B[k][j]
            out[i][j] = s
    return out

A = [[1, 2, 3], [4, 5, 6]]       # 2x3
B = [[7, 8], [9, 10], [11, 12]]  # 3x2
mine = matmul(A, B)
ref = (np.array(A) @ np.array(B)).tolist()
print(mine)  # [[58.0, 64.0], [139.0, 154.0]]
print(ref)   # [[58, 64], [139, 154]]
assert np.allclose(mine, ref)
```
### Cosine vs Euclidean
Generate three vectors. Rank pairs by cosine similarity and by Euclidean distance. Are the rankings the same? Why not?
```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction as a, longer
c = np.array([0.0, 1.0])  # orthogonal to a

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euc(u, v):
    return np.linalg.norm(u - v)

pairs = [("a,b", a, b), ("a,c", a, c), ("b,c", b, c)]

print("Cosine similarity:")
for name, u, v in sorted(pairs, key=lambda p: -cos(p[1], p[2])):
    print(f"  {name}: {cos(u, v):.3f}")

print("Euclidean distance:")
for name, u, v in sorted(pairs, key=lambda p: euc(p[1], p[2])):
    print(f"  {name}: {euc(u, v):.3f}")
```
Output:

```text
Cosine similarity:
  a,b: 1.000   # same direction
  a,c: 0.000   # orthogonal
  b,c: 0.000
Euclidean distance:
  a,b: 1.000   # different magnitudes
  a,c: 1.414
  b,c: 2.236
```
Why different: cosine ignores magnitude. a and b point the same way (cos = 1) but are 1 unit apart in Euclidean space. For embeddings where magnitudes are arbitrary, cosine is the right call.
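The two measures reconnect once you unit-normalize: for unit vectors, squared Euclidean distance is `2 − 2·cos`. A sketch verifying the identity on arbitrary random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)

# Unit-normalize both vectors
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

lhs = np.linalg.norm(u_hat - v_hat) ** 2
rhs = 2 - 2 * (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(lhs, rhs)
assert np.isclose(lhs, rhs)
```

So ranking by Euclidean distance on normalized embeddings gives the same order as ranking by cosine similarity.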
## Probability & statistics

### Bayes — disease test
Test: 99% sensitivity, 95% specificity. Disease prevalence: 1%. You test positive. What's `P(disease | +)`?
```python
# P(+|D)  = 0.99  (sensitivity)
# P(-|~D) = 0.95  (specificity) → P(+|~D) = 0.05
# P(D)    = 0.01  (prevalence)
p_pos_given_d = 0.99
p_pos_given_not_d = 0.05
p_d = 0.01
p_not_d = 1 - p_d

# Bayes:
#   P(D|+) = P(+|D) P(D) / P(+)
#   P(+)   = P(+|D) P(D) + P(+|~D) P(~D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * p_not_d
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(f"P(disease | positive) = {p_d_given_pos:.4f}")
# 0.1667 — roughly 17%
```
Despite a “99% accurate” test, you have a 17% chance of having the disease. Base rates dominate. This is why doctors should think probabilistically and why screening healthy populations needs careful interpretation.
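The analytic answer is easy to confirm by simulation. A minimal Monte Carlo sketch, assuming one million simulated patients and independent test errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
has_disease = rng.random(n) < 0.01  # 1% prevalence

# Diseased patients test positive with prob 0.99 (sensitivity);
# healthy patients with prob 0.05 (1 - specificity).
tests_positive = np.where(
    has_disease,
    rng.random(n) < 0.99,
    rng.random(n) < 0.05,
)

# Among those who tested positive, what fraction actually has the disease?
p_sim = has_disease[tests_positive].mean()
print(f"simulated P(disease | +) ≈ {p_sim:.4f}")  # close to 0.1667
```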
### Stable softmax
Implement softmax. Then write the stable version. Verify they give the same answer for normal inputs, and that only the stable one survives `[1000, 1001, 1002]`.
```python
import numpy as np

def softmax_naive(x):
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    x = np.asarray(x, dtype=np.float64)
    x = x - x.max()  # shift; doesn't change the result
    e = np.exp(x)
    return e / e.sum()

# Normal input — both work
print(softmax_naive([1, 2, 3]))   # [0.0900, 0.2447, 0.6652]
print(softmax_stable([1, 2, 3]))  # same

# Big numbers
print(softmax_naive([1000, 1001, 1002]))   # [nan nan nan] (overflow)
print(softmax_stable([1000, 1001, 1002]))  # [0.0900, 0.2447, 0.6652]
```
Why subtracting the max works: softmax(x)_i = e^{x_i} / Σ e^{x_j} = e^{x_i - C} / Σ e^{x_j - C} for any constant C. Pick C = max(x) so the largest exponent is 0 (no overflow).
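The same shift gives a stable log-softmax, which is what cross-entropy actually needs: log softmax(x)_i = x_i − logsumexp(x), computed without exponentiating large numbers. A sketch:

```python
import numpy as np

def log_softmax(x):
    x = np.asarray(x, dtype=np.float64)
    shifted = x - x.max()  # largest entry becomes 0, so exp never overflows
    return shifted - np.log(np.exp(shifted).sum())

logp = log_softmax([1000, 1001, 1002])
print(logp)  # ≈ [-2.4076, -1.4076, -0.4076], all finite
assert np.isfinite(logp).all()
assert np.isclose(np.exp(logp).sum(), 1.0)
```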
### Cross-entropy by hand

True label = class 2 (one-hot `[0, 0, 1, 0]`). Predictions: `[0.1, 0.2, 0.6, 0.1]`, then `[0.0, 0.0, 1.0, 0.0]`, then `[0.25, 0.25, 0.25, 0.25]`. Order them.
```python
import numpy as np

def ce(p_pred, true_class):
    # p_pred: probability vector. true_class: index of the true label.
    return -np.log(p_pred[true_class] + 1e-12)

print(ce([0.1, 0.2, 0.6, 0.1], 2))      # 0.5108
print(ce([0.0, 0.0, 1.0, 0.0], 2))      # ~0 (perfect)
print(ce([0.25, 0.25, 0.25, 0.25], 2))  # 1.3863

# Order (best to worst): perfect (0) < 0.6-confident (0.51) < uniform (1.39)
```
The perfect predictor gets ~0 loss. The uniform predictor gets -log(1/4) = log 4 ≈ 1.39. A confident-correct prediction sits in between.
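The other tail is worth seeing too: cross-entropy punishes confident-and-wrong far harder than unsure. A sketch with the same `ce` helper and a hypothetical prediction that puts 0.9 on the wrong class:

```python
import numpy as np

def ce(p_pred, true_class):
    return -np.log(p_pred[true_class] + 1e-12)

# 0.9 on class 0 (wrong), only 0.01 on the true class 2 (hypothetical numbers)
loss_confident_wrong = ce([0.90, 0.05, 0.01, 0.04], 2)
print(loss_confident_wrong)  # -ln(0.01) ≈ 4.61, much worse than uniform's 1.39
```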
## Calculus & optimization

### Gradient descent on a parabola
Minimize `f(x) = (x − 3)²` starting at `x = 0` with `η = 0.1`. Plot `x` and `f(x)` over 50 steps.
```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return (x - 3) ** 2

def df(x):
    return 2 * (x - 3)

x = 0.0
lr = 0.1
xs, losses = [x], [f(x)]
for _ in range(50):
    x = x - lr * df(x)
    xs.append(x)
    losses.append(f(x))

print(f"final x = {x:.4f} (target 3)")
print(f"final loss = {f(x):.6f}")

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
axes[0].plot(xs); axes[0].set_title("x over time"); axes[0].axhline(3, ls="--", c="grey")
axes[1].plot(losses); axes[1].set_title("loss over time")
plt.tight_layout(); plt.savefig("gd.png")
```
After 50 steps, x ≈ 3 to four decimal places. The gradient shrinks as we approach the minimum, so convergence slows. Each update multiplies the error (x − 3) by (1 − 2·lr): with lr = 0.5 that factor is 0, so it converges in a single step; with lr = 1.05 the factor is −1.1, so it diverges. Try both.
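The learning-rate behavior can be checked numerically. Since the update is x ← 3 + (1 − 2·lr)(x − 3), the error scales by |1 − 2·lr| each step; a sketch:

```python
def run(lr, steps=200):
    # Plain gradient descent on f(x) = (x - 3)^2 from x = 0
    x = 0.0
    for _ in range(steps):
        x = x - lr * 2 * (x - 3)
    return x

print(run(0.1))   # ≈ 3 (error factor 0.8 per step)
print(run(0.5))   # 3.0 exactly, reached in one step (factor 0)
print(run(1.05))  # far from 3: factor -1.1 per step, diverges
```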
### Manual backprop on a 2-layer net

For `y = w₂ · σ(w₁ · x + b₁) + b₂` (σ = sigmoid), derive `∂L/∂w₁` for `L = (y − target)²`.
By the chain rule, with z₁ = w₁ x + b₁ and h₁ = σ(z₁):

```text
L       = (y − target)²
∂L/∂y   = 2(y − target)
∂y/∂h₁  = w₂
∂h₁/∂z₁ = σ(z₁)·(1 − σ(z₁))
∂z₁/∂w₁ = x

∂L/∂w₁  = 2(y − target) · w₂ · σ(z₁)·(1 − σ(z₁)) · x
```
Verify against PyTorch:
```python
import torch

x = torch.tensor(2.0)
target = torch.tensor(1.5)
w1 = torch.tensor(0.7, requires_grad=True)
b1 = torch.tensor(0.1, requires_grad=True)
w2 = torch.tensor(-0.3, requires_grad=True)
b2 = torch.tensor(0.2, requires_grad=True)

z1 = w1 * x + b1
h1 = torch.sigmoid(z1)
y = w2 * h1 + b2
loss = (y - target) ** 2
loss.backward()

# Manual gradient from the derivation above
sig = torch.sigmoid(z1).item()
manual = 2 * (y.item() - target.item()) * w2.item() * sig * (1 - sig) * x.item()
print(f"PyTorch: {w1.grad.item():.6f}")
print(f"Manual:  {manual:.6f}")
# Both ≈ 0.276566
```
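A derivation-free cross-check: central finite differences on the loss as a function of w₁ alone, with the other constants fixed at the values above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_at(w1, x=2.0, b1=0.1, w2=-0.3, b2=0.2, target=1.5):
    y = w2 * sigmoid(w1 * x + b1) + b2
    return (y - target) ** 2

# Central difference: (L(w1 + eps) - L(w1 - eps)) / (2 eps)
eps = 1e-5
fd = (loss_at(0.7 + eps) - loss_at(0.7 - eps)) / (2 * eps)
print(f"finite difference: {fd:.6f}")  # ≈ 0.276566, matching the chain rule
```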
## Information theory

### Entropy curve

Plot `H(p, 1−p)` for `p ∈ [0, 1]`. Verify it peaks at 0.5.
```python
import numpy as np
import matplotlib.pyplot as plt

def H(p):
    # Binary entropy in nats; defined as 0 at p = 0 and p = 1.
    p = np.asarray(p)
    out = np.zeros_like(p, dtype=float)
    mask = (p > 0) & (p < 1)
    out[mask] = -p[mask] * np.log(p[mask]) - (1 - p[mask]) * np.log(1 - p[mask])
    return out

ps = np.linspace(0, 1, 200)
hs = H(ps)
print(f"max entropy: {hs.max():.4f} at p = {ps[hs.argmax()]:.2f}")
# max ≈ ln 2 ≈ 0.693 at p = 0.5

plt.plot(ps, hs); plt.xlabel("p"); plt.ylabel("H(p, 1-p)"); plt.savefig("entropy.png")
```
Peak at p = 0.5 with value ln 2 ≈ 0.693 nats (exactly 1 bit). A fair coin is the maximum-entropy 2-outcome distribution.
### KL divergence

Compute `KL(P||Q)` and `KL(Q||P)` for `P = [0.5, 0.5]` and `Q = [0.9, 0.1]`. Confirm asymmetry.
```python
import numpy as np

def kl(p, q):
    p = np.asarray(p); q = np.asarray(q)
    return np.sum(p * np.log(p / (q + 1e-12)))  # epsilon guards against q = 0

P = np.array([0.5, 0.5])
Q = np.array([0.9, 0.1])
print(f"KL(P||Q) = {kl(P, Q):.4f}")  # 0.5108
print(f"KL(Q||P) = {kl(Q, P):.4f}")  # 0.3680
```
Asymmetric, as expected. KL(P||Q) is larger here: it heavily penalizes Q for assigning only 0.1 of its mass where P puts 0.5.
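KL also ties back to the cross-entropy exercise: KL(P||Q) = H(P, Q) − H(P), cross-entropy minus entropy. A check with the same P and Q:

```python
import numpy as np

P = np.array([0.5, 0.5])
Q = np.array([0.9, 0.1])

cross_entropy = -np.sum(P * np.log(Q))  # H(P, Q)
entropy = -np.sum(P * np.log(P))        # H(P) = ln 2
kl_pq = np.sum(P * np.log(P / Q))

print(cross_entropy - entropy)  # 0.5108, matching KL(P||Q)
assert np.isclose(kl_pq, cross_entropy - entropy)
```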
### Perplexity sanity
A model assigns 1/V to every token (uniform). What’s its perplexity?
```python
# PPL = exp(-(1/N) Σ log P(w_t))
# Uniform model: P(w_t) = 1/V for every token
# → log P(w_t) = -log V
# → average negative log-prob = log V
# → PPL = exp(log V) = V
# So a uniform model's perplexity equals its vocabulary size.
```
A uniform 50k-vocab model has PPL = 50,000. A trained 7B LLM on Wikipedia has PPL ~5–15 — its effective branching factor at each step is tiny compared to the vocabulary. That gap is “compression” / “knowledge.”
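The derivation checks out numerically; a sketch with a hypothetical 1,000-token evaluation at V = 50,000:

```python
import numpy as np

V = 50_000
n_tokens = 1_000

# Uniform model: every token gets log-prob -log V
logprobs = np.full(n_tokens, -np.log(V))

# Perplexity = exp of the average negative log-prob
ppl = np.exp(-logprobs.mean())
print(f"perplexity = {ppl:.1f}")  # 50000.0, i.e. the vocabulary size
assert np.isclose(ppl, V)
```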