Linear Algebra for ML
Linear algebra is the language ML is written in. Every model is a stack of matrix operations; every embedding is a vector; every attention computation is softmax(QKᵀ/√d)V. Fluency here pays for itself daily.
Why ML uses linear algebra
Three reasons:
- GPUs do matrix multiplication fast. Anything you can express as a matmul, the hardware will fly through. So the field structures models around matmul-friendly ops.
- Vectors capture meaning geometrically. Similar things end up close in vector space. Distance becomes similarity. Direction becomes attribute.
- It composes. Stack matrices, get a deeper transformation. Differentiate matrices, get a gradient. The whole stack is linear-algebra-shaped.
Vectors
A vector is an ordered list of numbers. Geometrically, a point or arrow in n-dimensional space.
import numpy as np
v = np.array([1.0, 2.0, 3.0]) # 3-dimensional vector
In ML, vectors represent:
- A single data point ([height, weight, age])
- An embedding ([0.13, -0.27, ..., 0.04] of length 768 or 1536 or 4096)
- Model parameters (weights of one neuron)
- A token’s representation at one layer of a transformer
Operations
Addition — element-wise, same shape:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
a + b # [5, 7, 9]
Scalar multiplication — scales all elements:
2 * a # [2, 4, 6]
Dot product — sum of element-wise products:
np.dot(a, b) # 1*4 + 2*5 + 3*6 = 32
a @ b # same thing, modern syntax
The dot product has two interpretations, both essential:
- Algebraic: a · b = Σ aᵢbᵢ
- Geometric: a · b = |a| |b| cos θ, where θ is the angle between them
The geometric interpretation is why dot product = similarity: vectors pointing the same way have a high dot product; orthogonal vectors have zero.
Norm (length): |v| = √(v · v). The L2 norm. Other norms exist (L1 = sum of absolute values; L∞ = max element); L2 is by far the most common.
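A quick check in NumPy (values chosen for illustration):
v = np.array([3.0, 4.0])
np.linalg.norm(v)               # L2 norm: 5.0
np.linalg.norm(v, ord=1)        # L1 norm: 7.0
np.linalg.norm(v, ord=np.inf)   # L∞ norm: 4.0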
Cosine similarity: dot product, normalized by lengths.
def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
Range: [−1, 1]. Used everywhere in retrieval, RAG, and embedding-based search.
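As a usage sketch, ranking a few made-up document embeddings against a query by cosine similarity (the vectors below are invented purely for illustration):
query = np.array([0.2, 0.1, 0.9])
docs = np.array([[0.1, 0.0, 1.0],    # hypothetical document embeddings
                 [0.9, 0.2, 0.1],
                 [0.0, 0.5, 0.5]])
sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
ranking = np.argsort(-sims)          # document indices, most similar first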
Matrices
A matrix is a 2D array. Shape (m, n) = m rows, n columns.
M = np.array([[1, 2, 3],
[4, 5, 6]]) # shape (2, 3)
In ML, matrices represent:
- A batch of vectors (rows = examples, columns = features)
- A linear transformation (a function from one vector space to another)
- A weight matrix in a neural network layer
Matrix-vector product
Mv applies the transformation M to the vector v. If M is (m, n) and v is (n,), the result is (m,).
Mechanically: each row of M dot-products with v to produce one element of the output.
M = np.array([[1, 2], [3, 4], [5, 6]]) # (3, 2)
v = np.array([1, 1]) # (2,)
M @ v # [3, 7, 11], shape (3,)
This is the linear part of one fully-connected layer in a neural network: output = W @ input + b (a nonlinearity is usually applied on top).
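A minimal sketch of that layer, with shapes and values invented for illustration:
W = np.random.randn(3, 2)   # weight matrix: 2 input features to 3 output features
b = np.zeros(3)             # bias vector
x = np.array([1.0, -1.0])   # one input example
output = W @ x + b          # shape (3,)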
Matrix-matrix product
AB where A is (m, k) and B is (k, n) produces (m, n). The inner dimensions must match.
Each entry (AB)ᵢⱼ is the dot product of row i of A with column j of B.
A = np.random.randn(64, 128)
B = np.random.randn(128, 256)
C = A @ B # (64, 256)
In a transformer, this is how a batch of token embeddings (64 tokens × 128 features) gets transformed by a weight matrix into a different feature space.
Pitfall: matrix multiplication is not commutative. AB ≠ BA in general (and may not even be defined in both orders).
Transpose
Aᵀ swaps rows and columns. Shape (m, n) becomes (n, m).
A.T
Used everywhere — most importantly in softmax(QKᵀ/√d)V in attention.
Special matrices
- Identity I: ones on the diagonal, zeros elsewhere. IA = AI = A.
- Diagonal: nonzero only on the diagonal. Easy to invert.
- Symmetric: A = Aᵀ. Has real eigenvalues.
- Orthogonal: AᵀA = I. Preserves lengths and angles. Rotation/reflection (checked numerically below).
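A quick numerical check of these properties; the 2×2 rotation matrix below is a standard example of an orthogonal matrix, with the angle chosen arbitrarily:
I = np.eye(3)
A = np.random.randn(3, 3)
np.allclose(I @ A, A) and np.allclose(A @ I, A)   # identity leaves A unchanged

theta = 0.7                                        # arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # 2D rotation, hence orthogonal
np.allclose(Q.T @ Q, np.eye(2))                    # QᵀQ = I
v = np.array([1.0, 2.0])
np.isclose(np.linalg.norm(Q @ v), np.linalg.norm(v))  # lengths preserved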
Inverse
A⁻¹ undoes A: A A⁻¹ = I. Only square matrices can have an inverse, and only if they’re non-singular (determinant ≠ 0).
In practice, don’t compute matrix inverses in ML code. Solve linear systems with np.linalg.solve(A, b) instead — faster and more numerically stable.
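For example (a small system with values invented for illustration):
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)   # solves Ax = b without ever forming A⁻¹
np.allclose(A @ x, b)       # True; here x = [2, 3]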
Eigenvectors and eigenvalues
For a square matrix A, an eigenvector v satisfies Av = λv for some scalar λ (the eigenvalue). The matrix only stretches v; it doesn’t change its direction.
Why care? Eigendecomposition reveals the principal axes of a transformation. PCA, spectral clustering, and parts of optimization theory all rest on it.
eigvals, eigvecs = np.linalg.eig(A)   # A: any square matrix; columns of eigvecs are the eigenvectors
For symmetric matrices (like covariance matrices), the eigenvectors are orthogonal and the eigenvalues are real — both very nice properties.
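A small check with a symmetric matrix (values invented for illustration):
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                   # symmetric
eigvals, eigvecs = np.linalg.eig(A)          # eigenvalues 3 and 1, in some order
v, lam = eigvecs[:, 0], eigvals[0]           # first eigenvector/eigenvalue pair
np.allclose(A @ v, lam * v)                  # Av = λv
np.allclose(eigvecs.T @ eigvecs, np.eye(2))  # eigenvectors are orthonormal
In practice, np.linalg.eigh is the better routine for symmetric matrices: it guarantees real eigenvalues and returns them sorted.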
Singular value decomposition (SVD)
A = U Σ Vᵀ for any matrix. The most general decomposition there is. Used in:
- PCA (the singular values of the centered data matrix measure the spread along the principal axes, proportional to the standard deviations)
- LoRA fine-tuning (weight updates constrained to a low-rank product, motivated by this low-rank structure)
- Embedding compression
- Numerical-stability tricks in optimization
U, S, Vt = np.linalg.svd(A)
The singular values in S are non-negative and sorted in decreasing order. Keep the top k → get the best rank-k approximation of A. This low-rank idea is the heart of LoRA (Stage 10).
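A sketch of that truncation, with sizes invented for illustration:
A = np.random.randn(64, 32)
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 8
A_k = (U[:, :k] * S[:k]) @ Vt[:k, :]   # best rank-k approximation of A
np.linalg.norm(A - A_k)                # approximation error (Frobenius norm)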
Broadcasting
NumPy and PyTorch silently expand smaller arrays to match larger ones when shapes are compatible. This is broadcasting — it makes ML code concise but is a common source of bugs.
Rules: align shapes from the right. Each pair of dimensions must be equal, or one must be 1.
M = np.ones((3, 4))
v = np.array([1, 2, 3, 4]) # shape (4,)
M + v # works: v is broadcast across 3 rows
Pitfall: broadcasting + an unintended dimension can silently produce wrong results without an error. Always print .shape.
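A sketch of how this goes wrong (shapes invented for illustration): adding a column vector to a 1-D vector quietly produces a matrix.
a = np.zeros((3, 1))    # column vector with a stray trailing dimension
b = np.ones(3)          # plain 1-D vector, shape (3,)
(a + b).shape           # (3, 3): an outer "sum", probably not what was intended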
Tensors
A tensor is the generalization: 0-D = scalar, 1-D = vector, 2-D = matrix, 3-D and beyond = “tensor” colloquially.
In ML, common tensor shapes:
- (batch, features) — classical ML
- (batch, channels, height, width) — vision
- (batch, sequence_length, hidden_size) — language models
- (batch, heads, sequence_length, head_dim) — multi-head attention
PyTorch and JAX are tensor-first frameworks; NumPy works fine for learning.
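For example, a language-model activation tensor (sizes invented for illustration):
x = np.zeros((32, 128, 768))   # (batch, sequence_length, hidden_size)
x.shape                        # (32, 128, 768)
x[0].shape                     # (128, 768): one sequence of token vectors
x[0, 0].shape                  # (768,): one token's embedding, back to a vector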
Why this matters for transformers
Self-attention, the central operation in modern AI, is:
attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Every symbol there is from this chapter:
- Q, K, V are matrices of shape (seq_len, d_k)
- QKᵀ is a matrix multiplication producing a (seq_len, seq_len) similarity matrix
- √d_k is a scalar normalizing for stable gradients
- softmax turns each row into a probability distribution
- The final matmul with V produces the new representations
If QKᵀ and matrix multiplication feel solid, attention is just a recipe.
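A minimal sketch of that recipe in NumPy, assuming a single head, no masking, and no batching, with shapes invented for illustration:
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len) similarity matrix
    return softmax(scores, axis=-1) @ V       # (seq_len, d_k) new representations

seq_len, d_k = 5, 8
Q, K, V = [np.random.randn(seq_len, d_k) for _ in range(3)]
attention(Q, K, V).shape                      # (5, 8)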
Exercises
- Verify dot product = projection. Take a = [3, 4] and b = [1, 0]. Compute a·b. Then compute |a| |b| cos θ, where θ is the angle between them. Match.
- Implement matmul. Write matmul(A, B) from scratch in pure Python (no NumPy). Check against A @ B.
- PCA from SVD. Take 1000 random 2D points stretched along an axis. SVD them. The first right singular vector should align with that axis.
- Cosine vs Euclidean. Generate three vectors. Rank pairs by cosine similarity and by Euclidean distance. Are the rankings the same? Why not?
See also
- Calculus & optimization — gradients build on matrix-valued derivatives
- Stage 06 — Self-attention — where this all pays off
- Stage 10 — LoRA — low-rank decomposition in the wild