Linear Algebra for ML
Linear algebra is the language ML is written in. Every model is a stack of matrix operations; every embedding is a vector; every attention computation is softmax(QKᵀ/√d)V. Fluency here pays for itself daily.
Why ML uses linear algebra
Three reasons:
- GPUs do matrix multiplication fast. Anything you can express as a matmul, the hardware will fly through. So the field structures models around matmul-friendly ops.
- Vectors capture meaning geometrically. Similar things end up close in vector space. Distance becomes similarity. Direction becomes attribute.
- It composes. Stack matrices, get a deeper transformation. Differentiate matrices, get a gradient. The whole stack is linear-algebra-shaped.
Vectors
A vector is an ordered list of numbers. Geometrically, a point or arrow in n-dimensional space.
import numpy as np
v = np.array([1.0, 2.0, 3.0]) # 3-dimensional vector
In ML, vectors represent:
- A single data point ([height, weight, age])
- An embedding ([0.13, -0.27, ..., 0.04] of length 768 or 1536 or 4096)
- Model parameters (weights of one neuron)
- A token’s representation at one layer of a transformer
Operations
Addition — element-wise, same shape:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
a + b # [5, 7, 9]
Scalar multiplication — scales all elements:
2 * a # [2, 4, 6]
Dot product — sum of element-wise products:
np.dot(a, b) # 1*4 + 2*5 + 3*6 = 32
a @ b # same thing, modern syntax
The dot product has two interpretations, both essential:
- Algebraic: a · b = Σ aᵢbᵢ
- Geometric: a · b = |a| |b| cos θ, where θ is the angle between them
The geometric interpretation is why dot product = similarity: vectors pointing the same way have a high dot product; orthogonal vectors have zero.
Norm (length): |v| = √(v · v). The L2 norm. Other norms exist (L1 = sum of absolute values; L∞ = max element); L2 is by far the most common.
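A quick check in NumPy (values chosen for illustration):
v = np.array([3.0, 4.0])
np.linalg.norm(v)               # L2 norm: 5.0
np.linalg.norm(v, ord=1)        # L1 norm: 7.0
np.linalg.norm(v, ord=np.inf)   # L∞ norm: 4.0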
Cosine similarity: dot product, normalized by lengths.
def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
Range: [−1, 1]. Used everywhere in retrieval, RAG, and embedding-based search.
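As a usage sketch, ranking a few made-up document embeddings against a query by cosine similarity (the vectors below are invented purely for illustration):
query = np.array([0.2, 0.1, 0.9])
docs = np.array([[0.1, 0.0, 1.0],    # hypothetical document embeddings
                 [0.9, 0.2, 0.1],
                 [0.0, 0.5, 0.5]])
sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
ranking = np.argsort(-sims)          # document indices, most similar first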
Matrices
A matrix is a 2D array. Shape (m, n) = m rows, n columns.
M = np.array([[1, 2, 3],
[4, 5, 6]]) # shape (2, 3)
In ML, matrices represent:
- A batch of vectors (rows = examples, columns = features)
- A linear transformation (a function from one vector space to another)
- A weight matrix in a neural network layer
Matrix-vector product
Mv applies the transformation M to the vector v. If M is (m, n) and v is (n,), the result is (m,).
Mechanically: each row of M dot-products with v to produce one element of the output.
M = np.array([[1, 2], [3, 4], [5, 6]]) # (3, 2)
v = np.array([1, 1]) # (2,)
M @ v # [3, 7, 11], shape (3,)
This is the linear part of one fully-connected layer in a neural network: output = W @ input + b (a nonlinearity is usually applied on top).
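A minimal sketch of that layer, with shapes and values invented for illustration:
W = np.random.randn(3, 2)   # weight matrix: 2 input features to 3 output features
b = np.zeros(3)             # bias vector
x = np.array([1.0, -1.0])   # one input example
output = W @ x + b          # shape (3,)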
Matrix-matrix product
AB where A is (m, k) and B is (k, n) produces (m, n). The inner dimensions must match.
Each entry (AB)ᵢⱼ is the dot product of row i of A with column j of B.
A = np.random.randn(64, 128)
B = np.random.randn(128, 256)
C = A @ B # (64, 256)
In a transformer, this is how a batch of token embeddings (64 tokens × 128 features) gets transformed by a weight matrix into a different feature space.
Pitfall: matrix multiplication is not commutative. AB ≠ BA in general (and may not even be defined in both orders).
Transpose
Aᵀ swaps rows and columns. Shape (m, n) becomes (n, m).
A.T
Used everywhere — most importantly in softmax(QKᵀ/√d)V in attention.
Special matrices
- Identity I: ones on the diagonal, zeros elsewhere. IA = AI = A.
- Diagonal: nonzero only on the diagonal. Easy to invert.
- Symmetric: A = Aᵀ. Has real eigenvalues.
- Orthogonal: AᵀA = I. Preserves lengths and angles. Rotation/reflection (checked numerically below).
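A quick numerical check of these properties; the 2×2 rotation matrix below is a standard example of an orthogonal matrix, with the angle chosen arbitrarily:
I = np.eye(3)
A = np.random.randn(3, 3)
np.allclose(I @ A, A) and np.allclose(A @ I, A)   # identity leaves A unchanged

theta = 0.7                                        # arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # 2D rotation, hence orthogonal
np.allclose(Q.T @ Q, np.eye(2))                    # QᵀQ = I
v = np.array([1.0, 2.0])
np.isclose(np.linalg.norm(Q @ v), np.linalg.norm(v))  # lengths preserved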
Inverse
A⁻¹ undoes A: A A⁻¹ = I. Only square matrices can have an inverse, and only if they’re non-singular (determinant ≠ 0).
In practice, don’t compute matrix inverses in ML code. Solve linear systems with np.linalg.solve(A, b) instead — faster and more numerically stable.
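For example (a small system with values invented for illustration):
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)   # solves Ax = b without ever forming A⁻¹
np.allclose(A @ x, b)       # True; here x = [2, 3]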
Eigenvectors and eigenvalues
For a square matrix A, an eigenvector v satisfies Av = λv for some scalar λ (the eigenvalue). The matrix only stretches v; it doesn’t change its direction.
Why care? Eigendecomposition reveals the principal axes of a transformation. PCA, spectral clustering, and parts of optimization theory all rest on it.
eigvals, eigvecs = np.linalg.eig(A)   # A: any square matrix; columns of eigvecs are the eigenvectors
For symmetric matrices (like covariance matrices), the eigenvectors are orthogonal and the eigenvalues are real — both very nice properties.
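A small check with a symmetric matrix (values invented for illustration):
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                   # symmetric
eigvals, eigvecs = np.linalg.eig(A)          # eigenvalues 3 and 1, in some order
v, lam = eigvecs[:, 0], eigvals[0]           # first eigenvector/eigenvalue pair
np.allclose(A @ v, lam * v)                  # Av = λv
np.allclose(eigvecs.T @ eigvecs, np.eye(2))  # eigenvectors are orthonormal
In practice, np.linalg.eigh is the better routine for symmetric matrices: it guarantees real eigenvalues and returns them sorted.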
Singular value decomposition (SVD)
A = U Σ Vᵀ for any matrix. The most general decomposition there is. Used in:
- PCA (the singular values of the centered data matrix measure the spread along the principal axes, proportional to the standard deviations)
- LoRA fine-tuning (weight updates constrained to a low-rank product, motivated by this low-rank structure)
- Embedding compression
- Numerical-stability tricks in optimization
U, S, Vt = np.linalg.svd(A)
The singular values in S are non-negative and sorted in decreasing order. Keep the top k → get the best rank-k approximation of A. This low-rank idea is the heart of LoRA (Stage 10).
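A sketch of that truncation, with sizes invented for illustration:
A = np.random.randn(64, 32)
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 8
A_k = (U[:, :k] * S[:k]) @ Vt[:k, :]   # best rank-k approximation of A
np.linalg.norm(A - A_k)                # approximation error (Frobenius norm)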
Broadcasting
NumPy and PyTorch silently expand smaller arrays to match larger ones when shapes are compatible. This is broadcasting — it makes ML code concise but is a common source of bugs.
Rules: align shapes from the right. Each pair of dimensions must be equal, or one must be 1.
M = np.ones((3, 4))
v = np.array([1, 2, 3, 4]) # shape (4,)
M + v # works: v is broadcast across 3 rows
Pitfall: broadcasting + an unintended dimension can silently produce wrong results without an error. Always print .shape.
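A sketch of how this goes wrong (shapes invented for illustration): adding a column vector to a 1-D vector quietly produces a matrix.
a = np.zeros((3, 1))    # column vector with a stray trailing dimension
b = np.ones(3)          # plain 1-D vector, shape (3,)
(a + b).shape           # (3, 3): an outer "sum", probably not what was intended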
Tensors
A tensor is the generalization: 0-D = scalar, 1-D = vector, 2-D = matrix, 3-D and beyond = “tensor” colloquially.
In ML, common tensor shapes:
- (batch, features) — classical ML
- (batch, channels, height, width) — vision
- (batch, sequence_length, hidden_size) — language models
- (batch, heads, sequence_length, head_dim) — multi-head attention
PyTorch and JAX are tensor-first frameworks; NumPy works fine for learning.
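For example, a language-model activation tensor (sizes invented for illustration):
x = np.zeros((32, 128, 768))   # (batch, sequence_length, hidden_size)
x.shape                        # (32, 128, 768)
x[0].shape                     # (128, 768): one sequence of token vectors
x[0, 0].shape                  # (768,): one token's embedding, back to a vector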
Why this matters for transformers
Self-attention, the central operation in modern AI, is:
attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Every symbol there is from this chapter:
- Q, K, V are matrices of shape (seq_len, d_k)
- QKᵀ is a matrix multiplication producing a (seq_len, seq_len) similarity matrix
- √d_k is a scalar normalizing for stable gradients
- softmax turns each row into a probability distribution
- The final matmul with V produces the new representations
If QKᵀ and matrix multiplication feel solid, attention is just a recipe.
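A minimal sketch of that recipe in NumPy, assuming a single head, no masking, and no batching, with shapes invented for illustration:
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len) similarity matrix
    return softmax(scores, axis=-1) @ V       # (seq_len, d_k) new representations

seq_len, d_k = 5, 8
Q, K, V = [np.random.randn(seq_len, d_k) for _ in range(3)]
attention(Q, K, V).shape                      # (5, 8)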
Exercises
- Verify dot product = projection. Take a = [3, 4] and b = [1, 0]. Compute a·b. Then compute |a| |b| cos θ, where θ is the angle between them. Match.
- Implement matmul. Write matmul(A, B) from scratch in pure Python (no NumPy). Check against A @ B.
- PCA from SVD. Take 1000 random 2D points stretched along an axis. SVD them. The first right singular vector should align with that axis.
- Cosine vs Euclidean. Generate three vectors. Rank pairs by cosine similarity and by Euclidean distance. Are the rankings the same? Why not?
See also
- Calculus & optimization — gradients build on matrix-valued derivatives
- Stage 06 — Self-attention — where this all pays off
- Stage 10 — LoRA — low-rank decomposition in the wild