Unsupervised Learning
No labels — find structure in the data itself. Three classical problems: clustering, dimensionality reduction, density estimation. Modern self-supervised learning grew out of this lineage.
Clustering
Group similar items.
k-means
Pick k initial cluster centers; assign each point to its nearest center; recompute each center as the mean of its assigned points; repeat until the assignments stop changing.
from sklearn.cluster import KMeans

# init="k-means++" is the sklearn default; labels_ holds each point's cluster index
km = KMeans(n_clusters=8).fit(X)
labels = km.labels_
- Fast and simple, but assumes spherical clusters of similar size.
- Sensitive to initialization (k-means++ is the standard fix).
- You have to choose k (use the elbow method or silhouette score).
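A minimal sketch of picking k with the silhouette score (assuming X is already scaled; the 2 to 10 range is an arbitrary choice):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Higher silhouette score means tighter, better-separated clusters.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)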
DBSCAN, HDBSCAN
Density-based: clusters are dense regions; points in sparse regions are noise.
- Doesn’t require choosing k.
- Handles non-spherical clusters and outliers naturally.
- HDBSCAN adapts to varying density — usually preferred for embedding clustering.
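A minimal sketch, assuming the hdbscan package is installed (scikit-learn 1.3+ also ships sklearn.cluster.HDBSCAN with a similar interface); the eps and min_cluster_size values are placeholders to tune:

from sklearn.cluster import DBSCAN
import hdbscan

# DBSCAN needs an eps radius; points in sparse regions get the noise label -1.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# HDBSCAN drops eps and adapts to varying density; min_cluster_size is the main knob.
hdb_labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)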
Hierarchical clustering
Build a tree of merges (agglomerative) or splits (divisive). The dendrogram gives you any number of clusters by cutting at different heights.
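A sketch with SciPy (Ward linkage and the cut into 5 clusters are arbitrary choices; linkage builds the full merge tree, so memory grows with n squared):

from scipy.cluster.hierarchy import linkage, fcluster

# Agglomerative merges with Ward linkage, then cut the dendrogram into 5 clusters.
# scipy.cluster.hierarchy.dendrogram(Z) draws the tree if you want to pick the cut visually.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=5, criterion="maxclust")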
Dimensionality reduction
Compress high-dimensional data while preserving structure.
PCA (Principal Component Analysis)
Find the orthogonal directions of maximum variance. Project data onto the top k.
- Linear method.
- Optimal linear projection in the least-squares reconstruction sense; captures the full structure only when the data is roughly Gaussian.
- Used for visualization, denoising, feature decorrelation.
from sklearn.decomposition import PCA

# fit_transform returns the data projected onto the top 2 principal components
X_2d = PCA(n_components=2).fit_transform(X)
t-SNE
Non-linear, focuses on preserving local neighborhoods. Great for visualization (2D plots that actually show clusters).
Pitfall: distances in t-SNE plots are not meaningful. Two close clusters might be far in the original space. Use t-SNE for “do these clusters separate?”, not “how far apart are they?”
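A minimal sketch (perplexity 30 is just the common default; results shift with it and with the random seed):

from sklearn.manifold import TSNE

# perplexity roughly sets the effective neighborhood size
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)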
UMAP
Successor to t-SNE for most purposes. Faster, preserves more global structure, has solid theoretical grounding (manifold learning).
import umap  # PyPI package name: umap-learn

reducer = umap.UMAP(n_components=2)
X_2d = reducer.fit_transform(X)
UMAP and HDBSCAN together are the modern default for “I have embeddings, find clusters and visualize them.”
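One common shape for that pipeline (the dimensionalities and min_cluster_size below are illustrative, not canonical): reduce to a moderate dimensionality for clustering, and keep a separate 2D projection purely for plotting.

import umap
import hdbscan

# Cluster in ~10 dimensions (less distortion than 2D), plot in 2D.
X_low = umap.UMAP(n_components=10, n_neighbors=15, min_dist=0.0).fit_transform(X)
labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(X_low)
X_plot = umap.UMAP(n_components=2).fit_transform(X)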
Autoencoders
A neural network that compresses and reconstructs. The bottleneck is your low-dim representation.
input → encoder → latent → decoder → reconstruction
Trained to minimize reconstruction error. Variants: denoising AE, variational AE (VAE), masked AE. Modern foundation models (BERT, MAE) are essentially huge autoencoders.
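A minimal PyTorch sketch, assuming flattened 784-dimensional inputs and a DataLoader named loader (both placeholders):

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)        # the bottleneck representation
        return self.decoder(z)     # the reconstruction

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for batch in loader:               # placeholder DataLoader of flattened inputs
    recon = model(batch)
    loss = nn.functional.mse_loss(recon, batch)   # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()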
Density estimation
Estimate P(x) itself.
- Kernel density estimation (KDE): place a small bump at each data point; sum.
- Gaussian mixture models (GMM): fit a weighted sum of Gaussians via EM.
- Normalizing flows / diffusion models / autoregressive models: modern, used in generative AI (Stage 12).
A well-calibrated density model lets you do anomaly detection, generate samples, and compute likelihoods.
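A sketch of the first two with scikit-learn (3 components, bandwidth 0.5, and the 1% anomaly threshold are all arbitrary choices):

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

# GMM: fit a 3-component mixture with EM, then use log p(x) for anomalies and sampling.
gmm = GaussianMixture(n_components=3).fit(X)
log_p = gmm.score_samples(X)
anomalies = log_p < np.percentile(log_p, 1)   # flag the lowest-density 1%
samples, _ = gmm.sample(100)                  # draw 100 new points from the fitted density

# KDE: a Gaussian bump at each point; bandwidth controls smoothness.
kde = KernelDensity(bandwidth=0.5).fit(X)
log_p_kde = kde.score_samples(X)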
Self-supervised learning (the modern unsupervised)
Generate labels from the structure of the data itself:
- Masked language modeling (BERT): mask tokens, predict them.
- Next-token prediction (GPT): predict the next token.
- Contrastive learning (SimCLR, CLIP): pull similar pairs together, push different pairs apart in embedding space (loss sketched after this list).
- Masked image modeling (MAE): mask patches of an image, reconstruct.
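A minimal InfoNCE-style loss in PyTorch to make the contrastive idea concrete (this is the flavor of the SimCLR/CLIP objective, not either paper's exact implementation; z1[i] and z2[i] are assumed to be embeddings of a positive pair):

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature           # similarity of every pair in the batch
    targets = torch.arange(z1.size(0))         # the matching index is the positive
    return F.cross_entropy(logits, targets)    # pull positives together, push the rest apart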
This is what trains foundation models. “Unsupervised” in the textbook sense is mostly classical now; self-supervised is where the action is. We dive into it in Stages 04–06.
When to use unsupervised
- Exploration: you have raw logs/embeddings/text and want to see what’s there.
- Pre-processing: PCA before a classifier; clustering for stratified sampling.
- Anomaly detection: model normal, flag outliers.
- Initialization for downstream tasks: pretrain unsupervised, fine-tune supervised.
Practical advice
- For embedding clustering: UMAP + HDBSCAN by default.
- For tabular dim reduction: PCA first, then ask if you need anything fancier.
- Always scale features before clustering or PCA. Distance-based methods fall apart on unscaled, mixed-unit data (snippet below).
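The standard recipe (StandardScaler shown; RobustScaler is a reasonable swap when outliers are heavy):

from sklearn.preprocessing import StandardScaler

# Zero-mean, unit-variance features so no single unit dominates the distances.
X_scaled = StandardScaler().fit_transform(X)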
See also
- Stage 05 — Embeddings — where unsupervised representation learning becomes useful
- Stage 09 — RAG retrieval — clustering helps with retrieval analysis