Unsupervised Learning

No labels — find structure in the data itself. Three classical problems: clustering, dimensionality reduction, density estimation. Modern self-supervised learning grew out of this lineage.

Clustering

Group similar items.

k-means

Pick k cluster centers; assign each point to its nearest center; recompute each center as the mean of its assigned points; repeat until assignments stop changing.

from sklearn.cluster import KMeans

# k-means++ initialization is sklearn's default; fix the seed for reproducible runs
km = KMeans(n_clusters=8, random_state=0).fit(X)
labels = km.labels_   # cluster index for each row of X

  • Fast and simple, but assumes spherical clusters of similar size.
  • Sensitive to initialization (k-means++ is the standard fix).
  • You have to choose k (use the elbow method or silhouette score; see the sketch below).
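
A quick way to pick k in practice: fit k-means over a range of candidate values and compare silhouette scores (higher is better). A minimal sketch; the range of k values is an illustrative choice.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# score each candidate k and keep the best one
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)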

DBSCAN, HDBSCAN

Density-based: clusters are dense regions; points in sparse regions are noise.

  • Doesn’t require choosing k.
  • Handles non-spherical clusters and outliers naturally.
  • HDBSCAN adapts to varying density — usually preferred for embedding clustering (example below).
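
A minimal sketch with scikit-learn. HDBSCAN ships as sklearn.cluster.HDBSCAN in recent versions (1.3+); the standalone hdbscan package offers the same interface. The min_cluster_size value is illustrative, not a recommendation.

from sklearn.cluster import DBSCAN, HDBSCAN

db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # label -1 means noise
hdb_labels = HDBSCAN(min_cluster_size=15).fit_predict(X)    # no eps to tune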

Hierarchical clustering

Build a tree of merges (agglomerative) or splits (divisive). The dendrogram gives you any number of clusters by cutting at different heights.
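
A minimal sketch using scipy, assuming numeric features in X; sklearn's AgglomerativeClustering works too if you only need flat labels.

from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(X, method="ward")                      # bottom-up merge tree (the dendrogram)
labels = fcluster(Z, t=5, criterion="maxclust")    # cut the tree into 5 flat clusters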

Dimensionality reduction

Compress high-dimensional data while preserving structure.

PCA (Principal Component Analysis)

Find the orthogonal directions of maximum variance. Project data onto the top k.

  • Linear method.
  • Optimal linear compression in the least-squares reconstruction sense; most informative when variance reflects structure (roughly Gaussian-like data).
  • Used for visualization, denoising, feature decorrelation.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)   # pca.explained_variance_ratio_ shows how much variance the 2 components keep

t-SNE

Non-linear, focuses on preserving local neighborhoods. Great for visualization (2D plots that actually show clusters).

Pitfall: distances in t-SNE plots are not meaningful. Two close clusters might be far in the original space. Use t-SNE for “do these clusters separate?”, not “how far apart are they?”
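
A minimal sketch with scikit-learn; perplexity is the main knob, and the value here is just the library default made explicit.

from sklearn.manifold import TSNE

# 2D embedding for visualization only; don't feed it to downstream models
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)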

UMAP

Successor to t-SNE for most purposes. Faster, preserves more global structure, has solid theoretical grounding (manifold learning).

import umap

# n_neighbors and min_dist are the main knobs (defaults: 15 and 0.1)
reducer = umap.UMAP(n_components=2)
X_2d = reducer.fit_transform(X)

UMAP and HDBSCAN together are the modern default for “I have embeddings, find clusters and visualize them.”

Autoencoders

A neural network that compresses and reconstructs. The bottleneck is your low-dim representation.

input → encoder → latent → decoder → reconstruction

Trained to minimize reconstruction error. Variants: denoising AE, variational AE (VAE), masked AE. Masked-prediction foundation models (BERT, MAE) are essentially huge denoising/masked autoencoders.
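
A minimal sketch in PyTorch, assuming flat input vectors of size in_dim; a real autoencoder would use a deeper encoder (conv or transformer blocks) and a proper training loop.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)                  # the bottleneck: your low-dim representation
        return self.decoder(z)

model = Autoencoder(in_dim=784)
x = torch.randn(64, 784)                     # dummy batch
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error to minimize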

Density estimation

Estimate P(x) itself.

  • Kernel density estimation (KDE): place a small bump at each data point; sum.
  • Gaussian mixture models (GMM): fit a weighted sum of Gaussians via EM.
  • Normalizing flows / diffusion models / autoregressive models: modern, used in generative AI (Stage 12).

A well-calibrated density model lets you do anomaly detection, generate samples, and compute likelihoods.
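
Both classical options have scikit-learn implementations. A hedged sketch, where the bandwidth and the number of mixture components are illustrative choices:

from sklearn.neighbors import KernelDensity
from sklearn.mixture import GaussianMixture

kde = KernelDensity(bandwidth=0.5).fit(X)
log_p_kde = kde.score_samples(X)             # log-density at each point

gmm = GaussianMixture(n_components=5).fit(X)
log_p_gmm = gmm.score_samples(X)             # unusually low values = candidate anomalies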

Self-supervised learning (the modern unsupervised)

Generate labels from the structure of the data itself:

  • Masked language modeling (BERT): mask tokens, predict them.
  • Next-token prediction (GPT): predict the next token.
  • Contrastive learning (SimCLR, CLIP): pull similar pairs together, push different pairs apart in embedding space (sketch below).
  • Masked image modeling (MAE): mask patches of an image, reconstruct.

This is what trains foundation models. “Unsupervised” in the textbook sense is mostly classical now; self-supervised is where the action is. We dive into it in Stages 04–06.
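
To make the contrastive idea concrete, here is a simplified InfoNCE-style loss in PyTorch; SimCLR and CLIP use symmetric, augmentation-specific variants of this.

import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.1):
    # z_a, z_b: (N, d) embeddings of two views of the same N items (row i pairs with row i)
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.T / temperature       # pairwise similarities
    targets = torch.arange(z_a.size(0))      # the matching pair sits on the diagonal
    return F.cross_entropy(logits, targets)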

When to use unsupervised

  • Exploration: you have raw logs/embeddings/text and want to see what’s there.
  • Pre-processing: PCA before a classifier; clustering for stratified sampling.
  • Anomaly detection: model normal, flag outliers.
  • Initialization for downstream tasks: pretrain unsupervised, fine-tune supervised.

Practical advice

  • For embedding clustering: UMAP + HDBSCAN by default.
  • For tabular dim reduction: PCA first, then ask if you need anything fancier.
  • Always scale features before clustering or PCA. Distance-based methods fall apart on un-scaled mixed-unit data (see the sketch below).
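
A minimal sketch of the scaling step, chained with PCA in a pipeline:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# zero-mean, unit-variance scaling before any distance-based step
X_2d = make_pipeline(StandardScaler(), PCA(n_components=2)).fit_transform(X)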

See also