Semantic Geometry
Embeddings are vectors in some d-dimensional space. The geometry of that space — distances, directions, clusters — encodes meaning. Understanding this geometry is what lets you reason about retrieval, clustering, and the quirks of embedding-based systems.
The basic claim
In a well-trained embedding space:
- Similar things are close.
- Dissimilar things are far apart.
- Directions encode attributes.
This is the “semantic geometry” of the space. None of it is an axiom — it’s an emergent property of training.
Cosine similarity vs Euclidean distance
For embedding similarity, cosine is the default. Why?
Cosine similarity of a and b:
cos(a, b) = (a · b) / (|a| · |b|)
It only depends on the direction of the vectors, not their magnitude. Embedding magnitudes are often arbitrary (depend on training quirks); direction encodes meaning.
Euclidean distance:
d(a, b) = |a − b|
In high dimensions, Euclidean distance becomes less informative — see “curse of dimensionality” below.
Practical note: if you L2-normalize embeddings (set |a| = 1), then ranking by cosine similarity and ranking by Euclidean distance give the same order. Most embedding APIs return normalized vectors.
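A small sketch of this in NumPy, with random vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=128)
docs = rng.normal(size=(1000, 128))

def cosine_sim(a, B):
    return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a))

def euclidean_dist(a, B):
    return np.linalg.norm(B - a, axis=1)

# L2-normalize everything.
q = query / np.linalg.norm(query)
D = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# After normalization, ranking by cosine (descending) and by Euclidean
# distance (ascending) gives the same order, since d^2 = 2 - 2*cos.
by_cos = np.argsort(-cosine_sim(q, D))
by_dist = np.argsort(euclidean_dist(q, D))
assert np.array_equal(by_cos, by_dist)
```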
High-dimensional intuition
Geometry in high dimensions defies low-dimensional intuition.
- In 1000 dimensions, almost all random vectors are nearly orthogonal.
- Most of the volume of a high-dimensional ball is concentrated near its surface.
- “Distance” loses contrast — the ratio of nearest to farthest distances → 1.
What this means for embeddings:
- High-dim spaces have room for many distinctions. A 1024-dim embedding can encode many independent attributes orthogonally.
- Small differences in cosine similarity are meaningful (0.85 vs 0.80 can be a big gap).
- Visualizations (t-SNE, UMAP) project to 2D — useful for exploration but lossy.
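A quick numerical check of the near-orthogonality and distance-contrast claims, using random unit vectors as a stand-in for embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 1000):
    x = rng.normal(size=(2000, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)

    # Cosine between random pairs concentrates around 0 as d grows.
    cos = np.sum(x[:1000] * x[1000:], axis=1)

    # Distance contrast from one point to all others: nearest/farthest -> 1.
    dists = np.linalg.norm(x[1:] - x[0], axis=1)

    print(f"d={d:4d}  mean |cos| = {np.abs(cos).mean():.3f}  "
          f"nearest/farthest = {dists.min() / dists.max():.3f}")
```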
The compositional structure
Famously, in good embedding spaces:
king − man + woman ≈ queen
This works because the model learned a roughly linear “gender” direction. Subtracting “man” and adding “woman” moves you along that direction.
It’s not perfect; it’s not a rigorous algebra. But it shows that embeddings can encode attributes as directions, not just positions.
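You can reproduce this with classic word vectors, for example through gensim's downloader. A sketch; the exact neighbors and scores depend on the model you pick:

```python
import gensim.downloader as api

# 50-dimensional GloVe vectors: a small model, enough for the demo.
kv = api.load("glove-wiki-gigaword-50")

# king - man + woman -> ?
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically shows up at or near the top.
```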
Directions that show up in practice:
- Sentiment direction
- Sentiment intensity direction
- Formal-vs-casual direction
- Time direction (for temporal embeddings)
You can sometimes find these directions and use them for editing or controlled generation.
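A sketch of estimating such a direction by contrasting two small sets of examples; `embed` is a hypothetical function that returns one vector per text:

```python
import numpy as np

def find_direction(examples_a, examples_b, embed):
    """Estimate a direction pointing from attribute B toward attribute A."""
    a = np.mean([embed(t) for t in examples_a], axis=0)
    b = np.mean([embed(t) for t in examples_b], axis=0)
    direction = a - b
    return direction / np.linalg.norm(direction)

def score_along(texts, direction, embed):
    """Project texts onto the direction; higher means more A-like."""
    vecs = np.stack([embed(t) for t in texts])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs @ direction

# e.g. formality = find_direction(formal_examples, casual_examples, embed)
#      score_along(candidates, formality, embed)
```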
Anisotropy
Trained embedding spaces are often anisotropic — not uniformly distributed. Vectors tend to cluster in a narrow cone, making cosine similarities all look high.
If the average cosine similarity between random pairs is 0.6 instead of ~0, your space is anisotropic. You’ll see this in pretrained encoder models that haven’t had contrastive fine-tuning.
Mitigations:
- Whitening: rotate and scale the space to be isotropic. Simple post-processing (sketched after this list).
- Contrastive fine-tuning: the standard fix; modern embedding models do this.
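Whitening itself is a few lines of NumPy. A sketch: center the vectors, then rotate and rescale so the covariance is roughly the identity. `mean_random_cosine` is the anisotropy check described above:

```python
import numpy as np

def whiten(embeddings, eps=1e-8):
    """Map an (n, d) array so its mean is 0 and its covariance is ~identity."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(np.cov(centered, rowvar=False))
    return centered @ u @ np.diag(1.0 / np.sqrt(s + eps))

def mean_random_cosine(embeddings, n_pairs=10_000, seed=0):
    """Rough anisotropy check: average cosine of random pairs (~0 is isotropic)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(embeddings), size=n_pairs)
    j = rng.integers(0, len(embeddings), size=n_pairs)
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return float(np.mean(np.sum(x[i] * x[j], axis=1)))
```

Run the check before and after whitening; the mean cosine of random pairs should drop toward ~0.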
Hubness
In high-dim retrieval, some “hub” points turn up as nearest neighbors for many queries — even when they’re not relevant. They distort retrieval.
Mitigations:
- Mutual neighbor reranking
- Hubness-aware reranking strategies (one is sketched after this list)
- Often baked into modern embedding training
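One concrete hubness-aware rescoring scheme is CSLS (cross-domain similarity local scaling), which discounts candidates that are close to everything. A NumPy sketch over a precomputed query-by-document cosine matrix:

```python
import numpy as np

def csls(sim, k=10):
    """sim[i, j] = cosine(query_i, doc_j). Returns hubness-corrected scores."""
    # How "hubby" each doc is: mean similarity to its k closest queries.
    r_doc = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
    # Symmetric term for each query: mean similarity to its k closest docs.
    r_query = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    # Discount pairs whose raw similarity is mostly explained by hubness.
    return 2 * sim - r_query[:, None] - r_doc[None, :]
```

Note it needs a reasonable batch of queries; with a single query the correction reduces to a constant shift.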
The sphere model
For L2-normalized embeddings, the space is the unit sphere S^{d-1}. Two consequences:
- Cosine similarity is just 1 − dist²/2 after normalization.
- The “volume” of similarity at a given threshold shrinks rapidly — there’s a lot of room near “moderately similar.”
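A quick check of that identity on random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 256))
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

dist = np.linalg.norm(a - b)
assert np.isclose(a @ b, 1 - dist**2 / 2)
```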
This is why retrieval@k metrics matter: at small k, you’re carving out a small cone of nearest neighbors. The space outside that cone is vast and full of irrelevant stuff.
Locality-sensitive hashing and ANN
Exact nearest-neighbor search is O(n) per query. Approximate nearest neighbor (ANN) algorithms exploit the geometry:
- HNSW: Hierarchical Navigable Small World graphs. Default in most vector DBs.
- IVF: Inverted File Index — partition space into Voronoi cells, search only relevant cells.
- PQ (Product Quantization): compress vectors into small codes for memory and speed.
- LSH: Locality-sensitive hashing — cheap, simple, less common now.
For most use cases, HNSW is the right default. PQ is added when memory becomes a bottleneck.
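A minimal HNSW example, assuming the hnswlib package; the parameters are illustrative, not tuned:

```python
import hnswlib
import numpy as np

dim, n = 384, 10_000
rng = np.random.default_rng(0)
data = rng.normal(size=(n, dim)).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))

index.set_ef(64)  # query-time search breadth: higher = better recall, slower
labels, distances = index.knn_query(data[:5], k=10)  # distances are 1 - cosine
```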
Embedding “drift” over time
If you re-train your embedding model, the new vectors are in a different space — not comparable to the old ones. Practical implications:
- Don’t mix vectors from different models.
- When you upgrade an embedding model, re-embed everything.
- Some applications version their embedding space and migrate carefully.
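One low-tech way to enforce the first two rules is to store the embedding model's identifier alongside every vector and filter on it at query time. A sketch with a hypothetical vector store `db`:

```python
EMBEDDING_MODEL = "my-embedder-v2"  # hypothetical model identifier

def store(doc_id, vector, db):
    # Tag every vector with the model that produced it.
    db.upsert(doc_id, {"vector": vector, "model": EMBEDDING_MODEL})

def search(query_vector, db, k=10):
    # Never compare across spaces: only retrieve vectors from the current model.
    return db.query(query_vector, top_k=k, filter={"model": EMBEDDING_MODEL})
```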
Calibrating similarity scores
A cosine similarity of 0.83 doesn’t mean “83% similar” in any human sense. It depends on:
- The model
- The domain
- How “similar” was defined during training (paraphrase? topic? entailment?)
Always calibrate on a small labeled set before setting thresholds. “Above 0.7 = relevant” is a guess until you’ve checked.
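A calibration sketch: given a small set of labeled pairs, sweep candidate thresholds and keep the one that maximizes F1 (or whichever metric matches your cost of false positives):

```python
import numpy as np

def pick_threshold(scores, labels):
    """scores: cosine similarities for labeled pairs; labels: 1 = relevant, 0 = not."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_f1 = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1
```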
Failure modes to know
- Length bias: some models give longer documents systematically higher similarity. Check.
- Translation drift: multilingual models sometimes place translations slightly farther apart than same-language synonyms.
- Style over content: an embedding might score “formal text about cats” close to “formal text about dogs” because of style, not topic.
- Out-of-domain collapse: queries far from the training distribution may get similar scores for everything.
The fix is usually contrastive fine-tuning on your domain, or switching to a model trained on similar data.
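Some of these are cheap to test. For length bias, for example, check the correlation between document length and average similarity to a set of queries; `embed` here is a hypothetical batch embedding function:

```python
import numpy as np

def length_bias(queries, docs, embed):
    """Correlation between doc length and mean similarity across queries.
    A strongly positive value suggests length bias."""
    q = embed(queries)                      # (n_queries, d)
    d = embed(docs)                         # (n_docs, d)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    mean_sim = (q @ d.T).mean(axis=0)       # average similarity per doc
    lengths = np.array([len(t.split()) for t in docs], dtype=float)
    return float(np.corrcoef(lengths, mean_sim)[0, 1])
```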