Semantic Geometry

Embeddings are vectors in some d-dimensional space. The geometry of that space — distances, directions, clusters — encodes meaning. Understanding this geometry is what lets you reason about retrieval, clustering, and the quirks of embedding-based systems.

The basic claim

In a well-trained embedding space:

  • Similar things are close.
  • Different things are far apart.
  • Directions encode attributes.

This is the “semantic geometry” of the space. None of it is an axiom — it’s an emergent property of training.

Cosine vs Euclidean similarity

For embedding similarity, cosine is the default. Why?

Cosine similarity of a and b:

cos(a, b) = (a · b) / (|a| · |b|)

It depends only on the direction of the vectors, not their magnitude. Embedding magnitudes are often arbitrary (they depend on training quirks); direction is what encodes meaning.
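In code (a plain-NumPy sketch; no particular model assumed):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Depends only on direction: scaling a or b leaves the result unchanged.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a  # same direction, much larger magnitude
# cosine_similarity(a, b) is 1.0 (up to float rounding): magnitude is ignored
```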

Euclidean distance:

d(a, b) = |a − b|

In high dimensions, Euclidean distance becomes less informative — see “curse of dimensionality” below.

Practical note: if you L2-normalize embeddings (scale each so |a| = 1), ranking by cosine similarity and ranking by Euclidean distance give identical orderings, since d² = 2 − 2·cos for unit vectors. Most embedding APIs return normalized vectors.
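A quick check of that equivalence, assuming nothing beyond NumPy (the "corpus" here is random vectors, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # L2-normalize rows
query = rng.normal(size=64)
query /= np.linalg.norm(query)

# Rank by cosine (descending) and by Euclidean distance (ascending):
cos_order = np.argsort(-(docs @ query))
euc_order = np.argsort(np.linalg.norm(docs - query, axis=1))
# For unit vectors the two orderings are identical.
```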

High-dimensional intuition

Geometry in high dimensions defies low-dimensional intuition.

  • In 1000 dimensions, almost all random vectors are nearly orthogonal.
  • The volume of a high-dimensional ball is concentrated near its surface.
  • “Distance” loses contrast — the ratio of nearest to farthest distances → 1.
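The near-orthogonality claim is easy to verify empirically (random Gaussian vectors stand in for embeddings here):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
x = rng.normal(size=(500, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarities of 250 random pairs: concentrated near 0,
# with spread on the order of 1/sqrt(d) ≈ 0.03.
cosines = np.sum(x[0::2] * x[1::2], axis=1)
```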

What this means for embeddings:

  • High-dim spaces have room for many distinctions. A 1024-dim embedding can encode many independent attributes orthogonally.
  • Small differences in cosine similarity are meaningful (0.85 vs 0.80 can be a big gap).
  • Visualizations (t-SNE, UMAP) project to 2D — useful for exploration but lossy.

The compositional structure

Famously, in good embedding spaces:

king − man + woman ≈ queen

This works because the model learned a roughly linear “gender” direction. Subtracting “man” and adding “woman” moves you along that direction.

It’s not perfect; it’s not a rigorous algebra. But it shows that embeddings can encode attributes as directions, not just positions.
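A toy illustration of the idea (hand-built vectors, not a trained model; the "royalty" and "gender" axes are invented for the example):

```python
import numpy as np

royalty = np.array([1.0, 0.0])
gender = np.array([0.0, 1.0])   # sign convention is arbitrary

king = royalty + gender
man = gender.copy()
woman = -gender
queen = royalty - gender

# king − man + woman moves along the gender direction and lands on queen
result = king - man + woman
```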

In practice:

  • Sentiment polarity direction (positive vs negative)
  • Sentiment intensity direction
  • Formal-vs-casual direction
  • Time direction (for temporal embeddings)

You can sometimes find these directions and use them for editing or controlled generation.
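One simple way to estimate such a direction is the difference of class means, sketched below (a common baseline; a linear probe is the heavier alternative):

```python
import numpy as np

def attribute_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Unit-norm difference of class means.

    pos: (n, d) embeddings of examples with the attribute;
    neg: (m, d) embeddings of examples without it.
    """
    direction = pos.mean(axis=0) - neg.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Score an embedding along the direction: score = embedding @ direction
```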

Anisotropy

Trained embedding spaces are often anisotropic — not uniformly distributed. Vectors tend to cluster in a narrow cone, making cosine similarities all look high.

If the average cosine similarity of random pairs is 0.6 instead of ~0, your space is anisotropic. You’ll see this in pretrained encoder models without contrastive fine-tuning.

Mitigations:

  • Whitening: rotate and scale the space to be isotropic. Simple post-processing.
  • Contrastive fine-tuning: the standard fix; modern embedding models do this.
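A minimal whitening sketch (center, then eigendecompose the covariance; this is the flavor of post-processing used by whitening approaches for sentence embeddings):

```python
import numpy as np

def whiten(embeddings: np.ndarray) -> np.ndarray:
    """Center, then rotate and scale so the covariance is ~identity."""
    x = embeddings - embeddings.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each principal axis by 1/sqrt(variance); clamp tiny eigenvalues.
    w = eigvecs @ np.diag(1.0 / np.sqrt(np.maximum(eigvals, 1e-12)))
    return x @ w
```

On anisotropic input (vectors crowded into a cone), the whitened vectors spread out and random-pair cosines drop back toward zero.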

Hubness

In high-dimensional retrieval, some “hub” points appear as the nearest neighbor of many queries, even when they are not relevant. These hubs distort retrieval.

Mitigations:

  • Mutual neighbor reranking
  • Hubness-aware reranking strategies
  • Often baked into modern embedding training
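Hubness is cheap to measure: count how often each database point shows up in some query’s top-k (the “k-occurrence”). A sketch:

```python
import numpy as np

def k_occurrence(db: np.ndarray, queries: np.ndarray, k: int = 10) -> np.ndarray:
    """For each database point, count how many queries rank it in their top-k.

    Assumes rows are L2-normalized so dot product = cosine similarity.
    A heavily skewed count distribution is the signature of hubness.
    """
    sims = queries @ db.T
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]   # top-k per query
    counts = np.zeros(len(db), dtype=int)
    np.add.at(counts, topk.ravel(), 1)
    return counts
```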

The sphere model

For L2-normalized embeddings, the space is the unit sphere S^{d-1}. Two consequences:

  1. Cosine similarity is just 1 − dist²/2, since |a − b|² = 2 − 2·cos(a, b) for unit vectors.
  2. The fraction of the sphere above a given similarity threshold (a spherical cap) shrinks rapidly as the threshold rises; most of the sphere sits at “moderately similar” values.
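Consequence 1 can be checked numerically in a couple of lines:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 128))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos = float(a @ b)
dist_sq = float(np.linalg.norm(a - b) ** 2)
# For unit vectors: dist² = 2 − 2·cos, i.e. cos = 1 − dist²/2
```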

This is why retrieval@k metrics matter: at small k, you’re carving out a small cone of nearest neighbors. The space outside that cone is vast and full of irrelevant stuff.

Locality-sensitive hashing and ANN

Exact nearest-neighbor search is O(n) per query. Approximate nearest neighbor (ANN) algorithms exploit the geometry:

  • HNSW: Hierarchical Navigable Small World graphs. Default in most vector DBs.
  • IVF: Inverted File Index — partition space into Voronoi cells, search only relevant cells.
  • PQ (Product Quantization): compress vectors into small codes for memory and speed.
  • LSH: Locality-sensitive hashing — cheap, simple, less common now.

For most use cases, HNSW is the right default. PQ is added when memory becomes a bottleneck.
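To get a feel for how ANN exploits the geometry, here is a minimal random-hyperplane LSH sketch for cosine similarity (toy code, not a production index):

```python
import numpy as np

class RandomHyperplaneLSH:
    """Each random hyperplane contributes one hash bit: which side of the
    plane a vector falls on. Vectors pointing in similar directions tend
    to agree on most bits, so they land in the same bucket."""

    def __init__(self, dim: int, n_bits: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))
        self.buckets: dict = {}

    def _key(self, v: np.ndarray) -> tuple:
        return tuple(bool(b) for b in (self.planes @ v) > 0)

    def add(self, idx: int, v: np.ndarray) -> None:
        self.buckets.setdefault(self._key(v), []).append(idx)

    def candidates(self, query: np.ndarray) -> list:
        # Only the query's own bucket is scanned; a real system would
        # probe nearby buckets or use several hash tables.
        return self.buckets.get(self._key(query), [])
```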

Embedding “drift” over time

If you re-train your embedding model, the new vectors are in a different space — not comparable to the old ones. Practical implications:

  • Don’t mix vectors from different models.
  • When you upgrade an embedding model, re-embed everything.
  • Some applications version their embedding space and migrate carefully.

Calibrating similarity scores

A cosine similarity of 0.83 doesn’t mean “83% similar” in any human sense. It depends on:

  • The model
  • The domain
  • How “similar” was defined during training (paraphrase? topic? entailment?)

Always calibrate on a small labeled set before setting thresholds. “Above 0.7 = relevant” is a guess until you’ve checked.
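One reasonable recipe for that calibration: sweep thresholds on the labeled set and keep the one with the best F1 (the metric choice is yours; F1 is just a common default):

```python
import numpy as np

def calibrate_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Pick the similarity threshold that maximizes F1 on a labeled set.

    scores: model similarity per pair; labels: 1 = relevant, 0 = not.
    """
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return float(best_t)
```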

Failure modes to know

  • Length bias: some models give longer documents systematically higher similarity. Check.
  • Translation drift: multilingual models sometimes place a sentence and its translation slightly farther apart than same-language synonyms.
  • Style over content: an embedding might score “formal text about cats” close to “formal text about dogs” because of style, not topic.
  • Out-of-domain collapse: queries far from the training distribution may get similar scores for everything.

The fix is usually: contrastive fine-tune on your domain, or use a model trained on similar data.
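The length-bias check in particular is cheap to run (the `sims` and `doc_lengths` arrays here are hypothetical stand-ins for your own evaluation data):

```python
import numpy as np

def length_bias(sims: np.ndarray, doc_lengths: np.ndarray) -> float:
    """Pearson correlation between similarity score and document length.
    A strong positive value is a red flag for length bias."""
    return float(np.corrcoef(sims, doc_lengths)[0, 1])
```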

See also