demo

Meaning becomes geometry

An embedding model maps each piece of text to a 384-dimensional vector. Phrases with similar meaning land near each other; the angle between vectors becomes a similarity score. Pick two phrases and see how the math works on real data.

What you can see

  • Synonyms cluster. "the dog is running fast", "the canine is sprinting", "the puppy moves quickly" all live in one region — cosine similarity stays around 0.7–0.85.
  • Topics form regions. Food phrases cluster. Tech phrases cluster. They don't mix even when the surface structure of the sentences is similar.
  • Sentiment isn't always perpendicular. Compare "I had a wonderful day" with "today was absolutely terrible". Their cosine might still be moderate (~0.3–0.5) because they're both about days — embedding models encode topic and sentiment together, not as orthogonal axes.
  • The 2D plot lies a little. PCA collapses 384 dimensions to 2 — most variance gets thrown away. Two dots that look close in 2D might be far in 384D and vice versa. The cosine similarity number always uses the full vectors.
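The projection caveat is easy to demonstrate with a toy example. The vectors below are made-up 3-D stand-ins for 384-D embeddings, and the "projection" is just dropping the last coordinate rather than real PCA — but the distortion is the same kind:

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

# Two vectors whose first two coordinates are identical
# but whose third coordinates point in opposite directions.
a = [1.0, 1.0, 10.0]
b = [1.0, 1.0, -10.0]

print(cosine(a[:2], b[:2]))  # 2-D projection: ≈ 1.0 (look identical)
print(cosine(a, b))          # full 3-D: ≈ -0.96 (nearly opposite)
```

The dots would sit on top of each other in the 2-D plot while the full-dimensional similarity says they point almost opposite ways — which is why the similarity number always comes from the full vectors.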

The cosine-similarity math

# for two embeddings a, b ∈ ℝ³⁸⁴:
cos_sim(a, b) = (a · b) / (‖a‖ · ‖b‖)
              = Σᵢ aᵢ · bᵢ
                ─────────────────
                √(Σᵢ aᵢ²) · √(Σᵢ bᵢ²)

# range: [−1, 1]. Production systems normalize embeddings
# to unit length once at index time, so cos_sim simplifies
# to just the dot product (a · b).

Same dot product as the 2D vectors demo, just with 384 dimensions instead of 2. The number you see in this playground uses the full 384-d vectors; the 2D scatter only shows the first 2 PCA components.

Try this — predict before you click

  1. Compare "the cat is sleeping" and "a kitten is napping". Predict: cosine ≈ 0.7–0.85. Synonyms produce high similarity but not 1.0 — the embedder distinguishes "cat" from "kitten" along an age/specificity axis.
  2. Compare "the dog is running fast" and "the database query is slow". Predict: cosine drops below 0.3. Both are valid English sentences with the same shape (subject-verb-adverb), but the topics are unrelated. Embedders encode topic, not just grammar.
  3. Compare "I love this movie" and "I hate this movie". Predict: cosine stays moderately high (~0.5–0.7). Sentiment alone doesn't produce orthogonality — the topic ("this movie") dominates. This is why naive nearest-neighbor doesn't capture sentiment; you need a separate sentiment model.
  4. Click any phrase to make it the anchor, then look at the ranked list. Predict: the top 3–5 are paraphrases or topic-matches; the bottom 3–5 are unrelated. The geometry of the full 384-d space is what makes that ranking robust — way more so than the 2D PCA scatter suggests.
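The anchor-and-rank behavior in experiment 4 can be sketched in a few lines. The vectors below are made-up 3-d stand-ins (a real run would use the 384-d model outputs), but the sort-by-cosine logic is exactly what the playground does:

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# hypothetical low-dimensional embeddings for illustration
phrases = {
    "a kitten is napping":        [0.9, 0.1, 0.0],
    "the dog is running fast":    [0.7, 0.6, 0.1],
    "the database query is slow": [0.0, 0.1, 0.9],
}
anchor_text, anchor_vec = "the cat is sleeping", [0.95, 0.05, 0.0]

# rank every phrase against the anchor, most similar first
ranked = sorted(phrases.items(),
                key=lambda kv: cos_sim(anchor_vec, kv[1]),
                reverse=True)
for text, vec in ranked:
    print(f"{cos_sim(anchor_vec, vec):.3f}  {text}")
```

The paraphrase tops the list, the topic-adjacent phrase lands in the middle, and the unrelated one sinks to the bottom — the same shape of ranking you should predict before clicking.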

How embeddings get used

Anywhere you need "find similar text". Semantic search (RAG — see RAG Visualizer), deduplication, classification by nearest neighbor, recommendation ("things like this"), clustering. The retrieval half of every production AI app is just embeddings + a fast nearest-neighbor index.
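That retrieval half can be sketched as a brute-force scan; production systems swap the linear pass for an approximate index (FAISS, HNSW), but the contract — query vector in, top-k nearest items out — is the same. A sketch assuming unit-normalized vectors and a hypothetical toy index:

```python
import heapq

def top_k(query, index, k=2):
    """Linear-scan nearest neighbors. Vectors are assumed unit-normalized,
    so cosine similarity reduces to a plain dot product."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    return heapq.nlargest(k, index.items(), key=lambda kv: dot(query, kv[1]))

# toy unit vectors standing in for normalized document embeddings
index = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.0, 1.0],
    "doc-c": [0.6, 0.8],
}
hits = top_k([0.8, 0.6], index, k=2)
print([name for name, _ in hits])  # most similar documents first
```

Dedup, classification-by-neighbor, and recommendation are all the same loop with a different interpretation of the hits.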

The model

Embeddings here come from sentence-transformers/all-MiniLM-L6-v2 (~80MB, ~22M params, 384 dims). It's small, fast, surprisingly competent. Production systems use bigger embedders (OpenAI text-embedding-3, Cohere embed-v3, BGE-large) when accuracy matters more than latency, but the shape of the output and the cosine-similarity math is identical.

Anchored to 05-tokens-embeddings/semantic-geometry from the learning path.