Multimodal Embeddings (CLIP)
Map images and text into the same vector space. Then “find images that match this caption” is just nearest-neighbor search. This unlocks cross-modal retrieval, classification, and a lot more.
CLIP
OpenAI’s CLIP (Contrastive Language-Image Pre-training, Radford et al. 2021) was the breakthrough. The architecture:
- Text encoder (transformer) → text vector
- Image encoder (ViT or CNN) → image vector
- Both vectors land in the same shared embedding space.
Trained on 400M image-text pairs from the web with a contrastive loss:
- For each image and its true caption, pull them together.
- For each image and other captions in the batch, push apart.
- Symmetric: the same loss is applied in both directions (image→text and text→image).
The result: a shared embedding space where related images and texts are close.
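In code, that symmetric objective is just two cross-entropies over the same in-batch similarity matrix. A minimal PyTorch sketch, with a fixed temperature for simplicity (CLIP actually learns it as a parameter):
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Both inputs: [batch, dim], one matching (image, caption) pair per row
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Similarity of every image against every caption in the batch
    logits = image_embeds @ text_embeds.T / temperature
    targets = torch.arange(len(logits), device=logits.device)
    # Image→text direction: each image should pick out its own caption
    loss_i2t = F.cross_entropy(logits, targets)
    # Text→image direction: each caption should pick out its own image
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2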
Why it works
- The web supplies free supervision. Every alt-text is a (caption, image) pair.
- Contrastive learning makes the geometry useful — similar things cluster.
- Both encoders learn rich representations.
CLIP’s image encoder, used alone, became a default visual feature extractor for hundreds of downstream tasks.
Zero-shot classification
Famously, CLIP does zero-shot classification:
labels = ["dog", "cat", "bird"]
text_embeds = clip.encode_text([f"a photo of a {l}" for l in labels])
image_embed = clip.encode_image(image)
scores = image_embed @ text_embeds.T
predicted = labels[scores.argmax()]
No labeled training data needed. Add or remove classes by changing the text prompts.
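The snippet above is pseudocode; a runnable version with the open-source open_clip package looks roughly like this (the model name, pretrained tag, and image path are illustrative choices, not requirements):
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

labels = ["dog", "cat", "bird"]
text = tokenizer([f"a photo of a {l}" for l in labels])
image = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    text_embeds = model.encode_text(text)
    image_embed = model.encode_image(image)

# Normalize so the dot product is cosine similarity
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)
scores = (image_embed @ text_embeds.T).squeeze(0)
predicted = labels[scores.argmax().item()]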
This was a wake-up call for the field: pretraining on broad web data produced features that transfer to a wide range of visual tasks with no task-specific fine-tuning.
Successor models
SigLIP (Google, 2023)
Replaces CLIP’s softmax-based contrastive loss with a pairwise sigmoid loss that scores each image-text pair independently instead of normalizing over the batch. Trains better at scale and degrades more gracefully at smaller batch sizes.
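The core change fits in a few lines: every (image, text) pair in the batch gets an independent binary label instead of competing in a softmax. A minimal sketch, assuming learnable logit_scale and logit_bias scalars (variable names are mine, not SigLIP’s code):
import torch
import torch.nn.functional as F

def siglip_loss(image_embeds, text_embeds, logit_scale, logit_bias):
    # Embeddings: [batch, dim]; logit_scale and logit_bias are learnable scalars
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T * logit_scale + logit_bias
    # +1 on the diagonal (true pairs), -1 everywhere else (in-batch negatives)
    labels = 2 * torch.eye(len(logits), device=logits.device) - 1
    # Each pair is an independent binary decision; no softmax over the batch
    return -F.logsigmoid(labels * logits).sum() / len(logits)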
EVA-CLIP, OpenCLIP
Open replications and improvements over CLIP. OpenCLIP is the de facto open-source default; EVA-CLIP scales to billions of parameters with strong results.
CoCa (Contrastive Captioners)
Adds a captioning head — produces both contrastive embeddings and generated captions from one model.
BLIP-2 / InstructBLIP
Bridge a frozen vision encoder to a frozen LLM via a small “Q-Former” module. The lineage that led to modern VLMs.
Cohere Embed v4 / Voyage Multimodal-3 / Jina-CLIP-v2
Production multimodal embedders. Often handle text, image, and sometimes audio in one model.
LLM-as-multimodal-encoder
Increasingly, a multimodal LLM (Claude, GPT-4o, Qwen-VL) is used as an embedder by extracting hidden states. Quality often beats CLIP-family models, at higher cost.
Cross-modal retrieval
Index images with their CLIP embeddings; query with text embeddings from the same model.
# Indexing: store a CLIP image embedding for every image
for img in images:
    db.upsert(id=img.id, vector=clip.encode_image(img))

# Query: embed the text with the same model and search the index
query_vec = clip.encode_text("a sunset over mountains")
results = db.knn(query_vec, k=10)
Works for:
- Photo search by description.
- Stock-photo discovery.
- Visual product search.
- Image-to-image search (flip query and target; see the sketch below).
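For image-to-image search, only the query side changes; reusing the same hypothetical clip and db objects from above:
# Embed the query image instead of a text string; the index stays the same
query_vec = clip.encode_image(query_image)
results = db.knn(query_vec, k=10)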
Multilingual considerations
Original CLIP is English-only. For multilingual:
- Multilingual-CLIP, M-CLIP: multilingual extensions.
- OpenCLIP with multilingual training data.
- Voyage / Cohere: multilingual by design.
For non-English use cases, pick a multilingual-trained model — English-only models are surprisingly bad at non-English captions.
Beyond image+text
The CLIP idea generalizes to any pair of modalities:
- CLAP: contrastive language-audio (sounds + text).
- VideoCLIP: contrastive language-video.
- ImageBind (Meta): a single embedding space for image, text, audio, depth, thermal, IMU.
Same recipe; broader scope.
Limitations of CLIP-style embeddings
- Compositional weakness: CLIP doesn’t reliably understand “a red square on top of a blue circle”; its representations behave like a bag of concepts.
- Fine-grained struggles: distinguishing similar bird species or product variants.
- Text in images: CLIP often ignores the literal text in images.
- Counting: poor at “five apples” vs “three apples.”
For these gaps, modern VLMs (next article) do better — they actually parse the image.
Practical use
For most teams in 2026:
| Use case | Pick |
|---|---|
| Production image search | Voyage Multimodal-3 / Cohere Embed v4 |
| Open-weights | OpenCLIP ViT-L or SigLIP |
| Multilingual | Voyage / Cohere multilingual / mCLIP |
| Pre-cooked features for downstream models | OpenCLIP |
| Audio + text | CLAP |
| Many modalities at once | ImageBind |
Don’t train CLIP from scratch unless you have a very specific reason. Domain-adapt an existing one with contrastive fine-tuning on your data.
Domain adaptation
If a generic multimodal embedder underperforms:
- Build pairs from your domain (image + caption, product photo + description).
- Fine-tune with contrastive loss using open_clip or sentence-transformers.
- Validate on a domain eval set.
Even a few thousand pairs can give a meaningful boost.
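As a rough sketch of what that loop can look like with open_clip, reusing the clip_contrastive_loss helper sketched earlier: the model choice, learning rate, and dataloader shape (batches of preprocessed image tensors plus their captions) are assumptions to adapt, not a prescribed recipe.
import torch
import open_clip

def finetune_clip(dataloader, epochs=1, lr=1e-5):
    # dataloader yields (image_batch, caption_list) pairs; images are assumed
    # already transformed with the preprocess function returned below
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, captions in dataloader:
            image_embeds = model.encode_image(images)
            text_embeds = model.encode_text(tokenizer(captions))
            # Symmetric contrastive loss, as sketched earlier in this article
            loss = clip_contrastive_loss(image_embeds, text_embeds)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model, preprocess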