Multimodal Embeddings (CLIP)

Map images and text into the same vector space. Then “find images that match this caption” is just nearest-neighbor search. This unlocks cross-modal retrieval, classification, and a lot more.

CLIP

OpenAI’s CLIP (Contrastive Language-Image Pre-training, Radford et al. 2021) was the breakthrough. The architecture:

Text encoder (transformer) → text vector
Image encoder (ViT or CNN) → image vector

Both vectors land in the same embedding space.

Trained on 400M image-text pairs from the web with a contrastive loss:

  • For each image and its true caption, pull them together.
  • For each image and other captions in the batch, push apart.
  • Symmetric: the same loss in both directions.

The result: a shared embedding space where related images and texts are close.
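
In code, the symmetric objective is small. A minimal PyTorch sketch, with a fixed temperature for illustration (the real model learns it):

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so dot products are cosine similarities
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # logits[i, j] = similarity between image i and caption j
    logits = image_embeds @ text_embeds.T / temperature

    # The true caption for image i sits on the diagonal
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_img_to_text = F.cross_entropy(logits, targets)    # images → captions
    loss_text_to_img = F.cross_entropy(logits.T, targets)  # captions → images
    return (loss_img_to_text + loss_text_to_img) / 2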

Why it works

  • The web supplies free supervision. Every alt-text is a (caption, image) pair.
  • Contrastive learning makes the geometry useful — similar things cluster.
  • Both encoders learn rich representations.

CLIP’s image encoder, used alone, became a default visual feature extractor for hundreds of downstream tasks.

Zero-shot classification

Famously, CLIP does zero-shot classification:

labels = ["dog", "cat", "bird"]
text_embeds = clip.encode_text([f"a photo of a {l}" for l in labels])
image_embed = clip.encode_image(image)

# Cosine similarity against every candidate caption
# (assumes the embeddings are L2-normalized)
scores = image_embed @ text_embeds.T
predicted = labels[scores.argmax()]

No labeled training data needed. Add or remove classes by changing the text prompts.
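
For a runnable version, the Hugging Face transformers CLIP wrapper handles tokenization, preprocessing, and scoring in one pass. A sketch, with a placeholder image path:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["dog", "cat", "bird"]
image = Image.open("photo.jpg")  # placeholder path

inputs = processor(text=[f"a photo of a {l}" for l in labels],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# One similarity score per candidate caption
probs = outputs.logits_per_image.softmax(dim=-1)
predicted = labels[probs.argmax().item()]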

This was a wake-up call for the field: pretraining on broad web data gave us features that transfer to a wide range of visual tasks with no fine-tuning.

Successor models

SigLIP (Google, 2023)

Replaces CLIP’s softmax-based contrastive loss with a sigmoid loss that doesn’t normalize over the batch. Trains better at scale; handles smaller batches gracefully.
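
A sketch of the difference, with the temperature and bias fixed for illustration (SigLIP learns both):

import torch
import torch.nn.functional as F

def sigmoid_loss(image_embeds, text_embeds, t=10.0, b=-10.0):
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    logits = img @ txt.T * t + b

    # Every image-text pair is an independent binary problem:
    # +1 on the diagonal (true pair), -1 everywhere else.
    labels = 2 * torch.eye(len(logits), device=logits.device) - 1

    # No softmax over the batch, so one pair's loss doesn't depend
    # on what else happens to be in the batch.
    return -F.logsigmoid(labels * logits).mean()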

EVA-CLIP, OpenCLIP

Open replications and improvements over CLIP. OpenCLIP is the de facto open-source default; EVA-CLIP scales to billions of parameters with strong results.

CoCa (Contrastive Captioners)

Adds a captioning head — produces both contrastive embeddings and generated captions from one model.

BLIP-2 / InstructBLIP

Bridge a frozen vision encoder to a frozen LLM via a small “Q-Former” module. The lineage that led to modern VLMs.

Cohere Embed v4 / Voyage Multimodal-3 / Jina-CLIP-v2

Production multimodal embedders that handle text and images, and in some cases additional modalities, in a single model.

LLM-as-multimodal-encoder

Increasingly, a multimodal LLM (Claude, GPT-4o, Qwen-VL) is used as an embedder by extracting hidden states. Quality often beats CLIP-family models, at higher cost.

Cross-modal retrieval

Index images with their CLIP embeddings; query with text embeddings of the same model.

# Indexing
for img in images:
    db.upsert(id=img.id, vector=clip.encode_image(img))

# Query
query_vec = clip.encode_text("a sunset over mountains")
results = db.knn(query_vec, k=10)

Works for:

  • Photo search by description.
  • Stock-photo discovery.
  • Visual product search.
  • Image-to-image search (embed the query image instead of a text query; see the sketch below).
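
For small collections you don't need a vector database at all. A brute-force sketch, reusing the hypothetical clip encoder and images list from the snippet above:

import numpy as np

# Index: one L2-normalized row per image
index = np.stack([clip.encode_image(img) for img in images])
index = index / np.linalg.norm(index, axis=1, keepdims=True)

# Text query; swap in clip.encode_image(query_img) for image-to-image search
q = clip.encode_text("a sunset over mountains")
q = q / np.linalg.norm(q)

top_k = np.argsort(index @ q)[::-1][:10]  # indices of the 10 most similar images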

Multilingual considerations

Original CLIP is English-only. For multilingual:

  • Multilingual-CLIP, M-CLIP: multilingual extensions.
  • OpenCLIP with multilingual training data.
  • Voyage / Cohere: multilingual by design.

For non-English use cases, pick a multilingual-trained model — English-only models are surprisingly bad at non-English captions.

Beyond image+text

The CLIP idea generalizes to any pair of modalities:

  • CLAP: contrastive language-audio (sounds + text).
  • VideoCLIP: contrastive language-video.
  • ImageBind (Meta): a single embedding space for image, text, audio, depth, thermal, IMU.

Same recipe; broader scope.

Limitations of CLIP-style embeddings

  • Compositional weakness: CLIP doesn’t reliably distinguish “a red square on top of a blue circle” from the reverse; its representations behave like a bag of concepts.
  • Fine-grained struggles: distinguishing similar bird species or product variants.
  • Text in images: CLIP often ignores the literal text in images.
  • Counting: poor at “five apples” vs “three apples.”

For these gaps, modern VLMs (next article) do better — they actually parse the image.

Practical use

For most teams in 2026:

  • Production image search: Voyage Multimodal-3 / Cohere Embed v4
  • Open-weights: OpenCLIP ViT-L or SigLIP
  • Multilingual: Voyage / Cohere multilingual / mCLIP
  • Pre-cooked features for downstream models: OpenCLIP
  • Audio + text: CLAP
  • Many modalities at once: ImageBind

Don’t train CLIP from scratch unless you have a very specific reason. Domain-adapt an existing one with contrastive fine-tuning on your data.

Domain adaptation

If a generic multimodal embedder underperforms:

  1. Build pairs from your domain (image + caption, product photo + description).
  2. Fine-tune with contrastive loss using open_clip or sentence-transformers.
  3. Validate on a domain eval set.

Even a few thousand pairs can give a meaningful boost.
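
A minimal sketch of step 2 with open_clip, assuming a train_loader that yields batches of preprocessed image tensors and raw caption strings (learning-rate schedule, augmentation, and checkpointing omitted):

import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for images, captions in train_loader:  # your (image, caption) pairs; apply preprocess in the loader
    img = F.normalize(model.encode_image(images), dim=-1)
    txt = F.normalize(model.encode_text(tokenizer(captions)), dim=-1)

    # Same symmetric contrastive loss CLIP was pretrained with
    logits = img @ txt.T * model.logit_scale.exp()
    targets = torch.arange(len(logits), device=logits.device)
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()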

See also