One embedding space, two modalities.
CLIP learns a single shared embedding space (512-dimensional in the ViT-B/32 variant) where an image and the text that describes it land near each other. Once you have that, you get zero-shot classification, reverse image search, and the conditioning backbone of every modern image generator — for free.
The math
# classification: image → best text label
img_emb = vision_encoder(image)          # 512-d, L2-normalized
label_embs = text_encoder(["a photo of X", "a photo of Y", ...])
scores = img_emb @ label_embs.T          # dot products of unit vectors = cosine similarities
probs = softmax(scores * 100)            # 100 ≈ CLIP's learned logit scale

# retrieval: text → best image (just swap directions)
caption_emb = text_encoder(query)
image_embs = vision_encoder(images)
scores = caption_emb @ image_embs.T

Try this — predict before you click
- Pick an image (e.g., a cat photo). Look at the ranked label list. Predict: the correct label has > 50% softmax probability; runners-up are semantically related (e.g., "kitten" or "small mammal" cluster behind "cat"). CLIP's zero-shot accuracy on ImageNet is ~76% — surprisingly good for never seeing the labels at training time.
- Try the rocket image with labels "rocket" vs "bottle" vs "tower". Predict: rocket wins decisively. Now imagine adding a label "an apple" — it scores almost zero because the image's visual features aren't apple-shaped.
- Switch to caption → image direction. Try "a happy dog". Predict: top images cluster around dogs in playful poses, even though "happy" was never explicitly trained as a label. CLIP picked up affective associations from caption training data.
- CLIP has a known failure mode: typographic attacks. An image of an apple with the word "iPod" written on it gets classified as "iPod" with high confidence. Predict: if a sample image had visible text, label rankings could shift dramatically based on the text content. This is why production vision models add OCR-aware reranking on top of CLIP.
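The scoring math above can be checked without downloading a model. This is a minimal NumPy sketch with random stand-in embeddings (real ones come from CLIP's vision and text encoders); one "label" vector is deliberately nudged toward the "image" vector so it should win, and the logit scale of 100 pushes the softmax toward near-certainty:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # project onto the unit sphere, as CLIP does before scoring
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 512  # embedding width of CLIP ViT-B/32

# stand-in embeddings; random 512-d unit vectors are nearly orthogonal
img_emb = l2_normalize(rng.normal(size=d))
label_embs = l2_normalize(rng.normal(size=(3, d)))
# nudge label 0 toward the image so it plays the "correct label" role
label_embs[0] = l2_normalize(0.9 * img_emb + 0.1 * label_embs[0])

scores = img_emb @ label_embs.T   # cosine similarities in [-1, 1]
probs = softmax(scores * 100)     # logit scale 100 sharpens the distribution
print(probs)                      # label 0 should win by a wide margin
```

Note how much the logit scale matters: raw cosine similarities span a narrow range, so without the ×100 the softmax over them would be nearly uniform.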