One embedding space, two modalities.
CLIP learns a single shared embedding space (512-dimensional in the ViT-B/32 variant) where an image and the text that describes it land near each other. Once you have that, you get zero-shot classification, reverse image search, and the conditioning backbone of every modern image generator — for free.
The math
# classification: image → best text label
img_emb = vision_encoder(image)          # 512-d, L2-normalized
label_embs = text_encoder(["a photo of X", "a photo of Y", ...])
scores = img_emb @ label_embs.T          # dot products of unit vectors = cosine similarities
probs = softmax(scores * 100)            # 100 ≈ CLIP's learned logit scale

# retrieval: text → best image (just swap directions)
caption_emb = text_encoder(query)
image_embs = vision_encoder(images)
scores = caption_emb @ image_embs.T

Try this — predict before you click
- Pick an image (e.g., a cat photo). Look at the ranked label list. Predict: the correct label has > 50% softmax probability; runners-up are semantically related (e.g., "kitten" or "small mammal" cluster behind "cat"). CLIP's zero-shot accuracy on ImageNet is ~76% — surprisingly good for never seeing the labels at training time.
- Try the rocket image with labels "rocket" vs "bottle" vs "tower". Predict: rocket wins decisively. Now imagine adding a label "an apple" — it scores almost zero because the image's visual features aren't apple-shaped.
- Switch to caption → image direction. Try "a happy dog". Predict: top images cluster around dogs in playful poses, even though "happy" was never explicitly trained as a label. CLIP picked up affective associations from caption training data.
- CLIP has a known failure mode: typographic attacks. An image of an apple with the word "iPod" written on it gets classified as "iPod" with high confidence. Predict: if a sample image had visible text, label rankings could shift dramatically based on the text content. This is why production vision models add OCR-aware reranking on top of CLIP.
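The scoring math above can be checked without downloading a model. This is a minimal NumPy sketch with random stand-in embeddings (real ones come from CLIP's vision and text encoders); one "label" vector is deliberately nudged toward the "image" vector so it should win, and the logit scale of 100 pushes the softmax toward near-certainty:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # project onto the unit sphere, as CLIP does before scoring
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 512  # embedding width of CLIP ViT-B/32

# stand-in embeddings; random 512-d unit vectors are nearly orthogonal
img_emb = l2_normalize(rng.normal(size=d))
label_embs = l2_normalize(rng.normal(size=(3, d)))
# nudge label 0 toward the image so it plays the "correct label" role
label_embs[0] = l2_normalize(0.9 * img_emb + 0.1 * label_embs[0])

scores = img_emb @ label_embs.T   # cosine similarities in [-1, 1]
probs = softmax(scores * 100)     # logit scale 100 sharpens the distribution
print(probs)                      # label 0 should win by a wide margin
```

Note how much the logit scale matters: raw cosine similarities span a narrow range, so without the ×100 the softmax over them would be nearly uniform.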