Vision-Language Models (VLMs)
A vision-language model takes images and text as input and produces text as output. It can describe images, answer questions about them, read documents, control browsers — anything an LLM can do, conditioned on visual context.
By 2026, every frontier model has vision built in.
Architecture
The dominant pattern (LLaVA-style):
```
Image
  ↓ Vision encoder (e.g. CLIP ViT-L)
  ↓ Projector (linear or MLP) into LLM token space
  ↓
LLM ← concatenated with text tokens
  ↓
Text output
```
Three components:
- Vision encoder: turns an image into a sequence of visual tokens (one per patch).
- Projector: maps visual tokens into the LLM’s embedding space.
- LLM: processes the combined sequence, generates text.
Variants differ in:
- Encoder: ViT, ConvNeXt, custom pretrained.
- Projector: linear, MLP, Q-Former (a small transformer with learned queries that “summarize” the image).
- LLM: any decoder; often LLaMA, Qwen, Mistral.
- Native multimodal training: some models (GPT-4o, Gemini) train on multimodal data from scratch rather than bolting vision onto a text-only LLM.
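To make the wiring concrete, here is a minimal PyTorch sketch of the LLaVA-style forward pass. `vision_encoder` and `llm` stand in for real pretrained modules, the MLP projector follows the LLaVA-1.5 recipe, and the dimensions assume CLIP ViT-L feeding a LLaMA-class LLM; this is an illustration, not any specific model's code.

```python
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    """Sketch of the encoder -> projector -> LLM pipeline."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a frozen CLIP ViT-L
        self.projector = nn.Sequential(       # two-layer MLP projector
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                        # any decoder that accepts input embeddings

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vision_encoder(pixel_values)      # (B, num_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)          # (B, num_patches, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeds], 1)  # image tokens first, then text
        return self.llm(inputs_embeds=inputs)                # HF-style call, assumed interface
```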
Frontier VLMs (early 2026)
Closed
- Claude Sonnet 4.6 / Opus 4.6 / 4.7: strong document understanding, charts, layouts.
- GPT-4o / 4.1 / 4.5: native multimodal, image + audio + text.
- Gemini 2.x / 3.x: very long context, strong on video.
- Mistral Pixtral (closed variants).
Open
- Qwen2.5-VL / Qwen3-VL: state-of-the-art open VLMs.
- LLaVA-OneVision / LLaVA-NeXT: research-grade, well-supported.
- InternVL 2/3: strong open-source competitor.
- Pixtral 12B / Large (Mistral): open variants.
- Llama-3.2 Vision: Meta’s first vision-enabled Llamas.
- Phi-4-Vision: small (~5B) but strong.
- MolmoE, Idefics3, etc.: research models.
The open-vs-closed gap on VLM tasks is small in 2026 — open models match closed for most use cases on standard benchmarks.
What VLMs are good at
- Image captioning: describe what’s in an image.
- Visual question answering (VQA): “What color is the car?”
- Document understanding: read PDFs, parse tables, follow layout cues.
- OCR: extract text, even from photos and screenshots.
- Chart and diagram interpretation: read bar charts, flow diagrams.
- UI understanding: parse screenshots of apps; useful for browser/desktop agents.
- Visual reasoning: multi-step questions about visual content.
- Code from screenshots: regenerate UI mock-ups as code.
What VLMs still struggle with
- Counting: precise counts of small objects.
- Spatial reasoning: “to the left of,” “behind,” precise relations.
- Fine-grained recognition: distinguishing similar species, brands, models.
- Very small text: low-resolution OCR.
- Counterfactual reasoning over images: “if the chair were red, would the room look balanced?”
- Long-form video: most VLMs struggle past ~1 minute of video.
These gaps are closing rapidly: the 2024 models were much worse than their 2026 successors.
Practical patterns
Image input
For most APIs:
```python
import base64

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# Anthropic: image blocks carry base64 data plus an explicit media type
content = [
    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": b64}},
    {"type": "text", "text": "Describe this image."},
]

# OpenAI: images are referenced by URL (a data: URL works for local files)
data_url = f"data:image/jpeg;base64,{b64}"
content = [
    {"type": "image_url", "image_url": {"url": data_url}},
    {"type": "text", "text": "Describe this image."},
]
```
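Wrapped in an actual request (Anthropic SDK shown; the OpenAI client is analogous, and the model name here is illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    messages=[{"role": "user", "content": content}],
)
print(response.content[0].text)
```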
Resolution matters: high-res images cost more tokens and reveal more detail. Provider-specific token-counting rules apply.
Multi-image
Most APIs accept multiple images per turn. Useful for:
- Comparing photos.
- Multi-page documents.
- Before/after visual analysis.
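A sketch of a two-image turn using Anthropic-style content blocks (the OpenAI shape is analogous; `before_b64` and `after_b64` are base64-encoded screenshots):

```python
content = [
    {"type": "text", "text": "Compare these before/after screenshots."},
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": before_b64}},
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": after_b64}},
    {"type": "text", "text": "List every UI element that changed."},
]
```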
Image + tool use
Combine vision with tools for genuinely new capabilities:
- Read a screenshot → click on the right button.
- Analyze a chart → call a calculator with the numbers.
- Inspect a UI → write code that replicates it.
This is the foundation of browser/desktop agents (Stage 11).
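A minimal sketch of the screenshot-to-click pattern using Anthropic's tool-use format; the `click` tool and the coordinates in the comment are hypothetical:

```python
tools = [{
    "name": "click",
    "description": "Click at pixel coordinates (x, y) in the screenshot.",
    "input_schema": {
        "type": "object",
        "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
        "required": ["x", "y"],
    },
}]

messages = [{"role": "user", "content": [
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": screenshot_b64}},
    {"type": "text", "text": "Click the 'Submit' button."},
]}]
# A successful response contains a tool_use block such as
# {"name": "click", "input": {"x": 412, "y": 630}}, which your agent executes.
```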
Document AI
A specialty: extracting structured info from PDFs, forms, invoices, contracts.
Pipeline options:
Direct VLM
Pass the document image to a VLM; ask for structured extraction.
Pros: simple, captures layout cues. Cons: cost per page, limited to model’s context window.
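A sketch of the direct approach: ask for strict JSON so the output is machine-parseable. The invoice schema here is illustrative, not a standard:

```python
prompt = """Extract the following from this invoice and return JSON only:
{"invoice_number": str, "date": "YYYY-MM-DD", "vendor": str,
 "line_items": [{"description": str, "quantity": int, "unit_price": float}],
 "total": float}"""

content = [
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": invoice_b64}},
    {"type": "text", "text": prompt},
]
```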
Layout-aware parsing → text
Tools: Unstructured, LlamaParse, AWS Textract, Azure Document Intelligence, Mathpix. Extract a structured representation (text + layout + tables); pass to an LLM.
Pros: cheap, scalable. Cons: imperfect parsing, especially for complex layouts.
Hybrid
Layout-aware parser for general extraction; VLM for tricky cases (handwriting, complex tables, charts).
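A routing sketch for the hybrid pipeline; `parse_layout`, `needs_vlm`, and `ask_vlm` are stand-ins for whatever parser, confidence check, and VLM wrapper you actually use:

```python
def extract(page_image, parse_layout, needs_vlm, ask_vlm):
    """Cheap layout parser first; fall back to a VLM only for hard pages."""
    parsed = parse_layout(page_image)   # e.g. Textract / Unstructured output
    if needs_vlm(parsed):               # low confidence, handwriting, charts...
        return ask_vlm(page_image, "Extract all fields as JSON.")
    return parsed
```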
Visual search and image retrieval
For “find this kind of thing”:
- CLIP-style embeddings (previous article): fast cross-modal similarity.
- VLM-as-encoder: sometimes higher quality, more expensive.
For “tell me about this thing”:
- VLM Q&A: ask about specific properties.
Combine: retrieve via embeddings, then VLM-question on the top result.
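A sketch of that combination, assuming you already have unit-normalized CLIP-style image embeddings and some VLM wrapper (`embed_text` and `ask_vlm` are stand-ins):

```python
import numpy as np

def search_then_ask(query, image_embeds, images, embed_text, ask_vlm):
    """Retrieve the best match by embedding similarity, then question it with a VLM."""
    q = embed_text(query)      # (d,), unit-normalized text embedding
    scores = image_embeds @ q  # cosine similarity against (N, d) image embeddings
    best = int(np.argmax(scores))
    return ask_vlm(images[best], f"Regarding '{query}': describe this item in detail.")
```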
Fine-tuning VLMs
Same patterns as Stage 10:
- LoRA on the LLM portion (most common).
- Optionally LoRA on the vision encoder (rarely needed).
- The projector usually doesn’t need separate fine-tuning.
Use cases: domain-specific OCR, branded UI understanding, custom labeling tasks.
Tools: TRL supports VLMs (SFTTrainer with image inputs); LLaVA’s training scripts; modern Axolotl variants.
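For the common case, a minimal PEFT configuration that puts LoRA on the language model's attention projections only; the module names assume a LLaMA-style backbone and will differ per architecture:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,  # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLM attention only
    task_type="CAUSAL_LM",
)
```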
Evaluation
Benchmarks:
- MMMU: college-exam-level multimodal questions.
- MathVista: visual math reasoning.
- DocVQA: document QA.
- ChartQA: chart understanding.
- VQAv2: classic VQA.
- VStar: small-detail / fine-grained tasks.
For your specific domain: build a labeled set of (image, question, answer) triples. Standard benchmark scores tell you the model’s general capabilities, not its fit for your task.
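A tiny harness for such a set; exact-match grading is a placeholder you would swap for a real grader, and `model_answer` wraps your VLM call:

```python
def evaluate(model_answer, cases):
    """cases: iterable of (image_path, question, expected_answer) triples."""
    correct = 0
    for image_path, question, expected in cases:
        pred = model_answer(image_path, question)
        correct += pred.strip().lower() == expected.strip().lower()
    return correct / len(cases)
```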
Cost & latency
VLMs are more expensive per call than text-only models:
- Image tokens: 100s to 1000s per image, depending on resolution.
- Vision encoder forward pass: significant compute.
- Latency typically 2–5× a comparable text-only call.
Practical:
- Resize images to the smallest resolution that captures what you need.
- Cache image-derived results (descriptions, extractions) when reusable.
- Batch when possible.
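A resizing helper sketch with Pillow; the 1024-pixel cap is an assumption to tune per task and provider:

```python
from PIL import Image

def shrink(path, max_side=1024, out="resized.jpg"):
    """Downscale so the longest side is at most max_side; fewer image tokens."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # preserves aspect ratio, never upscales
    img.save(out, quality=85)
    return out
```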
Pitfalls
- Trusting the VLM to count: verify with code if precision matters (see the sketch after this list).
- PII in images: faces, ID cards, screenshots may contain sensitive info. Detect and redact.
- Misleading layouts: a chart with mislabeled axes will be misread.
- Subtle text: text rendered as part of an image (memes, screenshots) may need higher resolution.
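A sketch of the counting workaround from the first pitfall: have the model enumerate objects as structured JSON, then count in code. `ask_vlm` is a stand-in for your VLM call:

```python
import json

raw = ask_vlm(image, 'List every person as JSON: {"people": [{"x": int, "y": int}, ...]}')
count = len(json.loads(raw)["people"])  # count in code, not in the model's prose
```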