Vision-Language Models (VLMs)
A vision-language model takes images and text as input and produces text as output. It can describe images, answer questions about them, read documents, control browsers — anything an LLM can do, conditioned on visual context.
By 2026, every frontier model has vision built in.
Architecture
The dominant pattern (LLaVA-style):
```
Image
  ↓ Vision encoder (e.g. CLIP ViT-L)
  ↓ Projector (linear or MLP) into LLM token space
  ↓
LLM ← concatenated with text tokens
  ↓
Text output
```
Three components:
- Vision encoder: turns an image into a sequence of visual tokens (one per patch).
- Projector: maps visual tokens into the LLM’s embedding space.
- LLM: processes the combined sequence, generates text.
Variants differ in:
- Encoder: ViT, ConvNeXt, custom pretrained.
- Projector: linear, MLP, Q-Former (a small transformer with learned queries that “summarize” the image).
- LLM: any decoder; often LLaMA, Qwen, Mistral.
- Native multimodal training: some models (GPT-4o, Gemini) train on multimodal data from scratch rather than bolting vision onto a text-only LLM.
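To make the wiring concrete, here is a minimal PyTorch sketch of the LLaVA-style forward pass. `vision_encoder` and `llm` stand in for real pretrained modules, the MLP projector follows the LLaVA-1.5 recipe, and the dimensions assume CLIP ViT-L feeding a LLaMA-class LLM; this is an illustration, not any specific model's code.

```python
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    """Sketch of the encoder -> projector -> LLM pipeline."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a frozen CLIP ViT-L
        self.projector = nn.Sequential(       # two-layer MLP projector
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                        # any decoder that accepts input embeddings

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vision_encoder(pixel_values)      # (B, num_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)          # (B, num_patches, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeds], 1)  # image tokens first, then text
        return self.llm(inputs_embeds=inputs)                # HF-style call, assumed interface
```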
Frontier VLMs (early 2026)
Closed
- Claude Sonnet 4.6 / Opus 4.6 / 4.7: strong document understanding, charts, layouts.
- GPT-4o / 4.1 / 4.5: native multimodal, image + audio + text.
- Gemini 2.x / 3.x: very long context, strong on video.
- Mistral Pixtral (closed variants).
Open
- Qwen2.5-VL / Qwen3-VL: state-of-the-art open VLMs.
- LLaVA-OneVision / LLaVA-NeXT: research-grade, well-supported.
- InternVL 2/3: strong open-source competitor.
- Pixtral 12B / Large (Mistral): open variants.
- Llama-3.2 Vision: Meta’s first vision-enabled Llamas.
- Phi-4-Vision: small (~5B) but strong.
- MolmoE, Idefics3, etc.: research models.
The open-vs-closed gap on VLM tasks is small in 2026 — open models match closed for most use cases on standard benchmarks.
What VLMs are good at
- Image captioning: describe what’s in an image.
- Visual question answering (VQA): “What color is the car?”
- Document understanding: read PDFs, parse tables, follow layout cues.
- OCR: extract text, even from photos and screenshots.
- Chart and diagram interpretation: read bar charts, flow diagrams.
- UI understanding: parse screenshots of apps; useful for browser/desktop agents.
- Visual reasoning: multi-step questions about visual content.
- Code from screenshots: regenerate UI mock-ups as code.
What VLMs still struggle with
- Counting: precise counts of small objects.
- Spatial reasoning: “to the left of,” “behind,” precise relations.
- Fine-grained recognition: distinguishing similar species, brands, models.
- Very small text: low-resolution OCR.
- Counterfactual reasoning over images: “if the chair were red, would the room look balanced?”
- Long-form video: most VLMs struggle past ~1 minute of video.
These gaps are closing rapidly: the 2024 models were much worse than their 2026 successors.
Practical patterns
Image input
For most APIs:
```python
import base64

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# Anthropic: image blocks carry base64 data plus an explicit media type
content = [
    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": b64}},
    {"type": "text", "text": "Describe this image."},
]

# OpenAI: images are referenced by URL (a data: URL works for local files)
data_url = f"data:image/jpeg;base64,{b64}"
content = [
    {"type": "image_url", "image_url": {"url": data_url}},
    {"type": "text", "text": "Describe this image."},
]
```
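Wrapped in an actual request (Anthropic SDK shown; the OpenAI client is analogous, and the model name here is illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    messages=[{"role": "user", "content": content}],
)
print(response.content[0].text)
```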
Resolution matters: high-res images cost more tokens and reveal more detail. Provider-specific token-counting rules apply.
Multi-image
Most APIs accept multiple images per turn. Useful for:
- Comparing photos.
- Multi-page documents.
- Before/after visual analysis.
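A sketch of a two-image turn using Anthropic-style content blocks (the OpenAI shape is analogous; `before_b64` and `after_b64` are base64-encoded screenshots):

```python
content = [
    {"type": "text", "text": "Compare these before/after screenshots."},
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": before_b64}},
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": after_b64}},
    {"type": "text", "text": "List every UI element that changed."},
]
```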
Image + tool use
Combine vision with tools for genuinely new capabilities:
- Read a screenshot → click on the right button.
- Analyze a chart → call a calculator with the numbers.
- Inspect a UI → write code that replicates it.
This is the foundation of browser/desktop agents (Stage 11).
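A minimal sketch of the screenshot-to-click pattern using Anthropic's tool-use format; the `click` tool and the coordinates in the comment are hypothetical:

```python
tools = [{
    "name": "click",
    "description": "Click at pixel coordinates (x, y) in the screenshot.",
    "input_schema": {
        "type": "object",
        "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
        "required": ["x", "y"],
    },
}]

messages = [{"role": "user", "content": [
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": screenshot_b64}},
    {"type": "text", "text": "Click the 'Submit' button."},
]}]
# A successful response contains a tool_use block such as
# {"name": "click", "input": {"x": 412, "y": 630}}, which your agent executes.
```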
Document AI
A specialty: extracting structured info from PDFs, forms, invoices, contracts.
Pipeline options:
Direct VLM
Pass the document image to a VLM; ask for structured extraction.
Pros: simple, captures layout cues. Cons: cost per page, limited to model’s context window.
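A sketch of the direct approach: ask for strict JSON so the output is machine-parseable. The invoice schema here is illustrative, not a standard:

```python
prompt = """Extract the following from this invoice and return JSON only:
{"invoice_number": str, "date": "YYYY-MM-DD", "vendor": str,
 "line_items": [{"description": str, "quantity": int, "unit_price": float}],
 "total": float}"""

content = [
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": invoice_b64}},
    {"type": "text", "text": prompt},
]
```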
Layout-aware parsing → text
Tools: Unstructured, LlamaParse, AWS Textract, Azure Document Intelligence, Mathpix. Extract a structured representation (text + layout + tables); pass to an LLM.
Pros: cheap, scalable. Cons: imperfect parsing, especially for complex layouts.
Hybrid
Layout-aware parser for general extraction; VLM for tricky cases (handwriting, complex tables, charts).
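A routing sketch for the hybrid pipeline; `parse_layout`, `needs_vlm`, and `ask_vlm` are stand-ins for whatever parser, confidence check, and VLM wrapper you actually use:

```python
def extract(page_image, parse_layout, needs_vlm, ask_vlm):
    """Cheap layout parser first; fall back to a VLM only for hard pages."""
    parsed = parse_layout(page_image)   # e.g. Textract / Unstructured output
    if needs_vlm(parsed):               # low confidence, handwriting, charts...
        return ask_vlm(page_image, "Extract all fields as JSON.")
    return parsed
```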
Visual search and image retrieval
For “find this kind of thing”:
- CLIP-style embeddings (previous article): fast cross-modal similarity.
- VLM-as-encoder: sometimes higher quality, more expensive.
For “tell me about this thing”:
- VLM Q&A: ask about specific properties.
Combine: retrieve via embeddings, then VLM-question on the top result.
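A sketch of that combination, assuming you already have unit-normalized CLIP-style image embeddings and some VLM wrapper (`embed_text` and `ask_vlm` are stand-ins):

```python
import numpy as np

def search_then_ask(query, image_embeds, images, embed_text, ask_vlm):
    """Retrieve the best match by embedding similarity, then question it with a VLM."""
    q = embed_text(query)      # (d,), unit-normalized text embedding
    scores = image_embeds @ q  # cosine similarity against (N, d) image embeddings
    best = int(np.argmax(scores))
    return ask_vlm(images[best], f"Regarding '{query}': describe this item in detail.")
```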
Fine-tuning VLMs
Same patterns as Stage 10:
- LoRA on the LLM portion (most common).
- Optionally LoRA on the vision encoder (rarely needed).
- The projector usually doesn’t need separate fine-tuning.
Use cases: domain-specific OCR, branded UI understanding, custom labeling tasks.
Tools: TRL supports VLMs (SFTTrainer with image inputs); LLaVA’s training scripts; modern Axolotl variants.
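For the common case, a minimal PEFT configuration that puts LoRA on the language model's attention projections only; the module names assume a LLaMA-style backbone and will differ per architecture:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,  # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLM attention only
    task_type="CAUSAL_LM",
)
```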
Evaluation
Benchmarks:
- MMMU: college-exam-level multimodal questions.
- MathVista: visual math reasoning.
- DocVQA: document QA.
- ChartQA: chart understanding.
- VQAv2: classic VQA.
- VStar: small-detail / fine-grained tasks.
For your specific domain: build a labeled set of (image, question, answer) triples. Standard benchmark scores tell you the model’s general capabilities, not its fit for your task.
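A tiny harness for such a set; exact-match grading is a placeholder you would swap for a real grader, and `model_answer` wraps your VLM call:

```python
def evaluate(model_answer, cases):
    """cases: iterable of (image_path, question, expected_answer) triples."""
    correct = 0
    for image_path, question, expected in cases:
        pred = model_answer(image_path, question)
        correct += pred.strip().lower() == expected.strip().lower()
    return correct / len(cases)
```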
Cost & latency
VLMs are more expensive per call than text-only models:
- Image tokens: 100s to 1000s per image, depending on resolution.
- Vision encoder forward pass: significant compute.
- Latency typically 2–5× a comparable text-only call.
Practical:
- Resize images to the smallest resolution that captures what you need.
- Cache image-derived results (descriptions, extractions) when reusable.
- Batch when possible.
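A resizing helper sketch with Pillow; the 1024-pixel cap is an assumption to tune per task and provider:

```python
from PIL import Image

def shrink(path, max_side=1024, out="resized.jpg"):
    """Downscale so the longest side is at most max_side; fewer image tokens."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # preserves aspect ratio, never upscales
    img.save(out, quality=85)
    return out
```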
Pitfalls
- Trusting the VLM to count: verify with code if precision matters (see the sketch after this list).
- PII in images: faces, ID cards, screenshots may contain sensitive info. Detect and redact.
- Misleading layouts: a chart with mislabeled axes will be misread.
- Subtle text: text rendered as part of an image (memes, screenshots) may need higher resolution.
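A sketch of the counting workaround from the first pitfall: have the model enumerate objects as structured JSON, then count in code. `ask_vlm` is a stand-in for your VLM call:

```python
import json

raw = ask_vlm(image, 'List every person as JSON: {"people": [{"x": int, "y": int}, ...]}')
count = len(json.loads(raw)["people"])  # count in code, not in the model's prose
```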