Vision-Language Models (VLMs)

A vision-language model takes images and text as input and produces text as output. It can describe images, answer questions about them, read documents, control browsers — anything an LLM can do, conditioned on visual context.

By 2026, every frontier model has vision built in.

Architecture

The dominant pattern (LLaVA-style):

Image
  ↓ Vision encoder (e.g. CLIP ViT-L)
  ↓ Projector (linear or MLP) into LLM token space
  ↓ Concatenate with text tokens
LLM
  ↓
Text output

Three components:

  1. Vision encoder: turns an image into a sequence of visual tokens (one per patch).
  2. Projector: maps visual tokens into the LLM’s embedding space.
  3. LLM: processes the combined sequence, generates text.

Variants differ in:

  • Encoder: ViT, ConvNeXt, custom pretrained.
  • Projector: linear, MLP, Q-Former (a small transformer with learned queries that “summarize” the image).
  • LLM: any decoder; often LLaMA, Qwen, Mistral.
  • Native multimodal training: some models (GPT-4o, Gemini) train on multimodal data from scratch rather than bolting vision onto a text-only LLM.
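
A minimal sketch of this wiring in PyTorch, assuming a frozen CLIP encoder and a Llama-style decoder from transformers. The model names, the single linear projector, and the bare concatenation are illustrative (real implementations splice the visual tokens in at an image placeholder inside the prompt):

import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor, AutoModelForCausalLM, AutoTokenizer

# Illustrative component choices; any ViT encoder / decoder LLM pair works.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Projector: map the encoder's hidden size into the LLM's embedding space.
projector = nn.Linear(vision.config.hidden_size, llm.config.hidden_size)

def answer(image, prompt):
    # 1. Vision encoder: one visual token per patch.
    pixels = image_processor(images=image, return_tensors="pt").pixel_values
    visual = vision(pixel_values=pixels).last_hidden_state      # (1, n_patches+1, d_vision)
    # 2. Projector: visual tokens now live in the LLM's token space.
    visual = projector(visual)                                   # (1, n_patches+1, d_llm)
    # 3. Concatenate with the embedded text prompt.
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    text = llm.get_input_embeddings()(ids)
    inputs_embeds = torch.cat([visual, text], dim=1)
    # 4. The LLM attends over visual + text tokens and generates text.
    return llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)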

Frontier VLMs (early 2026)

Closed

  • Claude Sonnet 4.6 / Opus 4.6 / 4.7: strong document understanding, charts, layouts.
  • GPT-4o / 4.1 / 4.5: native multimodal, image + audio + text.
  • Gemini 2.x / 3.x: very long context, strong on video.
  • Mistral Pixtral (closed variants).

Open

  • Qwen2.5-VL / Qwen3-VL: state-of-the-art open VLMs.
  • LLaVA-OneVision / LLaVA-NeXT: research-grade, well-supported.
  • InternVL 2/3: strong open-source competitor.
  • Pixtral 12B / Large (Mistral): open variants.
  • Llama-3.2 Vision: Meta’s first vision-enabled Llamas.
  • Phi-4-Vision: small (~5B) but strong.
  • MolmoE, Idefics3, etc.: research models.

The open-vs-closed gap on VLM tasks is small in 2026 — open models match closed for most use cases on standard benchmarks.

What VLMs are good at

  • Image captioning: describe what’s in an image.
  • Visual question answering (VQA): “What color is the car?”
  • Document understanding: read PDFs, parse tables, follow layout cues.
  • OCR: extract text, even from photos and screenshots.
  • Chart and diagram interpretation: read bar charts, flow diagrams.
  • UI understanding: parse screenshots of apps; useful for browser/desktop agents.
  • Visual reasoning: multi-step questions about visual content.
  • Code from screenshots: regenerate UI mock-ups as code.

What VLMs still struggle with

  • Counting: precise counts of small objects.
  • Spatial reasoning: “to the left of,” “behind,” precise relations.
  • Fine-grained recognition: distinguishing similar species, brands, models.
  • Very small text: low-resolution OCR.
  • Counterfactual reasoning over images: “if the chair were red, would the room look balanced?”
  • Long-form video: most VLMs struggle past ~1 minute of video.

These gaps are closing rapidly: 2024-era models were far worse on all of these than their 2026 counterparts.

Practical patterns

Image input

For most APIs:

import base64

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

# Anthropic: image blocks carry base64 data plus an explicit media type
content = [
    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": b64}},
    {"type": "text", "text": "Describe this image."},
]

# OpenAI: image blocks take a URL, including data: URLs for inline images
data_url = f"data:image/jpeg;base64,{b64}"
content = [
    {"type": "image_url", "image_url": {"url": data_url}},
    {"type": "text", "text": "Describe this image."},
]

Resolution matters: high-res images cost more tokens and reveal more detail. Provider-specific token-counting rules apply.

Multi-image

Most APIs accept multiple images per turn. Useful for:

  • Comparing photos.
  • Multi-page documents.
  • Before/after visual analysis.
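
A sketch of a two-image comparison turn in the Anthropic content format shown above; the helper and question are illustrative:

import base64

def image_block(path, media_type="image/jpeg"):
    # Wrap a local file as an Anthropic-style base64 image block.
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}}

# Two images in one user turn; the model sees both and can compare them.
content = [
    image_block("before.jpg"),
    image_block("after.jpg"),
    {"type": "text", "text": "What changed between the first and the second photo?"},
]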

Image + tool use

Combine vision with tools for genuinely new capabilities:

  • Read a screenshot → click on the right button.
  • Analyze a chart → call a calculator with the numbers.
  • Inspect a UI → write code that replicates it.

This is the foundation of browser/desktop agents (Stage 11).
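
A sketch of the chart-to-calculator pattern using the Anthropic Messages API, reusing the image_block helper from the multi-image sketch; the tool schema and model name are illustrative:

import anthropic

client = anthropic.Anthropic()

# The model reads the chart, then calls the calculator tool with the numbers it extracted.
tools = [{
    "name": "calculator",
    "description": "Evaluate an arithmetic expression and return the result.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model name
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": [
        image_block("revenue_chart.png", media_type="image/png"),
        {"type": "text", "text": "Sum the quarterly revenue shown in this chart, using the calculator."},
    ]}],
)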

Document AI

A specialty: extracting structured info from PDFs, forms, invoices, contracts.

Pipeline options:

Direct VLM

Pass the document image to a VLM; ask for structured extraction.

Pros: simple, captures layout cues. Cons: higher cost per page; limited by the model’s context window.
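
A sketch of direct extraction with the OpenAI Chat Completions format from earlier; the model name, prompt wording, and field names are illustrative:

import base64
import json
from openai import OpenAI

client = OpenAI()
with open("invoice_page_1.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "Extract these fields from the invoice as JSON: invoice_number, issue_date, vendor_name, "
    "line_items (description, quantity, unit_price), total. Return only the JSON object."
)
resp = client.chat.completions.create(
    model="gpt-4o",   # illustrative model name
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": data_url}},
        {"type": "text", "text": prompt},
    ]}],
)
invoice = json.loads(resp.choices[0].message.content)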

Layout-aware parsing → text

Tools: Unstructured, LlamaParse, AWS Textract, Azure Document Intelligence, Mathpix. Extract a structured representation (text + layout + tables); pass to an LLM.

Pros: cheap, scalable. Cons: imperfect parsing, especially for complex layouts.

Hybrid

Layout-aware parser for general extraction; VLM for tricky cases (handwriting, complex tables, charts).
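
A sketch of the routing logic; parse_page and vlm_extract are hypothetical wrappers around a layout parser and a VLM call, not real library functions:

# Cheap layout parser first; fall back to the pricier VLM only for pages it can't handle.
# parse_page and vlm_extract are hypothetical wrappers, not real library calls.
def extract_document(pages):
    results = []
    for page in pages:
        parsed = parse_page(page)                  # Textract, Unstructured, LlamaParse, ...
        if parsed.confidence < 0.8 or parsed.has_handwriting or parsed.has_complex_tables:
            results.append(vlm_extract(page))      # VLM handles the tricky pages
        else:
            results.append(parsed.to_dict())
    return results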

Visual search and image retrieval

For “find this kind of thing”:

  • CLIP-style embeddings (previous article): fast cross-modal similarity.
  • VLM-as-encoder: sometimes higher quality, more expensive.

For “tell me about this thing”:

  • VLM Q&A: ask about specific properties.

Combine: retrieve via embeddings, then VLM-question on the top result.
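
A sketch of that combination using CLIP from transformers for retrieval; ask_vlm is a hypothetical wrapper around any of the VLM calls shown earlier:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    inputs = processor(images=[Image.open(p) for p in paths], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def search(query, paths, image_feats, top_k=3):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (image_feats @ q.T).squeeze(-1)       # cosine similarity per image
    return [paths[i] for i in scores.topk(top_k).indices]

paths = ["a.jpg", "b.jpg", "c.jpg"]
hits = search("a dented car bumper", paths, embed_images(paths))
answer = ask_vlm(hits[0], "Describe the damage and estimate its severity.")   # hypothetical wrapper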

Fine-tuning VLMs

Same patterns as Stage 10:

  • LoRA on the LLM portion (most common).
  • Optionally LoRA on the vision encoder (rarely needed).
  • Don’t usually need to fine-tune the projector.

Use cases: domain-specific OCR, branded UI understanding, custom labeling tasks.

Tools: TRL supports VLMs (SFTTrainer with image inputs); LLaVA’s training scripts; modern Axolotl variants.
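
A sketch of the usual LoRA setup with peft, restricting adapters to the language model’s attention projections; the model name, rank, and module regex are illustrative:

# LoRA on the LLM portion only; vision encoder and projector stay frozen.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Regex restricts LoRA to attention projections inside the language model.
    target_modules=r".*language_model.*\.(q_proj|v_proj)",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of weights are trainable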

Evaluation

Benchmarks:

  • MMMU: college-exam-level multimodal questions.
  • MathVista: visual math reasoning.
  • DocVQA: document QA.
  • ChartQA: chart understanding.
  • VQAv2: classic VQA.
  • VStar: small-detail / fine-grained tasks.

For your specific domain: build a labeled set of (image, question, answer) triples. Standard benchmark scores tell you the model’s general capabilities, not its fit for your task.
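
A sketch of such a domain eval over a JSONL file of triples, with plain exact-match scoring; ask_vlm is again a hypothetical wrapper around your VLM call of choice:

import json

def evaluate(triples_path):
    # Each line: {"image": ..., "question": ..., "answer": ...}
    triples = [json.loads(line) for line in open(triples_path)]
    correct = 0
    for t in triples:
        pred = ask_vlm(t["image"], t["question"])   # hypothetical wrapper
        correct += pred.strip().lower() == t["answer"].strip().lower()
    return correct / len(triples)

print(f"accuracy: {evaluate('eval_set.jsonl'):.1%}")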

Cost & latency

VLMs are more expensive per call than text-only:

  • Image tokens: 100s to 1000s per image, depending on resolution.
  • Vision encoder forward pass: significant compute.
  • Latency typically 2–5× a comparable text-only call.

Practical:

  • Resize images to the smallest resolution that captures what you need.
  • Cache image-derived results (descriptions, extractions) when reusable.
  • Batch when possible.
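
A sketch of the resize-and-cache pattern with Pillow, keyed on a hash of the original image bytes; the cache layout, target size, and ask_vlm wrapper are illustrative:

import hashlib
import json
import os
from PIL import Image

CACHE_DIR = "vlm_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

def shrink(path, max_side=1024):
    # Downscale while preserving aspect ratio; thumbnail never upscales.
    img = Image.open(path)
    img.thumbnail((max_side, max_side))
    out = path + ".small.jpg"
    img.convert("RGB").save(out, quality=85)
    return out

def cached_describe(path):
    # Key the cache on the original bytes so re-runs skip the VLM call entirely.
    key = hashlib.sha256(open(path, "rb").read()).hexdigest()
    cache_file = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(cache_file):
        return json.load(open(cache_file))["description"]
    description = ask_vlm(shrink(path), "Describe this image.")   # hypothetical wrapper
    json.dump({"description": description}, open(cache_file, "w"))
    return description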

Pitfalls

  • Trusting the VLM to count: verify with code if precision matters.
  • PII in images: faces, ID cards, screenshots may contain sensitive info. Detect and redact.
  • Misleading layouts: a chart with mislabeled axes will be misread.
  • Subtle text: text rendered as part of an image (memes, screenshots) may need higher resolution.

See also