12

stage · curriculum

Multimodal

Frontier models in 2026 are natively multimodal: text, image, audio, video. They share a transformer skeleton and differ mainly in their tokenizers and decoders. Once images and text are embedded in a shared space, the rest of the toolkit follows from it.

6 articles
27 min to read
3 demos
4 books

If you only do one thing

CLIP-style embeddings map text and images into the same vector space, so a single nearest-neighbor retrieval algorithm serves both modalities. The rest of the multimodal toolkit builds on this idea.
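
To make the shared-space idea concrete, here is a minimal sketch of cross-modal retrieval. The three-dimensional vectors are hand-written placeholders, not outputs of CLIP or any real encoder, and the file names and the `retrieve` helper are hypothetical. The only point is that once both modalities live in one space, a single cosine-similarity ranking handles text queries and image queries alike.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def retrieve(query_vec, index, k=2):
    """Return the names of the k index entries most similar to query_vec."""
    scored = [(cosine(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# ASSUMPTION: toy vectors standing in for image-encoder outputs.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# ASSUMPTION: toy vector standing in for the text-encoder output
# of a query such as "a photo of a dog".
text_query = [0.2, 0.8, 0.0]

top = retrieve(text_query, image_index, k=2)
print(top)
```

The same `retrieve` call works unchanged if the query vector comes from an image instead of a sentence, which is what makes image-to-image and text-to-image search the same problem.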

Articles in this stage

  1. Multimodal Embeddings (CLIP)
  2. Speech & TTS
  3. Synthetic Data
  4. Text-to-Image Diffusion
  5. Video Generation
  6. Vision-Language Models (VLMs)

Stage 12 — Multimodal AI

Text-only is the past. Frontier models in 2026 are natively multimodal — text, image, audio, video, sometimes 3D and structured data. This stage covers the architectures, the models, and the application patterns.

Prerequisites

  • Stage 05 (embeddings)
  • Stage 06 (transformers)

Learning ladder

  1. Multimodal embeddings (CLIP)
  2. Vision-language models — VLMs, image-to-text, doc understanding
  3. Text-to-image diffusion
  4. Video generation — Sora, Wan, Veo
  5. Speech & TTS — Whisper, modern TTS, real-time voice
  6. Synthetic data

MVU (minimum viable understanding)

You can:

  • Pick the right multimodal stack for a use case (search, generation, understanding)
  • Embed images and text in a shared space and retrieve across modalities
  • Distinguish a vision-language encoder from a generator
  • Articulate when to use a diffusion model vs an autoregressive one
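
One way to keep the diffusion versus autoregressive distinction straight is to compare the shape of the two sampling loops. The sketch below is a toy illustration of control flow only: `toy_next_token` and `toy_denoise` are hypothetical stand-ins, not real models, and no actual diffusion noise schedule is implemented. It shows that an autoregressive sampler makes one model call per new element and grows the output, while a diffusion-style sampler makes one call per refinement step and updates every position at once.

```python
def autoregressive_sample(next_token, prompt, n_new):
    """Grow the sequence one element at a time; each call sees the full prefix."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def diffusion_sample(denoise, start, n_steps):
    """Refine the whole sample at once over a fixed number of steps."""
    x = list(start)
    for t in range(n_steps, 0, -1):  # t counts down: n_steps, ..., 1
        x = denoise(x, t)
    return x


# ASSUMPTION: toy stand-ins for the learned models, with call counters.
calls = {"ar": 0, "diff": 0}
TARGET = [1.0, 2.0, 3.0, 4.0]


def toy_next_token(prefix):
    calls["ar"] += 1
    return prefix[-1] + 1  # "predict" the next integer


def toy_denoise(x, t):
    calls["diff"] += 1
    # Move every position a fraction 1/t of the way toward TARGET,
    # so the final step (t == 1) lands on it.
    return [xi + (target_i - xi) / t for xi, target_i in zip(x, TARGET)]


tokens = autoregressive_sample(toy_next_token, [0], n_new=4)
sample = diffusion_sample(toy_denoise, [0.0, 0.0, 0.0, 0.0], n_steps=3)

print(tokens, calls["ar"])   # call count scales with output length
print(sample, calls["diff"])  # call count scales with number of steps
```

In this toy the autoregressive cost scales with how much is generated, while the diffusion cost scales with the step count regardless of sample size. That is one of several factors to weigh when choosing between the two, not a complete decision rule.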

Exercise

Build a “search my photos by description” feature. Use CLIP-style embeddings to embed both images and queries; retrieve. Then extend to “answer questions about my photos” using a vision-language model.
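
A possible skeleton for the exercise is sketched below. The encoders and the question-answering step are injected as plain callables so that real models can be swapped in later. Everything named here (`PhotoSearch`, `fake_embed`, `stub_answer`, the file names) is hypothetical, and `fake_embed` is a keyword counter over file names, not a CLIP-style encoder. It exists only so the skeleton runs without any model installed.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def _unit(v: Vector) -> Vector:
    """Normalize to unit length; leave all-zero vectors unchanged."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


class PhotoSearch:
    def __init__(self, embed_image: Callable[[str], Vector],
                 embed_text: Callable[[str], Vector]):
        self._embed_image = embed_image
        self._embed_text = embed_text
        self._items: List[Tuple[str, Vector]] = []

    def add(self, path: str) -> None:
        """Embed a photo once at index time and store the unit vector."""
        self._items.append((path, _unit(self._embed_image(path))))

    def search(self, query: str, k: int = 3) -> List[str]:
        """Rank stored photos by cosine similarity to the text query."""
        q = _unit(self._embed_text(query))
        ranked = sorted(
            self._items,
            key=lambda item: -sum(a * b for a, b in zip(q, item[1])),
        )
        return [path for path, _ in ranked[:k]]

    def ask(self, question: str,
            answer_fn: Callable[[List[str], str], str], k: int = 1) -> str:
        """Retrieve first, then hand the top hits to a vision-language model."""
        return answer_fn(self.search(question, k), question)


# ASSUMPTION: stand-in encoder that counts keywords; replace with real
# image and text encoders that share one embedding space.
VOCAB = ["dog", "beach", "city", "night"]


def fake_embed(text: str) -> Vector:
    words = text.lower().replace("_", " ").replace(".", " ").split()
    return [float(words.count(w)) for w in VOCAB]


# ASSUMPTION: stand-in for a call to a vision-language model.
def stub_answer(paths: List[str], question: str) -> str:
    return f"(stub) would send {paths[0]} to a VLM with: {question}"


app = PhotoSearch(embed_image=fake_embed, embed_text=fake_embed)
for name in ["dog_beach.jpg", "city_night.jpg", "dog_city.jpg"]:
    app.add(name)

hits = app.search("dog on the beach", k=2)
answer = app.ask("what is the dog doing on the beach", stub_answer)
print(hits)
print(answer)
```

Keeping retrieval and answering as separate steps means the first half of the exercise can be finished and tested before any vision-language model is wired in.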

Further reading

Books move slower than papers in this field — treat these as foundations, not replacements for the latest research. Real authors, real publishers, real editions. Free badges mark books with author-authorized full text online.

  1. ★ start here
    Computer Vision: Algorithms and Applications (free)

    Richard Szeliski

    Springer, 2nd ed., 2022

    The canonical computer-vision textbook. Free PDF on the author's site.

  2. Probabilistic Machine Learning: Advanced Topics (free)

    Kevin P. Murphy

    MIT Press, 2023

    Chapters on VAEs and diffusion for the math behind multimodal generators.

  3. Deep Learning for Vision Systems

    Mohamed Elgendy

    Manning, 2020

    Bridges the CV-textbook material to working deep-learning code.

  4. Hands-On Large Language Models

    Jay Alammar, Maarten Grootendorst

    O'Reilly, 2024

    Visual, practical, including Alammar's classic Illustrated Transformer diagrams in book form.