Stage 12 — Multimodal AI

Text-only is the past. Frontier models in 2026 are natively multimodal — text, image, audio, video, sometimes 3D and structured data. This stage covers the architectures, the models, and the application patterns.

Prerequisites

  • Stage 05 (embeddings)
  • Stage 06 (transformers)

Learning ladder

  1. Multimodal embeddings (CLIP; see the sketch after this list)
  2. Vision-language models — VLMs, image-to-text, doc understanding
  3. Text-to-image diffusion
  4. Video generation — Sora, Wan, Veo
  5. Speech & TTS — Whisper (recognition), modern TTS (synthesis), real-time voice
  6. Synthetic data
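
A minimal sketch of rung 1, assuming the Hugging Face transformers and Pillow packages and the public "openai/clip-vit-base-patch32" checkpoint; the image path and captions are made up. It embeds one image and three candidate captions into CLIP's shared space and scores them against each other:

```python
# Minimal CLIP sketch: embed one image and three captions into the same
# space and score them. "cat.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")
texts = ["a photo of a cat", "a photo of a dog", "a city skyline at night"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarity between the image
# and each caption; softmax turns it into a relative ranking.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```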

MVU

You can:

  • Pick the right multimodal stack for a use case (search, generation, understanding)
  • Embed images and text in a shared space and retrieve across modalities
  • Distinguish a vision-language encoder from a generator
  • Articulate when to use a diffusion model vs an autoregressive one (the two loops are contrasted in the toy sketch below)
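
To make the last point concrete, here is a toy, runnable contrast between the two generation loops. Everything model-like in it is a hypothetical stand-in; the point is the shape of the loops, not the output:

```python
# Toy contrast of the two sampling loops. The "models" are random
# stand-ins so the code runs; only the control flow is meaningful.
import random

def predict_next(prefix):
    # Stand-in for an autoregressive model's next-token sampler.
    return random.randrange(100)

def denoise_step(x, t, n_steps):
    # Stand-in for a learned denoiser: shrink the noise toward zero.
    return [v * (t / n_steps) for v in x]

def sample_autoregressive(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        # One model call per token, each conditioned on the full prefix.
        tokens.append(predict_next(tokens))
    return tokens

def sample_diffusion(dim, n_steps):
    x = [random.gauss(0, 1) for _ in range(dim)]  # start from pure noise
    for t in reversed(range(n_steps)):
        # Every step refines the WHOLE sample, not one position.
        x = denoise_step(x, t, n_steps)
    return x

print(sample_autoregressive([1, 2, 3], n_new=5))
print(sample_diffusion(dim=4, n_steps=10))
```

The rule of thumb this illustrates: autoregressive sampling pays one model call per token and naturally fits discrete sequences like text, while diffusion refines an entire continuous sample (an image, an audio clip, video frames) over a fixed step budget.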

Exercise

Build a “search my photos by description” feature. Use CLIP-style embeddings to put both photos and text queries into one space, then rank photos by cosine similarity to the query. Once retrieval works, extend it to “answer questions about my photos” by passing the top hits to a vision-language model. A starting sketch follows.
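
A starting sketch for the retrieval half, under the same assumptions as the CLIP example above (transformers, Pillow, the same public checkpoint); the photo folder and query string are placeholders:

```python
# Sketch of "search my photos by description": index a folder of photos
# once, then rank them against a free-text query. Folder and query are
# placeholders; the checkpoint is the same public one as above.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize

def embed_text(query):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Index once, query many times.
paths = sorted(Path("photos").glob("*.jpg"))
index = embed_images(paths)  # shape: (n_photos, dim)

query = embed_text("a dog on the beach at sunset")
scores = (index @ query.T).squeeze(1)  # cosine similarity per photo
top = sorted(zip(scores.tolist(), paths), reverse=True)[:5]
for score, path in top:
    print(f"{score:.3f}  {path}")
```

For the question-answering extension, a common pattern is to retrieve the top few photos with this index and hand them, along with the user's question, to a vision-language model; the exact call depends on which VLM or API you choose.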

See also