Stage 12 — Multimodal AI
Text-only is the past. Frontier models in 2026 are natively multimodal — text, image, audio, video, sometimes 3D and structured data. This stage covers the architectures, the models, and the application patterns.
Prerequisites
- Stage 05 (embeddings)
- Stage 06 (transformers)
Learning ladder
- Multimodal embeddings (CLIP; see the sketch after this list)
- Vision-language models — VLMs, image-to-text, document understanding
- Text-to-image diffusion
- Video generation — Sora, Wan, Veo
- Speech & TTS — Whisper, modern TTS, real-time voice
- Synthetic data
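The first rung is concrete enough to sketch. Below is a minimal CLIP-style embedder, assuming the Hugging Face `transformers` and `torch` packages, `Pillow`, and the public `openai/clip-vit-base-patch32` checkpoint:

```python
# Minimal sketch: embed images and text into CLIP's shared vector space.
# Assumes: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Return unit-normalized image embeddings, shape (N, D)."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(query):
    """Return a unit-normalized text embedding, shape (1, D)."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)
```

Because the two encoders were trained contrastively into one space, cosine similarity between a text vector and an image vector is a meaningful cross-modal retrieval score.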
MVU
You can:
- Pick the right multimodal stack for a use case (search, generation, understanding)
- Embed images and text in a shared space and retrieve across modalities
- Distinguish a vision-language encoder from a generator
- Articulate when to use a diffusion model vs an autoregressive one
Exercise
Build a “search my photos by description” feature. Use CLIP-style embeddings to embed both images and text queries into a shared space, then retrieve the nearest images by cosine similarity. Next, extend it to “answer questions about my photos” with a vision-language model.
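A minimal sketch of the retrieval half, reusing the `embed_images` and `embed_text` helpers from the CLIP sketch above; the `photos/` directory and `.jpg` glob are assumptions about your layout:

```python
# Index a folder of photos once, then search it by natural-language description.
# Assumes the embed_images / embed_text helpers defined in the CLIP sketch.
from pathlib import Path

photo_paths = sorted(Path("photos").glob("*.jpg"))  # hypothetical layout
image_index = embed_images(photo_paths)             # (N, D), unit-normalized
                                                    # (batch in chunks for large libraries)

def search(query, k=5):
    q = embed_text(query)                      # (1, D), unit-normalized
    scores = (image_index @ q.T).squeeze(-1)   # cosine similarity per photo
    top = scores.topk(min(k, len(photo_paths)))
    return [(photo_paths[i], scores[i].item()) for i in top.indices]

for path, score in search("a dog playing on the beach"):
    print(f"{score:.3f}  {path}")
```

For the QA extension, one option (an assumption, not the only stack) is the `transformers` visual-question-answering pipeline; chat-style VLMs give richer free-form answers, but this keeps the sketch self-contained:

```python
from transformers import pipeline

# ViLT returns short classification-style answers with confidence scores.
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

best_path, _ = search("a dog playing on the beach", k=1)[0]
print(vqa(image=str(best_path), question="What color is the dog?"))
```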
See also
- Stage 06 — Transformers
- Stage 09 — RAG — retrieval extends to multimodal
- Stage 14 — Applications