Stage 12 — Multimodal AI
Text-only is the past. Frontier models in 2026 are natively multimodal — text, image, audio, video, sometimes 3D and structured data. This stage covers the architectures, the models, and the application patterns.
Prerequisites
- Stage 05 (embeddings)
- Stage 06 (transformers)
Learning ladder
- Multimodal embeddings (CLIP; see the sketch after this list)
- Vision-language models — VLMs, image-to-text, document understanding
- Text-to-image diffusion
- Video generation — Sora, Wan, Veo
- Speech & TTS — Whisper, modern TTS, real-time voice
- Synthetic data
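The first rung is concrete enough to sketch. Below is a minimal CLIP-style embedder, assuming the Hugging Face `transformers` and `torch` packages, `Pillow`, and the public `openai/clip-vit-base-patch32` checkpoint:

```python
# Minimal sketch: embed images and text into CLIP's shared vector space.
# Assumes: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Return unit-normalized image embeddings, shape (N, D)."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(query):
    """Return a unit-normalized text embedding, shape (1, D)."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)
```

Because the two encoders were trained contrastively into one space, cosine similarity between a text vector and an image vector is a meaningful cross-modal retrieval score.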
MVU
You can:
- Pick the right multimodal stack for a use case (search, generation, understanding)
- Embed images and text in a shared space and retrieve across modalities
- Distinguish a vision-language encoder from a generator
- Articulate when to use a diffusion model vs an autoregressive one
Exercise
Build a “search my photos by description” feature. Use CLIP-style embeddings to embed both images and text queries into a shared space, then retrieve the nearest images by cosine similarity. Next, extend it to “answer questions about my photos” with a vision-language model.
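A minimal sketch of the retrieval half, reusing the `embed_images` and `embed_text` helpers from the CLIP sketch above; the `photos/` directory and `.jpg` glob are assumptions about your layout:

```python
# Index a folder of photos once, then search it by natural-language description.
# Assumes the embed_images / embed_text helpers defined in the CLIP sketch.
from pathlib import Path

photo_paths = sorted(Path("photos").glob("*.jpg"))  # hypothetical layout
image_index = embed_images(photo_paths)             # (N, D), unit-normalized
                                                    # (batch in chunks for large libraries)

def search(query, k=5):
    q = embed_text(query)                      # (1, D), unit-normalized
    scores = (image_index @ q.T).squeeze(-1)   # cosine similarity per photo
    top = scores.topk(min(k, len(photo_paths)))
    return [(photo_paths[i], scores[i].item()) for i in top.indices]

for path, score in search("a dog playing on the beach"):
    print(f"{score:.3f}  {path}")
```

For the QA extension, one option (an assumption, not the only stack) is the `transformers` visual-question-answering pipeline; chat-style VLMs give richer free-form answers, but this keeps the sketch self-contained:

```python
from transformers import pipeline

# ViLT returns short classification-style answers with confidence scores.
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

best_path, _ = search("a dog playing on the beach", k=1)[0]
print(vqa(image=str(best_path), question="What color is the dog?"))
```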
See also
- Stage 06 — Transformers
- Stage 09 — RAG — retrieval extends to multimodal
- Stage 14 — Applications