Video Generation
Adding the time dimension to image generation. Way harder: temporal coherence, physical plausibility, much more compute. The frontier has moved enormously between 2022 and 2026.
The challenge
A video isn’t just a sequence of images. It has:
- Temporal coherence: objects persist; lighting is stable; people don’t blink in and out.
- Physical plausibility: gravity, motion, occlusion, fluids.
- Higher dimensionality: a 5-second 1080p video at 30fps is ~150 frames × ~2M pixels.
- Audio coupling (sometimes): speech matching mouth movement.
Generating coherent video means modeling all of this, using the same self-supervised diffusion or autoregressive paradigms that power image and text generation.
Approaches
Diffusion + temporal layers
Start with an image diffusion model. Add temporal attention layers that operate across frames. Train on video data. (A minimal temporal-layer sketch follows the list below.)
- Stable Video Diffusion (Stability AI).
- AnimateDiff: adds motion modules to existing image models.
- Most early-2024 video gen.
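A minimal PyTorch sketch of the temporal-layer idea. Module names and shape conventions are illustrative, not any particular model's code:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention across the time axis only, applied independently at
    each spatial location. Inserted between the frozen spatial layers of a
    pretrained image diffusion model."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, spatial_positions, dim) latent features
        b, t, s, d = x.shape
        # fold spatial positions into the batch so attention sees only frames
        h = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        hn = self.norm(h)
        out, _ = self.attn(hn, hn, hn)
        h = h + out  # residual connection around the new layer
        return h.reshape(b, s, t, d).permute(0, 2, 1, 3)
```

The residual connection lets a freshly inserted layer start close to an identity function (AnimateDiff goes further and zero-initializes the output projection), so the pretrained image priors survive the start of video training.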
Diffusion Transformers (DiT) for video
Treat the entire video as a 3D grid of patches (height × width × time). Apply a transformer. (A patchify sketch follows the list below.)
- Sora (OpenAI, 2024): the breakthrough.
- Veo 2 / 3 (Google, 2025).
- Wan 2.x (Alibaba, open-source).
- HunyuanVideo (Tencent).
- Kling, Runway Gen-3 / 4, Pika: production services.
DiT-based video generation is the dominant paradigm in 2026.
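To make the "3D grid of patches" concrete, here is a sketch of the patchify step. Patch sizes and latent shapes are illustrative; each model picks its own:

```python
import torch

def patchify_video(latents: torch.Tensor, pt: int = 1, ph: int = 2, pw: int = 2) -> torch.Tensor:
    """Flatten a video latent (B, C, T, H, W) into a sequence of
    spatio-temporal patch tokens of shape (B, N, C*pt*ph*pw)."""
    b, c, t, h, w = latents.shape
    x = latents.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    # bring the patch-grid axes together, fold patch contents into features
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7).reshape(b, -1, c * pt * ph * pw)
    return x

# e.g. an 8-channel latent of 16 frames at 32x32 -> 16 * 16 * 16 = 4096 tokens
tokens = patchify_video(torch.randn(1, 8, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 4096, 32])
```

The transformer then runs attention over these spatio-temporal tokens, which is why compute grows so quickly with resolution and duration.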
Autoregressive video
Tokenize video into discrete codes; generate token-by-token like an LLM (sampling sketch below):
- CogVideoX (research).
- VideoPoet (Google research).
Less common; harder to get high resolution.
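A sketch of the sampling loop, assuming a hypothetical causal transformer `model` over a discrete video-token vocabulary (not any specific system's API):

```python
import torch

@torch.no_grad()
def sample_video_tokens(model, vocab_size: int, num_tokens: int, device="cuda"):
    """Autoregressive sampling over discrete video tokens. `model` stands in
    for any causal transformer mapping token ids to next-token logits."""
    tokens = torch.zeros(1, 1, dtype=torch.long, device=device)  # BOS id 0
    for _ in range(num_tokens):
        logits = model(tokens)[:, -1, :]  # logits for the next token
        next_tok = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]  # decode with the video tokenizer's decoder afterwards
```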
Frontier (early 2026)
| Model | Producer | Status |
|---|---|---|
| Sora 2 | OpenAI | Commercial |
| Veo 3 | Google | Commercial (via Gemini) |
| Wan 2.x | Alibaba | Open-weights |
| HunyuanVideo | Tencent | Open-weights |
| Kling 2.x | Kuaishou | Commercial |
| Runway Gen-4 | Runway | Commercial |
| Pika 3 | Pika Labs | Commercial |
Open-source (Wan, Hunyuan) is competitive with closed for most use cases by early 2026.
Capabilities
By 2026 frontier video models can:
- Generate ~5–60 second clips at 720p–1080p (longer at lower resolution).
- Maintain character consistency across shots.
- Follow complex prompts including camera movement, scene changes.
- Handle physics (water, fire, fabric) reasonably for short durations.
- Generate audio-synced lip movements (some models).
Still struggle with:
- Hands and detailed body movement (better than 2023 but still imperfect).
- Long videos (>1 minute) without coherence drift.
- Precise text rendering inside videos.
- Complex multi-object interactions.
Conditioning
Modern video models accept:
- Text prompts.
- First / last frame (image-to-video, useful for animation; example after this list).
- Reference video (style transfer).
- Reference image of subject (character consistency).
- Camera movement specs (pan, zoom, orbit).
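As a concrete example of first-frame conditioning, a sketch using the diffusers StableVideoDiffusionPipeline (the input file name is illustrative):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

image = load_image("first_frame.png").resize((1024, 576))  # conditioning frame
frames = pipe(image, num_frames=25, decode_chunk_size=8).frames[0]
export_to_video(frames, "clip.mp4", fps=7)
```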
Practical use
Generating a clip
```python
# Sketch for an OSS model (HunyuanVideo via diffusers; checkpoint id assumed)
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.bfloat16
)
pipe.vae.enable_tiling()  # tile VAE decode to keep memory manageable
pipe.to("cuda")
video = pipe(
    prompt="A cat reading a book on a windowsill, sunlight streaming in",
    num_frames=120,
    height=720, width=1280,
    num_inference_steps=50,
).frames[0]  # pipeline returns a batch of frame lists; take the first
export_to_video(video, "cat.mp4", fps=24)
```
Inference cost
- 5-second 720p clip: $0.10–$2.00 via API; tens of seconds on a top-end GPU locally.
- Longer / higher-res: scales roughly linearly.
- 4K, 30s+: still requires substantial compute.
Motion brushes and controls
Newer products (Runway, Kling) expose direct motion controls:
- “Make this object move along this path.”
- “Camera pans left.”
- “Zoom in on the face.”
This makes video generation feel more like editing than rolling dice.
Production patterns
Short-form content
5–15 second clips for ads, social media, B-roll. Mature use case; off-the-shelf services work.
Storyboard-to-video
Generate keyframes with an image model → feed them to a video model as first/last-frame conditioning → animate between them.
Extension / continuation
Generate an initial clip → use last frame as conditioning → generate next clip → repeat. Good for longer durations with controlled drift.
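A sketch of the pattern, where `generate_clip` is a hypothetical stand-in for any image-conditioned pipeline such as the one above:

```python
# Hypothetical pattern: `generate_clip` stands in for an image-conditioned
# video pipeline and returns the generated frames as a list.
def extend_video(first_frame, prompt, n_segments=4):
    frames, cond = [], first_frame
    for _ in range(n_segments):
        clip = generate_clip(image=cond, prompt=prompt)  # list of PIL frames
        frames.extend(clip[1:] if frames else clip)  # skip duplicated seam frame
        cond = clip[-1]  # last frame conditions the next segment
    return frames  # drift accumulates per hop, so keep segments short
```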
Hybrid with traditional editing
AI generates raw shots; humans cut and arrange in a traditional NLE. Practical workflow for higher-quality output.
Special domains
Talking-head video / lip sync
For digital avatars, narration, dubbing:
- HeyGen, Synthesia, D-ID: production services.
- Open-source: HunyuanVideo-Avatar, EMO, Hallo.
Different problem set from general video gen; more constrained, higher quality possible.
Motion graphics
Stylized animations, infographics. Tools like Runway, Kaiber specialize in stylized output.
Documentary / talking-head with archive
Generate footage to fill gaps in archive material. Ethically contentious; mark as AI-generated.
Evaluation
Hard. Standard metrics:
- VBench: comprehensive video gen benchmark.
- FVD (Fréchet Video Distance): distribution similarity between real and generated clips (sketch after this list).
- Human preference: still the gold standard.
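The Fréchet distance behind FVD is simple to compute once you have per-video embeddings. A sketch (FVD proper uses features from an I3D video network; any video encoder gives a rough comparison):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two (N, D) feature sets."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # strip numerical-noise imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))
```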
For specific applications, build domain-specific evals: does the model handle your particular subjects, styles, durations?
Misuse and detection
Same concerns as image generation, amplified:
- Deepfakes of real people.
- Misinformation: fake event footage.
- NSFW / non-consensual content.
Mitigations:
- Watermarking (C2PA, SynthID).
- Provenance metadata.
- Detection classifiers (cat-and-mouse with generators).
- Platform-level moderation.
By 2026, the generation-vs-detection arms race is roughly even. Don’t trust video as ground truth without provenance.
What’s next
- Longer durations with coherent characters (10+ minutes).
- Higher resolution and frame rates at viable cost.
- Real-time generation (interactive video — games, VR).
- Audio-coupled generation (speech, ambient, foley) integrated.
- Personalization — generate video of yourself reliably.
- 3D understanding — models that infer 3D scene structure for camera control.