Video Generation

Adding the time dimension to image generation. Way harder: temporal coherence, physical plausibility, much more compute. The frontier has moved enormously between 2022 and 2026.

The challenge

A video isn’t just a sequence of images. It has:

  • Temporal coherence: objects persist; lighting is stable; people don’t blink in and out.
  • Physical plausibility: gravity, motion, occlusion, fluids.
  • Higher dimensionality: a 5-second 1080p video at 30fps is ~150 frames × ~2M pixels (see the quick count after this list).
  • Audio coupling (sometimes): speech matching mouth movement.
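
To make the dimensionality point concrete, here is the raw-value count next to the much smaller latent grid a typical spatiotemporal VAE produces; the 8× spatial / 4× temporal compression and 16 latent channels are common illustrative choices, not any specific model's.

frames, height, width, channels = 150, 1080, 1920, 3
raw_values = frames * height * width * channels
print(f"{raw_values / 1e9:.2f}B raw values")                # ~0.93B

# Illustrative latent compression: 8x spatial, 4x temporal, 16 channels
latent_values = (frames // 4) * (height // 8) * (width // 8) * 16
print(f"{latent_values / 1e6:.1f}M latent values")          # ~19.2M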

Generating coherent video means modeling all of this, using the same self-supervised diffusion or autoregressive paradigms as image generation.

Approaches

Diffusion + temporal layers

Start with an image diffusion model. Add temporal attention layers that operate across frames. Train on video data.

  • Stable Video Diffusion (Stability AI).
  • AnimateDiff: adds motion modules to existing image models.
  • Most early-2024 video gen.
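
A minimal sketch of the temporal-layer idea, assuming a PyTorch-style module; real motion modules (AnimateDiff's, for example) also add temporal position encodings and slot between the frozen spatial blocks of the image model.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    # Self-attention across the time axis only: spatial positions are folded
    # into the batch, so each pixel location attends over the frame sequence.
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, channels)
        b, f, hw, c = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * hw, f, c)
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x.reshape(b, hw, f, c).permute(0, 2, 1, 3)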

Diffusion Transformers (DiT) for video

Treat the entire video as a 3D grid of patches (height × width × time). Apply a transformer.

  • Sora (OpenAI, 2024): the breakthrough.
  • Veo 2 / 3 (Google, 2025).
  • Wan 2.x (Alibaba, open-source).
  • HunyuanVideo (Tencent).
  • Kling, Runway Gen-3 / 4, Pika: production services.

DiT-based video generation is the dominant paradigm in 2026.
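
A minimal sketch of the spacetime-patch idea; the patch sizes below are illustrative, not any particular model's configuration.

import torch

def patchify_video(video: torch.Tensor, pt: int = 2, ph: int = 16, pw: int = 16) -> torch.Tensor:
    # video: (batch, channels, frames, height, width) -> (batch, tokens, patch_dim)
    b, c, f, h, w = video.shape
    x = video.reshape(b, c, f // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7).reshape(b, -1, c * pt * ph * pw)
    return x  # one token per (pt × ph × pw) spacetime block

tokens = patchify_video(torch.randn(1, 3, 16, 256, 256))
print(tokens.shape)  # torch.Size([1, 2048, 1536])

The transformer then denoises this token sequence, with positional information spanning both space and time.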

Autoregressive video

Tokenize video; generate token-by-token like an LLM:

  • CogVideoX (research).
  • VideoPoet (Google research).

Less common; harder to get high resolution.
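
A minimal sketch of the token-by-token pattern, assuming a hypothetical autoregressive transformer over discrete video-tokenizer codes (the model and tokenizer names are placeholders).

import torch

def sample_video_tokens(model, prompt_tokens: torch.Tensor, n_new: int) -> torch.Tensor:
    # model: hypothetical autoregressive transformer returning per-position logits
    tokens = prompt_tokens
    for _ in range(n_new):
        logits = model(tokens)[:, -1, :]                 # next-token distribution
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens  # decode back to pixels with the video tokenizer's decoder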

Frontier (early 2026)

Model          Producer     Status
Sora 2         OpenAI       Commercial
Veo 3          Google       Commercial (via Gemini)
Wan 2.x        Alibaba      Open-weights
HunyuanVideo   Tencent      Open-weights
Kling 2.x      Kuaishou     Commercial
Runway Gen-4   Runway       Commercial
Pika 3         Pika Labs    Commercial

Open-weights models (Wan, HunyuanVideo) are competitive with closed models for most use cases by early 2026.

Capabilities

By 2026 frontier video models can:

  • Generate ~5–60 second clips at 720p–1080p (longer at lower resolution).
  • Maintain character consistency across shots.
  • Follow complex prompts including camera movement, scene changes.
  • Handle physics (water, fire, fabric) reasonably for short durations.
  • Generate audio-synced lip movements (some models).

Still struggle with:

  • Hands and detailed body movement (better than 2023 but still imperfect).
  • Long videos (>1 minute) without coherence drift.
  • Precise text rendering inside videos.
  • Complex multi-object interactions.

Conditioning

Modern video models accept:

  • Text prompts.
  • First / last frame (image-to-video, useful for animation; see the sketch after this list).
  • Reference video (style transfer).
  • Reference image of subject (character consistency).
  • Camera movement specs (pan, zoom, orbit).
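
A sketch of first-frame conditioning using one concrete open-weights option, Stable Video Diffusion through diffusers; the repo id and settings are the commonly documented defaults and are a starting point, not a recommendation.

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
)
pipe.enable_model_cpu_offload()

first_frame = load_image("first_frame.png")       # the conditioning image
frames = pipe(first_frame, num_frames=25, motion_bucket_id=127,
              decode_chunk_size=8).frames[0]
export_to_video(frames, "clip.mp4", fps=7)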

Practical use

Generating a clip

# Sketch for an open-weights model via diffusers (model repo id elided)
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

pipe = HunyuanVideoPipeline.from_pretrained(..., torch_dtype=torch.bfloat16)
pipe.vae.enable_tiling()          # tile the VAE decode to reduce peak VRAM
pipe.to("cuda")

output = pipe(
    prompt="A cat reading a book on a windowsill, sunlight streaming in",
    num_frames=121,               # ≈5 s at 24 fps; many video VAEs expect 4k+1 frames
    height=720, width=1280,
    num_inference_steps=50,
)
export_to_video(output.frames[0], "cat.mp4", fps=24)

Inference cost

  • 5-second 720p clip: $0.10–$2.00 via API; tens of seconds on a top-end GPU locally.
  • Longer / higher-res: scales roughly linearly.
  • 4K, 30s+: still requires substantial compute.

Motion brushes and controls

Newer products (Runway, Kling) expose direct motion controls:

  • “Make this object move along this path.”
  • “Camera pans left.”
  • “Zoom in on the face.”

This makes video generation feel more like editing than rolling dice.

Production patterns

Short-form content

5–15 second clips for ads, social media, B-roll. Mature use case; off-the-shelf services work.

Storyboard-to-video

Generate stills with an image model → use them as keyframes in a video model → animate between them.

Extension / continuation

Generate an initial clip → use last frame as conditioning → generate next clip → repeat. Good for longer durations with controlled drift.
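
A sketch of the looping pattern; generate_clip is a hypothetical wrapper around whatever image-to-video call you use and returns a list of frames.

def extend_video(prompt: str, first_frame, n_segments: int = 4) -> list:
    # generate_clip(prompt, first_frame=...) is hypothetical, not a real API.
    frames: list = []
    current = first_frame
    for _ in range(n_segments):
        clip = generate_clip(prompt, first_frame=current)
        frames.extend(clip)
        current = clip[-1]   # condition the next segment on the last frame
    return frames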

Hybrid with traditional editing

AI generates raw shots; humans cut and arrange in a traditional NLE. Practical workflow for higher-quality output.

Special domains

Talking-head video / lip sync

For digital avatars, narration, dubbing:

  • HeyGen, Synthesia, D-ID: production services.
  • Open-source: HunyuanVideo-Avatar, EMO, Hallo.

Different problem set from general video gen; more constrained, higher quality possible.

Motion graphics

Stylized animations, infographics. Tools like Runway, Kaiber specialize in stylized output.

Documentary / talking-head with archive

Generate footage to fill gaps in archive material. Ethically contentious; mark as AI-generated.

Evaluation

Hard. Standard metrics:

  • VBench: comprehensive video gen benchmark.
  • FVD (Fréchet Video Distance): distribution similarity (see the sketch after this list).
  • Human preference: still the gold standard.
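
FVD is a Fréchet distance between Gaussian fits of embeddings of real and generated videos (I3D features in the original formulation). A minimal sketch of the distance itself, with the feature extractor left out:

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    # Each array: (n_videos, feature_dim) embeddings from a video network.
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real               # drop numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))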

For specific applications, build domain-specific evals: does the model handle your particular subjects, styles, durations?

Misuse and detection

Same concerns as image generation, amplified:

  • Deepfakes of real people.
  • Misinformation: fake event footage.
  • NSFW / non-consensual content.

Mitigations:

  • Watermarking (e.g. SynthID).
  • Provenance metadata (C2PA Content Credentials).
  • Detection classifiers (cat-and-mouse with generators).
  • Platform-level moderation.

By 2026, the generation-vs-detection arms race is roughly even. Don’t trust video as ground truth without provenance.

What’s next

  • Longer durations with coherent characters (10+ minutes).
  • Higher resolution and frame rates at viable cost.
  • Real-time generation (interactive video — games, VR).
  • Audio-coupled generation (speech, ambient, foley) integrated.
  • Personalization — generate video of yourself reliably.
  • 3D understanding — models that infer 3D scene structure for camera control.

See also