Video Generation

Adding the time dimension to image generation. Way harder: temporal coherence, physical plausibility, much more compute. The frontier has moved enormously between 2022 and 2026.

The challenge

A video isn’t just a sequence of images. It has:

  • Temporal coherence: objects persist; lighting is stable; people don’t blink in and out.
  • Physical plausibility: gravity, motion, occlusion, fluids.
  • Higher dimensionality: a 5-second 1080p video at 30fps is ~150 frames × ~2M pixels (see the quick count after this list).
  • Audio coupling (sometimes): speech matching mouth movement.
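
To make the dimensionality point concrete, here is the raw-value count next to the much smaller latent grid a typical spatiotemporal VAE produces; the 8× spatial / 4× temporal compression and 16 latent channels are common illustrative choices, not any specific model's.

frames, height, width, channels = 150, 1080, 1920, 3
raw_values = frames * height * width * channels
print(f"{raw_values / 1e9:.2f}B raw values")                # ~0.93B

# Illustrative latent compression: 8x spatial, 4x temporal, 16 channels
latent_values = (frames // 4) * (height // 8) * (width // 8) * 16
print(f"{latent_values / 1e6:.1f}M latent values")          # ~19.2M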

Generating coherent video means modeling all of this, using the same self-supervised diffusion or autoregressive paradigms as image generation.

Approaches

Diffusion + temporal layers

Start with an image diffusion model. Add temporal attention layers that operate across frames. Train on video data.

  • Stable Video Diffusion (Stability AI).
  • AnimateDiff: adds motion modules to existing image models.
  • Most early-2024 video gen.
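
A minimal sketch of the temporal-layer idea, assuming a PyTorch-style module; real motion modules (AnimateDiff's, for example) also add temporal position encodings and slot between the frozen spatial blocks of the image model.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    # Self-attention across the time axis only: spatial positions are folded
    # into the batch, so each pixel location attends over the frame sequence.
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, channels)
        b, f, hw, c = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * hw, f, c)
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x.reshape(b, hw, f, c).permute(0, 2, 1, 3)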

Diffusion Transformers (DiT) for video

Treat the entire video as a 3D grid of patches (height × width × time). Apply a transformer.

  • Sora (OpenAI, 2024): the breakthrough.
  • Veo 2 / 3 (Google, 2025).
  • Wan 2.x (Alibaba, open-source).
  • HunyuanVideo (Tencent).
  • Kling, Runway Gen-3 / 4, Pika: production services.

DiT-based video generation is the dominant paradigm in 2026.
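
A minimal sketch of the spacetime-patch idea; the patch sizes below are illustrative, not any particular model's configuration.

import torch

def patchify_video(video: torch.Tensor, pt: int = 2, ph: int = 16, pw: int = 16) -> torch.Tensor:
    # video: (batch, channels, frames, height, width) -> (batch, tokens, patch_dim)
    b, c, f, h, w = video.shape
    x = video.reshape(b, c, f // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7).reshape(b, -1, c * pt * ph * pw)
    return x  # one token per (pt × ph × pw) spacetime block

tokens = patchify_video(torch.randn(1, 3, 16, 256, 256))
print(tokens.shape)  # torch.Size([1, 2048, 1536])

The transformer then denoises this token sequence, with positional information spanning both space and time.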

Autoregressive video

Tokenize video; generate token-by-token like an LLM:

  • CogVideoX (research).
  • VideoPoet (Google research).

Less common; harder to get high resolution.
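
A minimal sketch of the token-by-token pattern, assuming a hypothetical autoregressive transformer over discrete video-tokenizer codes (the model and tokenizer names are placeholders).

import torch

def sample_video_tokens(model, prompt_tokens: torch.Tensor, n_new: int) -> torch.Tensor:
    # model: hypothetical autoregressive transformer returning per-position logits
    tokens = prompt_tokens
    for _ in range(n_new):
        logits = model(tokens)[:, -1, :]                 # next-token distribution
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens  # decode back to pixels with the video tokenizer's decoder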

Frontier (early 2026)

Model          Producer     Status
Sora 2         OpenAI       Commercial
Veo 3          Google       Commercial (via Gemini)
Wan 2.x        Alibaba      Open-weights
HunyuanVideo   Tencent      Open-weights
Kling 2.x      Kuaishou     Commercial
Runway Gen-4   Runway       Commercial
Pika 3         Pika Labs    Commercial

Open-weights models (Wan, HunyuanVideo) are competitive with closed models for most use cases by early 2026.

Capabilities

By 2026 frontier video models can:

  • Generate ~5–60 second clips at 720p–1080p (longer at lower resolution).
  • Maintain character consistency across shots.
  • Follow complex prompts including camera movement, scene changes.
  • Handle physics (water, fire, fabric) reasonably for short durations.
  • Generate audio-synced lip movements (some models).

Still struggle with:

  • Hands and detailed body movement (better than 2023 but still imperfect).
  • Long videos (>1 minute) without coherence drift.
  • Precise text rendering inside videos.
  • Complex multi-object interactions.

Conditioning

Modern video models accept:

  • Text prompts.
  • First / last frame (image-to-video, useful for animation; see the sketch after this list).
  • Reference video (style transfer).
  • Reference image of subject (character consistency).
  • Camera movement specs (pan, zoom, orbit).
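
A sketch of first-frame conditioning using one concrete open-weights option, Stable Video Diffusion through diffusers; the repo id and settings are the commonly documented defaults and are a starting point, not a recommendation.

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
)
pipe.enable_model_cpu_offload()

first_frame = load_image("first_frame.png")       # the conditioning image
frames = pipe(first_frame, num_frames=25, motion_bucket_id=127,
              decode_chunk_size=8).frames[0]
export_to_video(frames, "clip.mp4", fps=7)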

Practical use

Generating a clip

# Sketch for an open-weights model via diffusers (model repo id elided)
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

pipe = HunyuanVideoPipeline.from_pretrained(..., torch_dtype=torch.bfloat16)
pipe.vae.enable_tiling()          # tile the VAE decode to reduce peak VRAM
pipe.to("cuda")

output = pipe(
    prompt="A cat reading a book on a windowsill, sunlight streaming in",
    num_frames=121,               # ≈5 s at 24 fps; many video VAEs expect 4k+1 frames
    height=720, width=1280,
    num_inference_steps=50,
)
export_to_video(output.frames[0], "cat.mp4", fps=24)

Inference cost

  • 5-second 720p clip: $0.10–$2.00 via API; tens of seconds on a top-end GPU locally.
  • Longer / higher-res: scales roughly linearly.
  • 4K, 30s+: still requires substantial compute.

Motion brushes and controls

Newer products (Runway, Kling) expose direct motion controls:

  • “Make this object move along this path.”
  • “Camera pans left.”
  • “Zoom in on the face.”

This makes video generation feel more like editing than rolling dice.

Production patterns

Short-form content

5–15 second clips for ads, social media, B-roll. Mature use case; off-the-shelf services work.

Storyboard-to-video

Generate stills with an image model → use them as keyframes in a video model → animate between them.

Extension / continuation

Generate an initial clip → use last frame as conditioning → generate next clip → repeat. Good for longer durations with controlled drift.
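
A sketch of the looping pattern; generate_clip is a hypothetical wrapper around whatever image-to-video call you use and returns a list of frames.

def extend_video(prompt: str, first_frame, n_segments: int = 4) -> list:
    # generate_clip(prompt, first_frame=...) is hypothetical, not a real API.
    frames: list = []
    current = first_frame
    for _ in range(n_segments):
        clip = generate_clip(prompt, first_frame=current)
        frames.extend(clip)
        current = clip[-1]   # condition the next segment on the last frame
    return frames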

Hybrid with traditional editing

AI generates raw shots; humans cut and arrange in a traditional NLE. Practical workflow for higher-quality output.

Special domains

Talking-head video / lip sync

For digital avatars, narration, dubbing:

  • HeyGen, Synthesia, D-ID: production services.
  • Open-source: HunyuanVideo-Avatar, EMO, Hallo.

Different problem set from general video gen; more constrained, higher quality possible.

Motion graphics

Stylized animations, infographics. Tools like Runway, Kaiber specialize in stylized output.

Documentary / talking-head with archive

Generate footage to fill gaps in archive material. Ethically contentious; mark as AI-generated.

Evaluation

Hard. Standard metrics:

  • VBench: comprehensive video gen benchmark.
  • FVD (Fréchet Video Distance): distribution similarity (see the sketch after this list).
  • Human preference: still the gold standard.
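
FVD is a Fréchet distance between Gaussian fits of embeddings of real and generated videos (I3D features in the original formulation). A minimal sketch of the distance itself, with the feature extractor left out:

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    # Each array: (n_videos, feature_dim) embeddings from a video network.
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real               # drop numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))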

For specific applications, build domain-specific evals: does the model handle your particular subjects, styles, durations?

Misuse and detection

Same concerns as image generation, amplified:

  • Deepfakes of real people.
  • Misinformation: fake event footage.
  • NSFW / non-consensual content.

Mitigations:

  • Watermarking (e.g. SynthID).
  • Provenance metadata (C2PA Content Credentials).
  • Detection classifiers (cat-and-mouse with generators).
  • Platform-level moderation.

By 2026, the generation-vs-detection arms race is roughly even. Don’t trust video as ground truth without provenance.

What’s next

  • Longer durations with coherent characters (10+ minutes).
  • Higher resolution and frame rates at viable cost.
  • Real-time generation (interactive video — games, VR).
  • Audio-coupled generation (speech, ambient, foley) integrated.
  • Personalization — generate video of yourself reliably.
  • 3D understanding — models that infer 3D scene structure for camera control.

See also