From noise, structure
Watch a recognizable shape dissolve into Gaussian static, then reassemble. This is the toy version of every image generator — Stable Diffusion, DALL-E, Imagen — running at full speed.
The mechanism
A diffusion model defines two processes:
- Forward (q): at each step, add a small amount of Gaussian noise. After T steps, your data is pure noise. This process is fixed, not learned. The closed form is x_t = √α̅_t · x_0 + √(1 − α̅_t) · ε, where ε ~ N(0, I).
- Reverse (p_θ): learn to predict the noise that was added at each step. Subtract the predicted noise to step backward. After T iterations, you've gone from pure Gaussian noise to a sample from the data distribution.
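The forward process above can be sketched in a few lines of NumPy. This is a toy, point-shaped version under assumed choices: a linear beta schedule over T = 1000 steps and a 2D data point standing in for an image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D data point (the "image" in this point-shaped demo).
x0 = np.array([1.0, -0.5])

# Assumed linear beta schedule; alpha_bar_t is the running product of (1 - beta).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, eps):
    """Jump straight to step t: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

eps = rng.standard_normal(x0.shape)
x_mid = q_sample(x0, T // 2, eps)   # partially noised
x_end = q_sample(x0, T - 1, eps)    # nearly pure noise: alpha_bar[T-1] is tiny
```

Because α̅_t is a product of terms below 1, it decays toward zero, so the data term vanishes and x_T is effectively pure Gaussian noise.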
Why it works
The reverse process is much easier to learn than directly predicting an image. Predicting the noise at each step is local, well-conditioned, and trains stably. The hard part — the complex global structure of an image — emerges over many small denoising steps, not in one shot.
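The "predict the noise" objective can be written down directly. The sketch below computes one Monte Carlo sample of the simplified DDPM loss, ||ε − ε_θ(x_t, t)||²; the linear `eps_model` is a hypothetical stand-in for the real network, just to make the loss runnable.

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(x_t, t, W):
    # Stand-in predictor: a linear map (hypothetical; real models use a network).
    return W @ x_t

def ddpm_loss(x0, W):
    # One Monte Carlo sample of the simplified objective:
    #   L = E_{t, eps} || eps - eps_theta(x_t, t) ||^2
    t = rng.integers(T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    pred = eps_model(x_t, t, W)
    return np.sum((eps - pred) ** 2)

x0 = np.array([1.0, -0.5])
W = np.zeros((2, 2))        # untrained: always predicts zero noise
loss = ddpm_loss(x0, W)
```

Each training step only asks the model to regress a known Gaussian target at one timestep, which is what makes the optimization local and well-conditioned.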
Real diffusion is image-shaped, not point-shaped
A real model has a high-dimensional latent grid (4×64×64 for Stable Diffusion), a U-Net with cross-attention to text, classifier-free guidance, and schedulers like DDIM or Euler that take 20–50 steps instead of 1000. Same idea, just with a much higher-dimensional state space. The 2D toy here is the teaching version — it shows the shape of the mechanism with no math you can't see.
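For completeness, the reverse loop itself looks like this. A minimal sketch of DDPM ancestral sampling in the 2D toy setting, with an assumed placeholder `eps_model` (a trained network would go there); real schedulers like DDIM reshape this loop to take far fewer steps.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(x_t, t):
    # Placeholder noise predictor (hypothetical): a trained network goes here.
    return np.zeros_like(x_t)

def reverse_sample(dim=2):
    """DDPM ancestral sampling: start from N(0, I), denoise for T steps."""
    x = rng.standard_normal(dim)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, t)
        # Posterior mean: subtract the scaled predicted noise, then rescale.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(dim) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

sample = reverse_sample()
```

With a real ε_θ, the final `x` lands on the data distribution; with the zero placeholder here, the loop still runs but just produces Gaussian-shaped output.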
Anchored to 12-multimodal/text-to-image-diffusion.