From noise, structure
Watch a recognizable shape dissolve into Gaussian static, then reassemble. This is the toy version of every image generator — Stable Diffusion, DALL-E, Imagen — running at full speed.
The mechanism
A diffusion model defines two processes:
- Forward (q): at each step, add a small amount of Gaussian noise. After T steps, your data is pure noise. This process is fixed, not learned. The closed form is x_t = √α̅_t · x_0 + √(1 − α̅_t) · ε, where ε ~ N(0, I).
- Reverse (p_θ): learn to predict the noise that was added at each step. Subtract the predicted noise to step backward. After T iterations, you've gone from pure Gaussian noise to a sample from the data distribution.
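The forward process above can be sketched in a few lines of NumPy. This is a toy, point-shaped version under assumed choices: a linear beta schedule over T = 1000 steps and a 2D data point standing in for an image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D data point (the "image" in this point-shaped demo).
x0 = np.array([1.0, -0.5])

# Assumed linear beta schedule; alpha_bar_t is the running product of (1 - beta).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, eps):
    """Jump straight to step t: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

eps = rng.standard_normal(x0.shape)
x_mid = q_sample(x0, T // 2, eps)   # partially noised
x_end = q_sample(x0, T - 1, eps)    # nearly pure noise: alpha_bar[T-1] is tiny
```

Because α̅_t is a product of terms below 1, it decays toward zero, so the data term vanishes and x_T is effectively pure Gaussian noise.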
Why it works
The reverse process is much easier to learn than directly predicting an image. Predicting the noise at each step is local, well-conditioned, and trains stably. The hard part — the complex global structure of an image — emerges over many small denoising steps, not in one shot.
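The "predict the noise" objective can be written down directly. The sketch below computes one Monte Carlo sample of the simplified DDPM loss, ||ε − ε_θ(x_t, t)||²; the linear `eps_model` is a hypothetical stand-in for the real network, just to make the loss runnable.

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(x_t, t, W):
    # Stand-in predictor: a linear map (hypothetical; real models use a network).
    return W @ x_t

def ddpm_loss(x0, W):
    # One Monte Carlo sample of the simplified objective:
    #   L = E_{t, eps} || eps - eps_theta(x_t, t) ||^2
    t = rng.integers(T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    pred = eps_model(x_t, t, W)
    return np.sum((eps - pred) ** 2)

x0 = np.array([1.0, -0.5])
W = np.zeros((2, 2))        # untrained: always predicts zero noise
loss = ddpm_loss(x0, W)
```

Each training step only asks the model to regress a known Gaussian target at one timestep, which is what makes the optimization local and well-conditioned.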
Real diffusion is image-shaped, not point-shaped
A real model has a high-dimensional latent grid (4×64×64 for Stable Diffusion), a U-Net with cross-attention to text, classifier-free guidance, and schedulers like DDIM or Euler that take 20–50 steps instead of 1000. Same idea, just with a much higher-dimensional state space. The 2D toy here is the teaching version — it shows the shape of the mechanism with no math you can't see.
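For completeness, the reverse loop itself looks like this. A minimal sketch of DDPM ancestral sampling in the 2D toy setting, with an assumed placeholder `eps_model` (a trained network would go there); real schedulers like DDIM reshape this loop to take far fewer steps.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(x_t, t):
    # Placeholder noise predictor (hypothetical): a trained network goes here.
    return np.zeros_like(x_t)

def reverse_sample(dim=2):
    """DDPM ancestral sampling: start from N(0, I), denoise for T steps."""
    x = rng.standard_normal(dim)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, t)
        # Posterior mean: subtract the scaled predicted noise, then rescale.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(dim) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

sample = reverse_sample()
```

With a real ε_θ, the final `x` lands on the data distribution; with the zero placeholder here, the loop still runs but just produces Gaussian-shaped output.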
Anchored to 12-multimodal/text-to-image-diffusion.