Text-to-Image Diffusion

Diffusion models generate images by starting with noise and denoising step by step. Conditioned on text, they produce images matching the prompt. This was the breakthrough behind DALL-E 2, Stable Diffusion, Midjourney, and the modern image-generation explosion.

The diffusion idea

Two processes:

Forward (noising) — at training time

Take an image; add a tiny bit of Gaussian noise; repeat T times until pure noise.

Reverse (denoising) — what we learn

Train a model that, given a noisy image at step t, predicts the noise. With this, we can step backward from noise to image.

Mathematically, the model learns ε_θ(x_t, t), and each reverse step is, schematically:

x_{t-1} = x_t - schedule(t) · ε_θ(x_t, t) + noise

where schedule(t) and the amount of re-injected noise are fixed by the noise schedule, not learned.

After training, generation = sample noise → denoise step by step → image.
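
A minimal sketch of that reverse loop, assuming a trained noise predictor eps_model and 1-D tensors alphas / alphas_cumprod precomputed from the noise schedule (all names here are illustrative, not a library API):

import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, alphas, alphas_cumprod):
    # Start from pure Gaussian noise x_T and walk back to x_0 step by step.
    x = torch.randn(shape)
    for t in reversed(range(len(alphas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)              # model's guess of the noise in x_t
        a_t, a_bar = alphas[t], alphas_cumprod[t]
        # DDPM mean: subtract the rescaled predicted noise, then rescale x.
        x = (x - (1 - a_t) / (1 - a_bar).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            x = x + (1 - a_t).sqrt() * torch.randn_like(x)   # re-inject noise (sigma_t^2 = beta_t)
    return x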

Conditioning on text

A text prompt steers generation. The model takes both the noisy image and the text:

ε_θ(x_t, t, text_embedding)

Text encoders used:

  • CLIP text encoder (early SD).
  • T5-XXL (Imagen, SD3).
  • Custom text-only LLMs (modern variants).

Classifier-free guidance: at sampling, mix conditional and unconditional predictions:

ε = (1+w) · ε_cond − w · ε_uncond

w (guidance scale) controls how strongly the prompt influences output. Higher w = more on-prompt but lower diversity.
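
In code the mix is a one-liner (eps_cond and eps_uncond are the model's two predictions for the same noisy latent):

# Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction.
# w = 0 gives the plain conditional model; larger w follows the prompt more strongly.
def cfg_mix(eps_cond, eps_uncond, w):
    return (1 + w) * eps_cond - w * eps_uncond

This is algebraically the same as eps_uncond + s · (eps_cond - eps_uncond) with s = 1 + w, which is the form the diffusers guidance_scale parameter uses.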

Latent diffusion (Stable Diffusion)

Diffusion in pixel space is expensive (millions of pixels). Latent diffusion (Rombach et al. 2022):

  1. Encode image to a low-dim latent (e.g. 64×64×4 instead of 512×512×3) via a VAE.
  2. Diffuse in latent space.
  3. Decode the final latent back to an image.

10–100× cheaper. Used by Stable Diffusion, SDXL, SD3, Flux, and most modern open-source models.
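
A sketch of the latent round-trip with the diffusers VAE (the model id is illustrative; image is assumed to be a (1, 3, 512, 512) tensor scaled to [-1, 1] and already on the GPU):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda")

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor  # (1, 4, 64, 64)
    # ... the diffusion model runs here, entirely in latent space ...
    decoded = vae.decode(latents / vae.config.scaling_factor).sample              # back to (1, 3, 512, 512)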

Architecture: U-Net vs Transformer

U-Net (classic)

A convolutional encoder-decoder with skip connections. Used by SD 1.x/2.x, SDXL.

DiT (Diffusion Transformer)

Replaces the U-Net with a transformer operating on patches. Used by:

  • SD3 (Stability AI).
  • Flux (Black Forest Labs) — the open-source SOTA in late 2024 / early 2025.
  • Sora for video.
  • Imagen 3 (Google).

Transformers scale better; they’re now the dominant architecture for new image-gen models.
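
A toy sketch of the shape flow, patchify → transformer → unpatchify. Real DiTs add positional embeddings, a linear patch projection, and adaptive LayerNorm conditioning on the timestep and text, none of which is shown here:

import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Toy DiT: patchify the latent into tokens, run a transformer, unpatchify."""
    def __init__(self, latent_ch=4, patch=2, dim=512, depth=8, heads=8):
        super().__init__()
        self.patchify = nn.Conv2d(latent_ch, dim, kernel_size=patch, stride=patch)
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.unpatchify = nn.ConvTranspose2d(dim, latent_ch, kernel_size=patch, stride=patch)

    def forward(self, x, t):                                   # x: (B, 4, 64, 64), t: (B,)
        tok = self.patchify(x)                                 # (B, dim, 32, 32)
        B, D, H, W = tok.shape
        tok = tok.flatten(2).transpose(1, 2)                   # (B, H*W, dim) token sequence
        tok = tok + self.t_embed(t.float()[:, None])[:, None]  # add timestep embedding to every token
        tok = self.blocks(tok)
        tok = tok.transpose(1, 2).reshape(B, D, H, W)
        return self.unpatchify(tok)                            # predicted noise, same shape as x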

Modern frontier (early 2026)

Closed

  • DALL-E 3 (OpenAI): photo-real with strong text rendering.
  • Imagen 3 / 4 (Google): high quality, integrated into Gemini.
  • Midjourney v7: aesthetic-focused, very popular for creative work.
  • Adobe Firefly: licensed-data training, enterprise-friendly.
  • Reve Image and others.

Open

  • Flux.1 Dev / Pro / Schnell (Black Forest Labs): open-weights SOTA.
  • Stable Diffusion 3.5 / SD4 (Stability).
  • HiDream / HunyuanDiT / PixArt-Σ: research-grade open models.
  • Sana (NVIDIA): efficient scaled DiT.

For most image-gen needs in 2026, Flux is the open-source default; Midjourney or DALL-E 3 for production polish.

Conditioning beyond text

Modern systems condition on much more:

  • ControlNet: guide generation with edge maps, depth maps, poses, sketches.
  • IP-Adapter: condition on a reference image.
  • DreamBooth / LoRA: personalize a model to a specific subject (your dog, a brand asset).
  • Inpainting / outpainting: regenerate or extend parts of an image.
  • Image-to-image (img2img): start from an image, denoise partially.

These compose: you can stack a Flux LoRA, ControlNet pose conditioning, and an IP-Adapter style reference in a single generation, as sketched below.
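
A sketch of stacking ControlNet pose conditioning with a LoRA in diffusers, shown on an SD 1.5 backbone. The model ids and LoRA path are illustrative, and pose_image (a rendered pose skeleton) is assumed to be prepared elsewhere:

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Pose-conditioned ControlNet attached to an SD 1.5 pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Stack a personalization LoRA on top (hypothetical local path).
pipe.load_lora_weights("path/to/subject_lora")

image = pipe(
    "my subject as an astronaut, studio lighting",
    image=pose_image,            # control signal: pose skeleton image
    num_inference_steps=30,
).images[0]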

Training details

  • Datasets: web-scraped image-text pairs, often filtered (LAION-5B was the seminal one; modern training uses higher-quality curated mixes).
  • Compute: tens to hundreds of thousands of GPU-hours for SOTA models.
  • Loss: simple — predict the noise. The “magic” is in scale, data, and schedules.
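
In PyTorch the objective really is a few lines. A schematic training step, assuming model, a batch of images scaled to [-1, 1], and a precomputed alphas_cumprod tensor:

import torch
import torch.nn.functional as F

def training_step(model, images, alphas_cumprod, T=1000):
    # Pick a random timestep per image and noise the batch accordingly.
    t = torch.randint(0, T, (images.shape[0],), device=images.device)
    noise = torch.randn_like(images)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise
    # The model predicts the noise; the loss is plain MSE against it.
    return F.mse_loss(model(noisy, t), noise)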

Inference

from diffusers import StableDiffusion3Pipeline
import torch

# Load SD3 medium in bfloat16 and move it to the GPU.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "A cat reading a book",
    num_inference_steps=28,   # denoising steps: more is slower, marginally better
    guidance_scale=7.0,       # classifier-free guidance strength
).images[0]

Knobs:

  • Steps: 20–50 is typical. More = slightly better, much slower.
  • Guidance scale: 5–10 typical.
  • Resolution: 512×512 (older), 1024×1024+ (modern).
  • Sampler / scheduler: DPM++ 2M, Euler, Karras — different speed/quality tradeoffs.
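
Swapping the scheduler is one line in diffusers. A sketch for the UNet-era pipelines (SD 1.x / SDXL); SD3 and Flux ship with flow-matching schedulers instead:

from diffusers import DPMSolverMultistepScheduler

# Swap the sampler without reloading the model.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True   # DPM++ 2M with Karras sigmas
)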

Speed up inference

  • Distillation: turbo / lightning models do 1–4 step generation. Quality slightly down; speed dramatically up. Flux Schnell is a 4-step distillation.
  • Latent consistency models (LCM): a different distillation route to fast inference.
  • Quantization: int8 / fp8 weights; smaller VRAM, faster.
  • Caching: reuse computations across closely-related prompts.

By 2026, sub-second image generation at 1024px is common on consumer GPUs.
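
Few-step generation with a distilled model looks just like the standard call, only with fewer steps. A sketch with Flux Schnell via diffusers (it is typically run with 4 steps and no classifier-free guidance):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "A cat reading a book",
    num_inference_steps=4,     # distilled: 1-4 steps instead of 20-50
    guidance_scale=0.0,        # Schnell is run without CFG
).images[0]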

Personalization

To generate “this specific person” or “this specific brand”:

  • DreamBooth: full fine-tune; expensive but high quality.
  • LoRA: small adapter; the dominant method.
  • Textual inversion: learn a new “word” embedding for the concept. Cheap, less flexible.

Train a LoRA on 5–20 photos of a subject → generate new images of them in any context.

Image-gen for production

Concerns:

  • Copyright / training data: ongoing legal and ethical questions.
  • Deepfakes / misuse: generated images can mislead.
  • Watermarking: SynthID (Google), invisible watermarks for provenance.
  • Content moderation: NSFW filters, public-figure refusal.
  • Determinism: same prompt + seed should reproduce; helpful for debugging.

For commercial use:

  • Adobe Firefly: trained on licensed data; safe-for-commercial-use.
  • Pay-for-API services (DALL-E, Midjourney, Imagen) with terms allowing commercial use.
  • Open-weights with safe data (e.g. some Mitsua / community models).

Evaluation

Metrics:

  • FID (Fréchet Inception Distance): distribution similarity to real images. Lower = better.
  • CLIP-Score: how well the generated image matches the prompt.
  • Human preference: pairwise comparisons.
  • VQA metrics: ask a VLM “does this image contain X?”

For real applications, human preference dominates. FID is mostly a research-grade metric.
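
A CLIP-Score sketch using the standard OpenAI CLIP checkpoint via transformers; the score is the cosine similarity between the prompt and image embeddings (image is a PIL image):

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, prompt):
    # Embed both, normalize, take cosine similarity: higher = better prompt match.
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(-1).item()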

See also