Text-to-Image Diffusion
Diffusion models generate images by starting with noise and denoising step by step. Conditioned on text, they produce images matching the prompt. This was the breakthrough behind DALL-E 2, Stable Diffusion, Midjourney, and the modern image-generation explosion.
The diffusion idea
Two processes:
Forward (noising) — at training time
Take an image; add a tiny bit of Gaussian noise; repeat T times until pure noise.
Reverse (denoising) — what we learn
Train a model that, given a noisy image at step t, predicts the noise. With this, we can step backward from noise to image.
Mathematically, the model learns ε_θ(x_t, t), and the reverse update is, schematically:
x_{t-1} = x_t - schedule(t) · ε_θ(x_t, t) + noise
After training, generation = sample noise → denoise step by step → image.
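In code, training reduces to the closed-form forward process plus a noise-prediction loss. A minimal sketch, assuming a hypothetical model(x_t, t) that returns a noise prediction and a simple linear beta schedule:

import torch

# Linear noise schedule; alpha_bars[t] is the cumulative product of (1 - beta)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x0):
    # x0: batch of clean images (or latents), shape [B, C, H, W]
    B = x0.shape[0]
    t = torch.randint(0, T, (B,))                       # random timestep per sample
    eps = torch.randn_like(x0)                           # the noise the model must predict
    a = alpha_bars[t].view(B, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps           # forward process in closed form
    eps_pred = model(x_t, t)                              # ε_θ(x_t, t)
    return torch.nn.functional.mse_loss(eps_pred, eps)    # "predict the noise" loss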
Conditioning on text
A text prompt steers generation. The model takes both the noisy image and the text:
ε_θ(x_t, t, text_embedding)
Text encoders used:
- CLIP text encoder (early SD).
- T5-XXL (Imagen, SD3).
- Custom text-only LLMs (modern variants).
Classifier-free guidance: at sampling, mix conditional and unconditional predictions:
ε = (1+w) · ε_cond − w · ε_uncond
w (guidance scale) controls how strongly the prompt influences output. Higher w = more on-prompt but lower diversity.
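At sampling time this is just two predictions mixed according to the formula above. A sketch with placeholder model, text_emb, and null_emb (the embedding of the empty prompt):

import torch

def guided_eps(model, x_t, t, text_emb, null_emb, w=7.0):
    # One batched pass: conditional prediction first, unconditional second
    eps_cond, eps_uncond = model(
        torch.cat([x_t, x_t]),
        torch.cat([t, t]),
        torch.cat([text_emb, null_emb]),
    ).chunk(2)
    # ε = (1 + w) · ε_cond − w · ε_uncond
    return (1 + w) * eps_cond - w * eps_uncond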
Latent diffusion (Stable Diffusion)
Diffusion in pixel space is expensive (millions of pixels). Latent diffusion (Rombach et al. 2022):
- Encode image to a low-dim latent (e.g. 64×64×4 instead of 512×512×3) via a VAE.
- Diffuse in latent space.
- Decode the final latent back to an image.
10–100× cheaper. Used by Stable Diffusion, SDXL, SD3, Flux, most modern open-source models.
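To make the savings concrete, here is a sketch of the VAE roundtrip with the SD 1.x autoencoder from diffusers; the input tensor is a stand-in for a real image scaled to [-1, 1]:

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda")
image = torch.randn(1, 3, 512, 512, device="cuda")  # stand-in for a real image in [-1, 1]

with torch.no_grad():
    # 512×512×3 pixels -> 64×64×4 latent (8× spatial downsampling)
    latent = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    # ... diffusion runs here, entirely in latent space ...
    recon = vae.decode(latent / vae.config.scaling_factor).sample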
Architecture: U-Net vs Transformer
U-Net (classic)
A convolutional encoder-decoder with skip connections. Used by SD 1.x/2.x, SDXL.
DiT (Diffusion Transformer)
Replaces the U-Net with a transformer operating on patches. Used by:
- SD3 (Stability AI).
- Flux (Black Forest Labs) — the open-weights SOTA in late 2024 / early 2025.
- Sora for video.
- Imagen 3 (Google).
Transformers scale better; they’re now the dominant architecture for new image-gen models.
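A rough sketch of what "operating on patches" means, with shapes matching the 64×64×4 latent above (patch size follows the DiT paper; the embedding width is illustrative and the transformer itself is elided):

import torch
import torch.nn as nn

latent = torch.randn(1, 4, 64, 64)                    # [B, C, H, W] latent from the VAE
embed = nn.Conv2d(4, 1024, kernel_size=2, stride=2)   # 2×2 latent patches -> tokens

tokens = embed(latent)                       # [1, 1024, 32, 32]
tokens = tokens.flatten(2).transpose(1, 2)   # [1, 1024, 1024]: a plain transformer sequence
# A standard transformer (conditioned on timestep and text) processes `tokens`,
# then the output sequence is un-patchified back into a [1, 4, 64, 64] noise prediction.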
Modern frontier (early 2026)
Closed
- DALL-E 3 (OpenAI): photo-real with strong text rendering.
- Imagen 3 / 4 (Google): high quality, integrated into Gemini.
- Midjourney v7: aesthetic-focused, very popular for creative work.
- Adobe Firefly: licensed-data training, enterprise-friendly.
- Reve Image and others.
Open
- Flux.1 Dev / Pro / Schnell (Black Forest Labs): open-weights SOTA.
- Stable Diffusion 3.5 / SD4 (Stability).
- Hidream / HunyuanDiT / PixArt-Σ: research-grade open models.
- Sana (NVIDIA): efficient scaled DiT.
For most image-gen needs in 2026, Flux is the open-source default; Midjourney or DALL-E 3 for production polish.
Conditioning beyond text
Modern systems condition on much more:
- ControlNet: guide generation with edge maps, depth maps, poses, sketches.
- IP-Adapter: condition on a reference image.
- DreamBooth / LoRA: personalize a model to a specific subject (your dog, a brand asset).
- Inpainting / outpainting: regenerate or extend parts of an image.
- Image-to-image (img2img): start from an image, denoise partially.
These compose: you can stack a Flux LoRA, ControlNet pose conditioning, and an IP-Adapter style reference in a single generation.
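As a concrete example of img2img, a sketch with diffusers' AutoPipelineForImage2Image; the checkpoint, input file, and strength are illustrative:

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",   # any img2img-capable checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init_image = load_image("sketch.png")   # hypothetical starting image
image = pipe(
    "a watercolor painting of a lighthouse",
    image=init_image,
    strength=0.6,   # how much noise to add: 0 keeps the input, 1 ignores it
).images[0]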
Training details
- Datasets: web-scraped image-text pairs, often filtered (LAION-5B was the seminal one; modern training uses higher-quality curated mixes).
- Compute: tens of thousands of GPU-hours for SOTA models.
- Loss: simple — predict the noise. The “magic” is in scale, data, and schedules.
Inference
import torch
from diffusers import StableDiffusion3Pipeline

# Load SD3-medium and move it to the GPU
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Text -> image: denoise for 28 steps with classifier-free guidance
image = pipe(
    "A cat reading a book",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
Knobs:
- Steps: 20–50 is typical. More = slightly better, much slower.
- Guidance scale: 5–10 typical.
- Resolution: 512×512 (older), 1024×1024+ (modern).
- Sampler / scheduler: DPM++ 2M, Euler, Karras — different speed/quality tradeoffs.
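Swapping the sampler is a one-line config change in diffusers. A sketch, assuming a U-Net-based pipeline (SD 1.x / SDXL) loaded as pipe:

from diffusers import DPMSolverMultistepScheduler

# Replace the default sampler with DPM++ 2M using Karras sigmas
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)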
Speed up inference
- Distillation: turbo / lightning models do 1–4 step generation. Quality drops slightly; speed goes up dramatically. Flux Schnell is a 4-step distillation (see the sketch below).
- Latent consistency models (LCM): a different distillation route to fast inference.
- Quantization: int8 / fp8 weights; smaller VRAM, faster.
- Caching: reuse computations across closely-related prompts.
By 2026, sub-second image generation at 1024px is common on consumer GPUs.
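For example, 4-step generation with the distilled Flux Schnell via diffusers (a sketch; Schnell is distilled to run without classifier-free guidance, hence the zero guidance scale):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "A cat reading a book",
    num_inference_steps=4,     # the distilled model targets 1-4 steps
    guidance_scale=0.0,        # Schnell runs without CFG
    max_sequence_length=256,
).images[0]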
Personalization
To generate “this specific person” or “this specific brand”:
- DreamBooth: full fine-tune; expensive but high quality.
- LoRA: small adapter; the dominant method.
- Textual inversion: learn a new “word” embedding for the concept. Cheap, less flexible.
Train a LoRA on 5–20 photos of a subject → generate new images of them in any context.
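Using a trained LoRA at inference time is a couple of lines in diffusers; the adapter path, name, and prompt here are hypothetical:

# Attach a subject LoRA to an existing pipeline (path is hypothetical)
pipe.load_lora_weights("path/to/my-subject-lora", adapter_name="subject")
pipe.set_adapters(["subject"], adapter_weights=[0.8])   # blend strength

image = pipe("my subject riding a bicycle in Paris").images[0]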
Image-gen for production
Concerns:
- Copyright / training data: ongoing legal and ethical questions.
- Deepfakes / misuse: generated images can mislead.
- Watermarking: SynthID (Google), invisible watermarks for provenance.
- Content moderation: NSFW filters, public-figure refusal.
- Determinism: same prompt + seed should reproduce; helpful for debugging.
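In diffusers, reproducibility comes from passing an explicitly seeded generator; a minimal sketch, reusing a pipeline loaded as pipe:

import torch

# Same prompt + same seed (+ same model, scheduler, and steps) -> same image
generator = torch.Generator("cuda").manual_seed(42)
image = pipe("A cat reading a book", generator=generator).images[0]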
For commercial use:
- Adobe Firefly: trained on licensed data; safe-for-commercial-use.
- Pay-for-API services (DALL-E, Midjourney, Imagen) with terms allowing commercial use.
- Open-weights with safe data (e.g. some Mitsua / community models).
Evaluation
Metrics:
- FID (Fréchet Inception Distance): distribution similarity to real images. Lower = better.
- CLIP-Score: how well the generated image matches the prompt.
- Human preference: pairwise comparisons.
- VQA metrics: ask a VLM “does this image contain X?”
For real applications, human preference dominates. FID is mostly a research-grade metric.
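A rough CLIP-Score sketch using the openai/clip-vit-base-patch32 checkpoint from transformers (the score is a cosine similarity; exact scaling conventions vary across papers):

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, prompt):
    # Embed image and text, then compare with cosine similarity
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()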