Text-to-Image Diffusion

Diffusion models generate images by starting with noise and denoising step by step. Conditioned on text, they produce images matching the prompt. This was the breakthrough behind DALL-E 2, Stable Diffusion, Midjourney, and the modern image-generation explosion.

The diffusion idea

Two processes:

Forward (noising) — at training time

Take an image; add a tiny bit of Gaussian noise; repeat T times until pure noise.

Reverse (denoising) — what we learn

Train a model that, given a noisy image at step t, predicts the noise. With this, we can step backward from noise to image.

Mathematically, the model learns ε_θ(x_t, t), and each reverse step is, schematically:

x_{t-1} = x_t - schedule(t) · ε_θ(x_t, t) + noise

where schedule(t) and the amount of re-injected noise are fixed by the noise schedule, not learned.

After training, generation = sample noise → denoise step by step → image.
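
A minimal sketch of that reverse loop, assuming a trained noise predictor eps_model and 1-D tensors alphas / alphas_cumprod precomputed from the noise schedule (all names here are illustrative, not a library API):

import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, alphas, alphas_cumprod):
    # Start from pure Gaussian noise x_T and walk back to x_0 step by step.
    x = torch.randn(shape)
    for t in reversed(range(len(alphas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)              # model's guess of the noise in x_t
        a_t, a_bar = alphas[t], alphas_cumprod[t]
        # DDPM mean: subtract the rescaled predicted noise, then rescale x.
        x = (x - (1 - a_t) / (1 - a_bar).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            x = x + (1 - a_t).sqrt() * torch.randn_like(x)   # re-inject noise (sigma_t^2 = beta_t)
    return x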

Conditioning on text

A text prompt steers generation. The model takes both the noisy image and the text:

ε_θ(x_t, t, text_embedding)

Text encoders used:

  • CLIP text encoder (early SD).
  • T5-XXL (Imagen, SD3).
  • Custom text-only LLMs (modern variants).

Classifier-free guidance: at sampling, mix conditional and unconditional predictions:

ε = (1+w) · ε_cond − w · ε_uncond

w (guidance scale) controls how strongly the prompt influences output. Higher w = more on-prompt but lower diversity.
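
In code the mix is a one-liner (eps_cond and eps_uncond are the model's two predictions for the same noisy latent):

# Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction.
# w = 0 gives the plain conditional model; larger w follows the prompt more strongly.
def cfg_mix(eps_cond, eps_uncond, w):
    return (1 + w) * eps_cond - w * eps_uncond

This is algebraically the same as eps_uncond + s · (eps_cond - eps_uncond) with s = 1 + w, which is the form the diffusers guidance_scale parameter uses.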

Latent diffusion (Stable Diffusion)

Diffusion in pixel space is expensive (millions of pixels). Latent diffusion (Rombach et al. 2022):

  1. Encode image to a low-dim latent (e.g. 64×64×4 instead of 512×512×3) via a VAE.
  2. Diffuse in latent space.
  3. Decode the final latent back to an image.

10–100× cheaper. Used by Stable Diffusion, SDXL, SD3, Flux, and most modern open-source models.
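
A sketch of the latent round-trip with the diffusers VAE (the model id is illustrative; image is assumed to be a (1, 3, 512, 512) tensor scaled to [-1, 1] and already on the GPU):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda")

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor  # (1, 4, 64, 64)
    # ... the diffusion model runs here, entirely in latent space ...
    decoded = vae.decode(latents / vae.config.scaling_factor).sample              # back to (1, 3, 512, 512)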

Architecture: U-Net vs Transformer

U-Net (classic)

A convolutional encoder-decoder with skip connections. Used by SD 1.x/2.x, SDXL.

DiT (Diffusion Transformer)

Replaces the U-Net with a transformer operating on patches. Used by:

  • SD3 (Stability AI).
  • Flux (Black Forest Labs) — the open-source SOTA in late 2024 / early 2025.
  • Sora for video.
  • Imagen 3 (Google).

Transformers scale better; they’re now the dominant architecture for new image-gen models.
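
A toy sketch of the shape flow, patchify → transformer → unpatchify. Real DiTs add positional embeddings, a linear patch projection, and adaptive LayerNorm conditioning on the timestep and text, none of which is shown here:

import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Toy DiT: patchify the latent into tokens, run a transformer, unpatchify."""
    def __init__(self, latent_ch=4, patch=2, dim=512, depth=8, heads=8):
        super().__init__()
        self.patchify = nn.Conv2d(latent_ch, dim, kernel_size=patch, stride=patch)
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.unpatchify = nn.ConvTranspose2d(dim, latent_ch, kernel_size=patch, stride=patch)

    def forward(self, x, t):                                   # x: (B, 4, 64, 64), t: (B,)
        tok = self.patchify(x)                                 # (B, dim, 32, 32)
        B, D, H, W = tok.shape
        tok = tok.flatten(2).transpose(1, 2)                   # (B, H*W, dim) token sequence
        tok = tok + self.t_embed(t.float()[:, None])[:, None]  # add timestep embedding to every token
        tok = self.blocks(tok)
        tok = tok.transpose(1, 2).reshape(B, D, H, W)
        return self.unpatchify(tok)                            # predicted noise, same shape as x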

Modern frontier (early 2026)

Closed

  • DALL-E 3 (OpenAI): photo-real with strong text rendering.
  • Imagen 3 / 4 (Google): high quality, integrated into Gemini.
  • Midjourney v7: aesthetic-focused, very popular for creative work.
  • Adobe Firefly: licensed-data training, enterprise-friendly.
  • Reve Image and others.

Open

  • Flux.1 Dev / Pro / Schnell (Black Forest Labs): open-weights SOTA.
  • Stable Diffusion 3.5 / SD4 (Stability).
  • HiDream / HunyuanDiT / PixArt-Σ: research-grade open models.
  • Sana (NVIDIA): efficient scaled DiT.

For most image-gen needs in 2026, Flux is the open-source default; Midjourney or DALL-E 3 for production polish.

Conditioning beyond text

Modern systems condition on much more:

  • ControlNet: guide generation with edge maps, depth maps, poses, sketches.
  • IP-Adapter: condition on a reference image.
  • DreamBooth / LoRA: personalize a model to a specific subject (your dog, a brand asset).
  • Inpainting / outpainting: regenerate or extend parts of an image.
  • Image-to-image (img2img): start from an image, denoise partially.

These compose: you can stack a Flux LoRA, ControlNet pose conditioning, and an IP-Adapter style reference in a single generation, as sketched below.
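
A sketch of stacking ControlNet pose conditioning with a LoRA in diffusers, shown on an SD 1.5 backbone. The model ids and LoRA path are illustrative, and pose_image (a rendered pose skeleton) is assumed to be prepared elsewhere:

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Pose-conditioned ControlNet attached to an SD 1.5 pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Stack a personalization LoRA on top (hypothetical local path).
pipe.load_lora_weights("path/to/subject_lora")

image = pipe(
    "my subject as an astronaut, studio lighting",
    image=pose_image,            # control signal: pose skeleton image
    num_inference_steps=30,
).images[0]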

Training details

  • Datasets: web-scraped image-text pairs, often filtered (LAION-5B was the seminal one; modern training uses higher-quality curated mixes).
  • Compute: tens to hundreds of thousands of GPU-hours for SOTA models.
  • Loss: simple — predict the noise. The “magic” is in scale, data, and schedules.
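
In PyTorch the objective really is a few lines. A schematic training step, assuming model, a batch of images scaled to [-1, 1], and a precomputed alphas_cumprod tensor:

import torch
import torch.nn.functional as F

def training_step(model, images, alphas_cumprod, T=1000):
    # Pick a random timestep per image and noise the batch accordingly.
    t = torch.randint(0, T, (images.shape[0],), device=images.device)
    noise = torch.randn_like(images)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise
    # The model predicts the noise; the loss is plain MSE against it.
    return F.mse_loss(model(noisy, t), noise)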

Inference

from diffusers import StableDiffusion3Pipeline
import torch

# Load SD3 medium in bfloat16 and move it to the GPU.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "A cat reading a book",
    num_inference_steps=28,   # denoising steps: more is slower, marginally better
    guidance_scale=7.0,       # classifier-free guidance strength
).images[0]

Knobs:

  • Steps: 20–50 is typical. More = slightly better, much slower.
  • Guidance scale: 5–10 typical.
  • Resolution: 512×512 (older), 1024×1024+ (modern).
  • Sampler / scheduler: DPM++ 2M, Euler, Karras — different speed/quality tradeoffs.
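
Swapping the scheduler is one line in diffusers. A sketch for the UNet-era pipelines (SD 1.x / SDXL); SD3 and Flux ship with flow-matching schedulers instead:

from diffusers import DPMSolverMultistepScheduler

# Swap the sampler without reloading the model.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True   # DPM++ 2M with Karras sigmas
)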

Speed up inference

  • Distillation: turbo / lightning models do 1–4 step generation. Quality slightly down; speed dramatically up. Flux Schnell is a 4-step distillation.
  • Latent consistency models (LCM): a different distillation route to fast inference.
  • Quantization: int8 / fp8 weights; smaller VRAM, faster.
  • Caching: reuse computations across closely-related prompts.

By 2026, sub-second image generation at 1024px is common on consumer GPUs.
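
Few-step generation with a distilled model looks just like the standard call, only with fewer steps. A sketch with Flux Schnell via diffusers (it is typically run with 4 steps and no classifier-free guidance):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "A cat reading a book",
    num_inference_steps=4,     # distilled: 1-4 steps instead of 20-50
    guidance_scale=0.0,        # Schnell is run without CFG
).images[0]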

Personalization

To generate “this specific person” or “this specific brand”:

  • DreamBooth: full fine-tune; expensive but high quality.
  • LoRA: small adapter; the dominant method.
  • Textual inversion: learn a new “word” embedding for the concept. Cheap, less flexible.

Train a LoRA on 5–20 photos of a subject → generate new images of them in any context.

Image-gen for production

Concerns:

  • Copyright / training data: ongoing legal and ethical questions.
  • Deepfakes / misuse: generated images can mislead.
  • Watermarking: SynthID (Google), invisible watermarks for provenance.
  • Content moderation: NSFW filters, public-figure refusal.
  • Determinism: same prompt + seed should reproduce; helpful for debugging.

For commercial use:

  • Adobe Firefly: trained on licensed data; safe-for-commercial-use.
  • Pay-for-API services (DALL-E, Midjourney, Imagen) with terms allowing commercial use.
  • Open-weights with safe data (e.g. some Mitsua / community models).

Evaluation

Metrics:

  • FID (Fréchet Inception Distance): distribution similarity to real images. Lower = better.
  • CLIP-Score: how well the generated image matches the prompt.
  • Human preference: pairwise comparisons.
  • VQA metrics: ask a VLM “does this image contain X?”

For real applications, human preference dominates. FID is mostly a research-grade metric.
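
A CLIP-Score sketch using the standard OpenAI CLIP checkpoint via transformers; the score is the cosine similarity between the prompt and image embeddings (image is a PIL image):

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, prompt):
    # Embed both, normalize, take cosine similarity: higher = better prompt match.
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(-1).item()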

See also