How Generative AI Creates Images — A Deep Dive
Abstract
This article gives a comprehensive, technical, and practical overview of how modern generative AI systems create images. It covers historical developments, the principal model families (GANs, VAEs, autoregressive models, diffusion/score-based models), theoretical foundations, architectures, training objectives, conditioning and control for text-to-image and other conditioned generation, sampling strategies, evaluation metrics, representative implementations (e.g., Stable Diffusion, DALL·E, Imagen), applications, risks and ethics, and likely future directions. Where useful, concise pseudocode and equations illustrate core ideas.
Table of contents
- Introduction
- Historical timeline
- Core model families
  - Variational Autoencoders (VAEs)
  - Generative Adversarial Networks (GANs)
  - Autoregressive models (PixelRNN/PixelCNN/ImageGPT)
  - Score-based and Diffusion Models (DDPM, Score SDE, Latent Diffusion)
- Latent spaces and representation learning
- Conditioning and guidance (text-to-image, image editing, control)
- Training data, losses, and evaluation metrics
- Sampling, speed-quality trade-offs, and practical engineering
- Representative architectures and pipelines
  - Stable Diffusion (Latent Diffusion)
  - DALL·E family, Imagen, Midjourney (conceptual overview)
- Applications and examples
- Current state of the art and benchmarks
- Ethical, legal, and societal considerations
- Future directions and open problems
- Summary
- Selected references
Introduction
Generative image models aim to learn a probability distribution p(x) over images x and sample realistic images from it, or learn conditional distributions p(x | y) that produce images given conditioning signals y (e.g., text prompts, class labels, other images). Over the past decade these models evolved from producing small textures and low-resolution images to generating high-resolution, photorealistic, and semantically complex images from text prompts.
The main ingredients enabling this progress are:
- scalable neural network architectures (convolutional nets, attention, transformers, U-Nets),
- new probabilistic learning formulations (adversarial training, variational inference, denoising score matching),
- large-scale training data (web-scraped image–text pairs),
- high compute (GPUs/TPUs) and engineering (mixed precision, distributed training),
- innovations in conditioning and guidance (CLIP, classifier-free guidance, cross-attention).
This article unpacks these elements.
Historical timeline (concise)
- Pre-2010s: texture synthesis, patch-based methods, parametric models (PCA), restricted Boltzmann machines.
- 2013–2014: Variational Autoencoders (VAEs) — Kingma & Welling.
- 2014: Generative Adversarial Networks (GANs) — Goodfellow et al.
- 2015–2017: Autoregressive image models (PixelRNN, PixelCNN), growing use of convolutional architectures.
- 2015: Diffusion probabilistic models introduced (Sohl-Dickstein et al.), revisited later.
- 2018–2020: Transformer architectures and large-scale models adapted to images (Image GPT).
- 2020: Denoising Diffusion Probabilistic Models (DDPM) — Ho et al., showing high-quality samples.
- 2021–2022: Text-conditioned generative models (DALL·E, Imagen, Stable Diffusion, Midjourney), together with the CLIP text–image encoder that several of them use for conditioning or guidance; major practical breakthroughs driven by large datasets and diffusion models.
- 2022–2024: Rapid refinement, latent diffusion, classifier-free guidance, fine-tuning, inpainting, ControlNet; widespread use in creative industries.
Core model families
We describe the main generative model paradigms used for image synthesis: VAEs, GANs, autoregressive models, and diffusion/score-based models. Each has characteristic training objectives, strengths, and weaknesses.
Variational Autoencoders (VAEs)
High-level idea:
- VAEs learn a latent-space representation z of images x. An encoder qφ(z|x) maps x to a distribution in latent space; a decoder pθ(x|z) reconstructs images from z.
- Training optimizes the Evidence Lower Bound (ELBO): maximize E_{qφ(z|x)}[log pθ(x|z)] - KL(qφ(z|x) || p(z)), where p(z) is a chosen prior (often N(0,I)).
Key properties:
- Likelihood-based, principled probabilistic interpretation.
- Training tends to produce smooth latent representations and stable optimization.
- Tends to produce somewhat blurry images because the pixel-wise likelihood (e.g., Gaussian or Bernoulli output) encourages averaging.
Typical architecture:
- Encoder/Decoder using convolutional networks; latent dimension may be moderate (e.g., 64–2048).
- Variants: β-VAE (trades reconstruction quality against disentanglement), hierarchical VAEs, and the Vector-Quantized VAE (VQ-VAE, van den Oord et al.), which discretizes latents so that an autoregressive model can be trained over them.
Use in modern systems:
- VAE encoders/decoders serve as the compression/decompression stages in latent diffusion models: the encoder maps an image to a compact latent, diffusion runs in that latent space, and the decoder maps the result back to pixels.
Equation: ELBO = E_{qφ(z|x)}[log pθ(x|z)] - KL(qφ(z|x) || p(z))
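To make the two ELBO terms concrete, here is a minimal NumPy sketch. It assumes a diagonal-Gaussian encoder, a standard-normal prior, and a unit-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to a constant). The function names and the toy linear `decode` are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def elbo_estimate(x, mu, log_var, decode, rng):
    """Single-sample Monte Carlo estimate of the ELBO, one value per example.

    Assumes a unit-variance Gaussian decoder, so log p(x|z) equals
    -0.5 * ||x - decode(z)||^2 plus a constant that is dropped here.
    """
    z = reparameterize(mu, log_var, rng)
    log_px_given_z = -0.5 * np.sum((x - decode(z)) ** 2, axis=-1)
    return log_px_given_z - gaussian_kl_to_standard_normal(mu, log_var)

# Toy usage with a hypothetical linear decoder and random stand-in "images".
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 5))        # latent dim 2 -> data dim 5
decode = lambda z: z @ W
x = rng.standard_normal((3, 5))        # batch of 3 flattened examples
mu = np.zeros((3, 2))                  # stand-in encoder outputs
log_var = np.zeros((3, 2))
elbo = elbo_estimate(x, mu, log_var, decode, rng)   # shape (3,)
```

In a real VAE, `mu` and `log_var` would come from the encoder network and the negative ELBO would be minimized by gradient descent. The squared-error reconstruction term is also where the averaging (blur) tendency mentioned above comes from.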
Generative Adversarial Networks (GANs)
High-level idea:
- A generator Gθ(z) maps noise z to images x̂; a discriminator Dϕ(x) tries to distinguish real images x from generated ones x̂. Training is a min-max game:
min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))].
Key properties:
- GANs produce very sharp, photorealistic images.
- Training can be unstable; problems like mode collapse (generator produces limited variety) occur.
- Many stabilizations: Wasserstein GANs (WGAN), gradient penalty, spectral normalization, progressive growing.
Advantages:
- High fidelity and sharp details in generated images.
Disadvantages:
- Hard to train, sensitive to hyperparameters, and lacking an explicit likelihood.
Typical GAN training loop (pseudocode):
```
for each training step:
    # Update discriminator
    x_real = sample_real_batch()
    z = sample_noise()
    x_fake = G(z).detach()   # block gradients into G during the D update
    loss_D = -E[log D(x_real)] - E[log(1 - D(x_fake))]
    optimize_D(loss_D)

    # Update generator
    z = sample_noise()
    x_fake = G(z)
    loss_G = -E[log D(x_fake)]   # non-saturating generator loss
    optimize_G(loss_G)
```
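Note that the generator loss in the loop, -E[log D(G(z))], is the commonly used non-saturating variant rather than the E[log(1 − D(G(z)))] term of the min-max objective. The sketch below evaluates both on plain lists of discriminator probabilities and compares their gradients with respect to the discriminator's pre-sigmoid logit. It assumes D ends in a sigmoid, and the helper names are illustrative.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """-E[log D(x_real)] - E[log(1 - D(x_fake))] over lists of probabilities."""
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return -real_term - fake_term

def generator_loss_minimax(d_fake, eps=1e-12):
    """E[log(1 - D(G(z)))]: the generator's term in the min-max objective."""
    return sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)

def generator_loss_nonsaturating(d_fake, eps=1e-12):
    """-E[log D(G(z))]: the form used in the training loop."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)

def grad_wrt_logit_minimax(p):
    """d/da log(1 - sigmoid(a)) at sigmoid(a) = p, which simplifies to -p."""
    return -p

def grad_wrt_logit_nonsaturating(p):
    """d/da [-log sigmoid(a)] at sigmoid(a) = p, which simplifies to -(1 - p)."""
    return -(1.0 - p)

# Early in training the discriminator rejects fakes confidently (p close to 0):
p = 0.01
weak_signal = grad_wrt_logit_minimax(p)          # magnitude 0.01
strong_signal = grad_wrt_logit_nonsaturating(p)  # magnitude 0.99
```

When the discriminator is confident that samples are fake, the min-max form gives the generator almost no gradient, while the non-saturating form keeps the signal close to its maximum. This is the usual motivation for the substitution.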
GANs dominated realistic image generation until diffusion models reached comparable or better quality while being easier to scale and to condition.
Autoregressive models
High-level idea:
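- Model the joint distribution p(x) as a product of conditional distributions over pixels or tokens: p(x) = ∏_t p(x_t | x_{<t}).
[...]
- Classifier-free guidance: combine the model's conditional and unconditional predictions using a guidance scale (scale >= 1). This boosts adherence to conditioning while retaining diversity.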
- Additional controls and extensions:
- ControlNet: add extra conditioning channels (e.g., depth maps, edge maps, sketches) via trainable control networks connected to the diffusion denoiser.
- Attention maps, spatial conditioning, and masking for inpainting.
- Prompt engineering: crafting text prompts to shape style, composition, and attributes. Prompts are often combined with special tokens and weights.
Classifier-free guidance equation (score space): score_guided = score_uncond + scale * (score_cond - score_uncond)
This mechanism is crucial for modern text-to-image systems to balance fidelity to prompt versus sample quality/diversity.
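The guidance equation is a one-line extrapolation. As a minimal sketch, assume the model has already produced an unconditional and a conditional prediction (scores or predicted noise) as arrays of the same shape; the function name and the toy values are illustrative.

```python
import numpy as np

def classifier_free_guidance(pred_uncond, pred_cond, scale):
    """Extrapolate from the unconditional prediction toward the conditional one.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 pushes further in the direction the condition suggests.
    """
    return pred_uncond + scale * (pred_cond - pred_uncond)

# Toy predictions for a 2-element "image".
pred_uncond = np.array([0.0, 0.0])
pred_cond = np.array([1.0, -1.0])
guided = classifier_free_guidance(pred_uncond, pred_cond, scale=7.5)
```

In practice the two predictions typically come from the same network, evaluated once with the prompt and once with an empty or null prompt, so guidance roughly doubles the cost of each denoising step.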
Training data, losses, and evaluation metrics
Training data:
- Image datasets historically: CIFAR, CelebA, LSUN, ImageNet.
- For text-to-image: Conceptual Captions, COCO Captions, and large web-scale scraped datasets like LAION-400M/LAION-5B (image-text pairs).
- Dataset quality, filtering, and licensing issues are critical.
Losses:
- VAEs: ELBO = reconstruction loss + KL regularizer.
- GANs: adversarial loss (plus auxiliary reconstruction/perceptual losses).
- Diffusion: denoising MSE (predict noise) or score matching losses; optionally perceptual or adversarial losses when seeking higher visual fidelity.
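As an illustration of the diffusion loss, the following sketch estimates the noise-prediction MSE for one batch. It assumes the standard DDPM forward process x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε with a linear beta schedule. `predict_noise` is a hypothetical stand-in for the denoising network, and the zero-output "model" exists only to make the example runnable.

```python
import numpy as np

def denoising_loss(x0, predict_noise, alpha_bar, rng):
    """Monte Carlo estimate of the noise-prediction MSE for one batch.

    x0:            clean images, shape (batch, ...)
    predict_noise: hypothetical callable (x_t, t) -> array shaped like x_t
    alpha_bar:     1-D array of cumulative products of (1 - beta_t)
    """
    batch = x0.shape[0]
    t = rng.integers(0, len(alpha_bar), size=batch)        # one timestep per example
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t].reshape((batch,) + (1,) * (x0.ndim - 1))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise       # forward (noising) process
    return float(np.mean((noise - predict_noise(x_t, t)) ** 2))

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # linear beta schedule
x0 = np.zeros((64, 3, 8, 8))                                 # stand-in "images"
always_zero = lambda x_t, t: np.zeros_like(x_t)              # placeholder model
loss = denoising_loss(x0, always_zero, alpha_bar, rng)       # near 1.0 for unit-variance noise
```

A trained network drives this value well below the trivial baseline of about 1.0 obtained by always predicting zero noise.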
Evaluation metrics:
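- Fréchet Inception Distance (FID): fits a Gaussian (mean and covariance) to Inception-v3 feature activations of generated images and of real images, then reports the Fréchet distance between the two Gaussians. Lower is better.
- Inception Score (IS): rewards samples that a pretrained Inception classifier labels confidently (classifiability) and whose predicted labels vary across samples (diversity); it never compares against real images, which is one of its main limitations.
- Kernel Inception Distance (KID): an alternative to FID based on a kernel maximum mean discrepancy over the same features, with an unbiased estimator that behaves better at small sample sizes.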
- Precision and recall for generative models: measure fidelity (precision) and coverage (recall).
- Human evaluation and downstream task utility assessments.
- Text–image alignment metrics (for conditioned models): CLIP similarity between prompt and generated images.
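For intuition about FID, the Fréchet distance between two Gaussians fitted to feature sets is ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). The sketch below computes it with NumPy alone. Random vectors stand in for Inception features, which is an assumption made purely for illustration; a real FID computation would first extract features with a pretrained Inception-v3 network and use many thousands of images.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) feature sets."""
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative when both
    # covariances are positive semi-definite. Clipping guards against tiny
    # negative values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 4))              # stand-in for real-image features
same = frechet_distance(feats, feats)              # approximately 0
shifted = frechet_distance(feats, feats + 1.0)     # approximately 4: squared shift summed over 4 dims
```

Because the score depends on estimated means and covariances, it shifts with sample count, image resizing, and the exact feature extractor, so FID values are only comparable when computed with the same pipeline.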
Limitations:
- FID and IS can be gamed and are sensitive to dataset and preprocessing choices; human evaluation remains important.
Sampling, speed-quality trade-offs, and practical engineering
Diffusion models historically required many denoising steps (e.g., 1000), which produced high-quality samples but made sampling slow. Engineering work improved this:
- DDIM (Denoising Diffusion Implicit Models): deterministic sampling with fewer steps while preserving quality.
- Noise schedules: the choice of schedule (linear, cosine, quadratic) determines how quickly signal is destroyed across timesteps and affects sample quality.
- Distillation: distill many-step samplers into fewer-step samplers (e.g., progressive distillation).
- Latent diffusion: run diffusion in compressed latent space — large speed and memory gains.
- Model quantization, pruning, and mixed precision for deployment.
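To show how noise schedules differ, the sketch below computes the cumulative signal level ᾱ_t (the fraction of signal variance remaining at step t) for a linear beta schedule and for a cosine schedule. The default values (betas from 1e-4 to 0.02, offset s = 0.008) follow commonly cited settings from the DDPM and improved-DDPM papers and should be read as assumptions rather than requirements.

```python
import math

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for linearly spaced betas."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def cosine_alpha_bar(num_steps, s=0.008):
    """Cosine schedule: alpha_bar(t) = f(t) / f(0), f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, num_steps + 1)]

linear = linear_alpha_bar(1000)
cosine = cosine_alpha_bar(1000)
mid_linear, mid_cosine = linear[499], cosine[499]   # signal remaining halfway through
```

Halfway through the process the cosine schedule retains far more signal than the linear one, so fewer timesteps are spent on nearly pure noise. Implementations of the cosine schedule usually also clip the per-step betas near the end to avoid a singularity at the final step.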
Sampling hyperparameters:
- Guidance scale (classifier-free guidance): increases fidelity to conditioning but can reduce diversity or create artifacts if too large.
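- Temperature/eta: parameters controlling how much randomness the sampler injects. In DDIM, eta = 0 gives fully deterministic sampling and eta = 1 recovers DDPM-style ancestral (stochastic) sampling.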
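The role of eta is easiest to see in a single DDIM update. The sketch below assumes a noise-prediction model whose output `eps_pred` is already available, and that `alpha_bar_t` and `alpha_bar_prev` are the cumulative signal levels at the current and the next (less noisy) step. The function and argument names are illustrative.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta, rng):
    """One DDIM update from step t to an earlier step with signal level alpha_bar_prev."""
    # Clean-image estimate implied by the noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # eta scales the injected noise: 0 is deterministic, 1 matches the DDPM posterior variance.
    sigma = (eta
             * np.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t))
             * np.sqrt(1.0 - alpha_bar_t / alpha_bar_prev))
    # Deterministic component pointing back toward x_t along the predicted noise.
    direction = np.sqrt(np.maximum(1.0 - alpha_bar_prev - sigma**2, 0.0)) * eps_pred
    return np.sqrt(alpha_bar_prev) * x0_pred + direction + sigma * rng.standard_normal(x_t.shape)

# Toy check with a perfect noise prediction.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 3))
eps = rng.standard_normal((2, 3))
a_t, a_prev = 0.5, 0.8
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
x_prev = ddim_step(x_t, eps, a_t, a_prev, eta=0.0, rng=rng)
```

With eta = 0 and a perfect noise prediction, the step lands exactly on the less noisy point sqrt(ᾱ_prev) x_0 + sqrt(1 − ᾱ_prev) ε. Because no fresh noise is injected, large jumps between timesteps remain usable, which is what allows sampling with far fewer steps.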
Practical tips for image generation:
- Use classifier-free guidance with moderate scale (e.g., 5–15) for text prompts.
- High-resolution generation often uses ...