How Generative AI Creates Images — A Deep Dive

Abstract
This article gives a comprehensive, technical, and practical overview of how modern generative AI systems create images. It covers historical developments, the principal model families (GANs, VAEs, autoregressive models, diffusion/score-based models), theoretical foundations, architectures, training objectives, conditioning and control for text-to-image and other conditioned generation, sampling strategies, evaluation metrics, representative implementations (e.g., Stable Diffusion, DALL·E, Imagen), applications, risks and ethics, and likely future directions. Where useful, concise pseudocode and equations illustrate core ideas.

Table of contents

  • Introduction
  • Historical timeline
  • Core model families
    • Variational Autoencoders (VAEs)
    • Generative Adversarial Networks (GANs)
    • Autoregressive models (PixelRNN/PixelCNN/ImageGPT)
    • Score-based and Diffusion Models (DDPM, Score SDE, Latent Diffusion)
  • Latent spaces and representation learning
  • Conditioning and guidance (text-to-image, image editing, control)
  • Training data, losses, and evaluation metrics
  • Sampling, speed-quality trade-offs, and practical engineering
  • Representative architectures and pipelines
    • Stable Diffusion (Latent Diffusion)
    • DALL·E family, Imagen, Midjourney (conceptual overview)
  • Applications and examples
  • Current state of the art and benchmarks
  • Ethical, legal, and societal considerations
  • Future directions and open problems
  • Summary
  • Selected references

Introduction

Generative image models aim to learn a probability distribution p(x) over images x and sample realistic images from it, or learn conditional distributions p(x | y) that produce images given conditioning signals y (e.g., text prompts, class labels, other images). Over the past decade these models evolved from producing small textures and low-resolution images to generating high-resolution, photorealistic, and semantically complex images from text prompts.

The main ingredients enabling this progress are:

  • scalable neural network architectures (convolutional nets, attention, transformers, U-Nets),
  • new probabilistic learning formulations (adversarial training, variational inference, denoising score matching),
  • large-scale training data (web-scraped image–text pairs),
  • high compute (GPUs/TPUs) and engineering (mixed precision, distributed training),
  • innovations in conditioning and guidance (CLIP, classifier-free guidance, cross-attention).

This article unpacks these elements.


Historical timeline (concise)

  • Pre-2010s: texture synthesis, patch-based methods, parametric models (PCA), restricted Boltzmann machines.
  • 2013–2014: Variational Autoencoders (VAEs) — Kingma & Welling.
  • 2014: Generative Adversarial Networks (GANs) — Goodfellow et al.
  • 2015–2017: Autoregressive image models (PixelRNN, PixelCNN), growing use of convolutional architectures.
  • 2015: Diffusion probabilistic models introduced (Sohl-Dickstein et al.), revisited later.
  • 2018–2020: Transformer architectures and large-scale models adapted to images (Image GPT).
  • 2020: Denoising Diffusion Probabilistic Models (DDPM) — Ho et al., showing high-quality samples.
  • 2021–2022: Text-conditioned generative models (DALL·E, Imagen, Stable Diffusion, Midjourney), together with the CLIP image-text encoder used to condition and score them; major practical breakthroughs driven by large datasets and, increasingly, diffusion models.
  • 2022–2024: Rapid refinement, latent diffusion, classifier-free guidance, fine-tuning, inpainting, ControlNet; widespread use in creative industries.

Core model families

We describe the main generative model paradigms used for image synthesis: VAEs, GANs, autoregressive models, and diffusion/score-based models. Each has characteristic training objectives, strengths, and weaknesses.

Variational Autoencoders (VAEs)

High-level idea:

  • VAEs learn a latent-space representation z of images x. An encoder qφ(z|x) maps x to a distribution in latent space; a decoder pθ(x|z) reconstructs images from z.
  • Training optimizes the Evidence Lower Bound (ELBO): maximize E_{qφ(z|x)}[log pθ(x|z)] - KL(qφ(z|x) || p(z)), where p(z) is a chosen prior (often N(0,I)).

Key properties:

  • Likelihood-based, principled probabilistic interpretation.
  • Training tends to produce smooth latent representations and stable optimization.
  • Tends to produce somewhat blurry images because the pixel-wise likelihood (e.g., Gaussian or Bernoulli output) encourages averaging.

Typical architecture:

  • Encoder/Decoder using convolutional networks; latent dimension may be moderate (e.g., 64–2048).
  • Variants: β-VAE (tradeoff between reconstruction and disentanglement), hierarchical VAEs, Vector-Quantized VAE (VQ-VAE, Oord et al.) which discretize latents for autoregressive decoders.

Use in modern systems:

  • VAE encoders/decoders are used as compression/decompression steps in latent diffusion models (convert image <-> latent space for efficient diffusion).

Equation: ELBO = E_{qφ(z|x)}[log pθ(x|z)] - KL(qφ(z|x) || p(z))
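
To make the two ELBO terms concrete, here is a minimal sketch in plain Python. It assumes a diagonal-Gaussian encoder (so the KL term has a closed form), a Gaussian decoder with fixed unit variance, and a single-sample Monte Carlo estimate of the reconstruction term; the function names are illustrative, not taken from any library.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def gaussian_log_likelihood(x, x_hat, sigma=1.0):
    """log N(x; x_hat, sigma^2 I): reconstruction term for a Gaussian decoder (assumed)."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    normalizer = 0.5 * len(x) * math.log(2.0 * math.pi * sigma ** 2)
    return -0.5 * squared_error / sigma ** 2 - normalizer


def elbo(x, x_hat, mu, logvar):
    """Single-sample ELBO estimate: reconstruction log-likelihood minus KL."""
    return gaussian_log_likelihood(x, x_hat) - kl_to_standard_normal(mu, logvar)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, logvar = [0.3, -0.2], [-1.0, -0.5]
    z = reparameterize(mu, logvar, rng)          # latent sample fed to the decoder
    x, x_hat = [0.5, 0.1, 0.9], [0.4, 0.2, 0.8]  # x_hat stands in for decoder(z)
    print("z =", z, "ELBO estimate =", elbo(x, x_hat, mu, logvar))
```

Because the KL term is non-negative, the ELBO never exceeds the reconstruction log-likelihood; the squared-error form of that likelihood is also where the tendency toward blurry, averaged outputs comes from.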

Generative Adversarial Networks (GANs)

High-level idea:

  • A generator Gθ(z) maps noise z to images x̂; a discriminator Dϕ(x) tries to distinguish real images x from generated ones x̂. Training is a min-max game: min_G max_D E_{x∼pdata}[log D(x)] + E_{z∼p(z)}[log (1 − D(G(z)))].

Key properties:

  • GANs produce very sharp, photorealistic images.
  • Training can be unstable; problems like mode collapse (generator produces limited variety) occur.
  • Many stabilizations: Wasserstein GANs (WGAN), gradient penalty, spectral normalization, progressive growing.

Advantages:

  • High fidelity and sharp details in generated images.

Disadvantages:

  • Hard to train, sensitive to hyperparameters, lack of explicit likelihood.

Typical GAN training loop (pseudocode):

    for each training step:
        # Update discriminator
        x_real = sample_real_batch()
        z = sample_noise()
        x_fake = G(z).detach()
        loss_D = -E[log D(x_real)] - E[log(1 - D(x_fake))]
        optimize_D(loss_D)

        # Update generator
        z = sample_noise()
        x_fake = G(z)
        loss_G = -E[log D(x_fake)]
        optimize_G(loss_G)
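
The losses in this loop can be evaluated directly on discriminator outputs. The sketch below, in plain Python, assumes D returns probabilities strictly between 0 and 1; the function names are illustrative. Note that the generator loss used in the loop, -E[log D(G(z))], is the "non-saturating" variant suggested in the original GAN paper, not the generator term of the min-max objective itself.

```python
import math


def discriminator_loss(d_real, d_fake):
    """-E[log D(x)] - E[log(1 - D(G(z)))], with batch means as expectations."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss_nonsaturating(d_fake):
    """-E[log D(G(z))]: the variant used in the training loop."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


def generator_loss_minimax(d_fake):
    """E[log(1 - D(G(z)))]: the generator's term in the min-max objective."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)


if __name__ == "__main__":
    # A confident discriminator early in training: fakes are scored near 0.
    d_fake = [0.01, 0.02]
    print("minimax generator loss:       ", generator_loss_minimax(d_fake))
    print("non-saturating generator loss:", generator_loss_nonsaturating(d_fake))
    # At the theoretical equilibrium D outputs 0.5 everywhere.
    print("D loss at equilibrium:", discriminator_loss([0.5, 0.5], [0.5, 0.5]))
```

When the discriminator confidently rejects fakes, the min-max generator term is nearly flat (close to 0), while the non-saturating loss is large and gives a stronger learning signal. At equilibrium, with D = 0.5 everywhere, the discriminator loss equals 2 log 2, about 1.386.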

GANs were dominant for realistic image generation before diffusion models achieved comparable or superior quality with easier scaling properties for conditioning.

Autoregressive models

High-level idea:

  • Model the joint distribution p(x) as a product of conditional distributions over pixels or tokens: p(x) = ∏ p(x_t | x_{<t}). Pixels can be raster-scanned or represented as tokens (VQ-VAE tokens).
  • PixelRNN models pixel-level autoregressive dependencies with recurrent (LSTM) layers; PixelCNN uses masked convolutions instead, which lets training run in parallel across pixels.
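
The chain-rule factorization can be illustrated with a toy binary "image" of a few pixels. In this sketch the conditional P(x_t = 1 | x_<t) is a hand-written rule standing in for a learned network (a purely hypothetical choice); the point is that sampling proceeds one pixel at a time, while the exact log-likelihood is a sum of conditional log-probabilities.

```python
import math
import random


def conditional_prob_one(prefix):
    """Toy stand-in for a learned model: P(x_t = 1 | x_<t).

    Hypothetical rule: each pixel tends to copy the previous pixel.
    """
    if not prefix:
        return 0.5
    return 0.9 if prefix[-1] == 1 else 0.1


def sample(num_pixels, rng):
    """Sequential sampling: each pixel is drawn given all earlier pixels."""
    pixels = []
    for _ in range(num_pixels):
        p_one = conditional_prob_one(pixels)
        pixels.append(1 if rng.random() < p_one else 0)
    return pixels


def log_likelihood(pixels):
    """Exact log p(x) = sum_t log p(x_t | x_<t)."""
    total = 0.0
    for t, value in enumerate(pixels):
        p_one = conditional_prob_one(pixels[:t])
        total += math.log(p_one if value == 1 else 1.0 - p_one)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    x = sample(8, rng)
    print(x, log_likelihood(x))
```

The sampling loop is inherently sequential, which is the source of the slow generation noted under Properties; likelihood evaluation, by contrast, can be parallelized in real models because all prefixes are known in advance.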

Properties:

  • Exact likelihood, often good sample diversity.
  • Sampling is sequential and slow for high-resolution images (one pixel/token at a time).
  • Transformers applied to image tokens (Image GPT, VQ-GAN + transformers) allowed large-scale modeling but sampling speed remains an issue.

Use cases:

  • Image compression, unconditional generation, foundation for some text-to-image models (DALL·E used discrete token approaches in earlier variants).

Score-based and Diffusion Models (DDPMs, Score SDEs)

This class has become the dominant approach for high-quality, flexible image synthesis in recent years.

High-level idea:

  • Define a forward noising process that gradually corrupts data x0 into noise xT by adding Gaussian noise at each timestep t (Markov chain). Then learn a reverse denoising process parameterized by a neural network to recover x0 from xT.
  • Equivalent formulations: diffusion probabilistic models (DDPM) and score-based generative models trained by denoising score matching (Song & Ermon). The Score SDE framework unifies them in a continuous-time view.

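Forward process (closed form): x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε, where ε ∼ N(0, I) and ᾱ_t = ∏_{s=1}^{t} (1 − β_s) for a noise schedule β_1, ..., β_T.

The denoising network εθ is trained to predict the noise ε (or, alternatively, x_0). The commonly used loss is: L(θ) = E_{x0, ε, t}[ || ε - εθ(x_t, t) ||^2 ]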
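
A minimal sketch of the forward process and the noise-prediction loss follows, in plain Python. It assumes a linear schedule of per-step noise variances β_t from 1e-4 to 0.02 over 1000 steps (the values reported by Ho et al.; other schedules are common), with ᾱ_t the running product of (1 − β_t). Timesteps are 0-based indices here, and all function names are illustrative.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_t (assumed linear schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    a = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]
    return x_t, eps


def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bar = cumulative_alpha_bar(linear_beta_schedule(1000))
    x0 = [1.0, -1.0, 0.5]
    for t in (0, 499, 999):
        x_t, eps = q_sample(x0, t, alpha_bar, rng)
        print(t, round(alpha_bar[t], 5), [round(v, 3) for v in x_t])
```

Early timesteps leave x_0 almost intact, while by the last step ᾱ_t is close to zero and x_t is essentially pure Gaussian noise. In training, `eps_pred` would come from the network evaluated at (x_t, t).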

Sampling:

  • Start from pure noise x_T and iteratively apply the learned reverse transition to obtain x_{T−1}, ..., x_0.

Key properties:

  • Very high sample quality, stable training (simple MSE losses), good at scaling with compute/data.
  • Sampling requires many iterative denoising steps (e.g., hundreds to thousands), but techniques like DDIM, sampling acceleration, and distillation reduce steps.
  • Naturally supports classifier guidance and classifier-free guidance for conditioning.

Latent diffusion:

  • Apply diffusion in a lower-dimensional latent space (VAE-encoded) rather than pixel space to be much more computationally efficient (Stable Diffusion).

Equation (simplified DDPM denoising loss): L_simple = E_{x0, ε∼N(0,I), t} [ || ε - εθ(x_t, t) ||^2 ]

Pseudocode for sampling from a DDPM (very simplified):

    x_T = sample_normal()
    for t = T down to 1:
        predicted_noise = model(x_t, t)
        x_{t-1} = some_denoising_update(x_t, predicted_noise, t)
    return x_0
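
One concrete choice for the denoising update is the DDPM ancestral step of Ho et al. The sketch below runs it on scalar data in plain Python. Two assumptions are made for illustration: the per-step variance is set to β_t (one of the options discussed in that paper), and the trained network is replaced by an "oracle" that returns the exact noise for a toy dataset concentrated on a single value. All names are illustrative.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear betas and their cumulative alpha_bar (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return betas, alpha_bar


def oracle_noise(x_t, t, alpha_bar, target):
    """Exact noise when all data equals `target` (toy stand-in for the network)."""
    a = alpha_bar[t]
    return (x_t - math.sqrt(a) * target) / math.sqrt(1.0 - a)


def ddpm_step(x_t, eps_pred, t, betas, alpha_bar, rng):
    """One ancestral step: posterior mean plus fresh noise (no noise at the last step)."""
    beta, a_bar = betas[t], alpha_bar[t]
    mean = (x_t - beta / math.sqrt(1.0 - a_bar) * eps_pred) / math.sqrt(1.0 - beta)
    if t == 0:
        return mean
    return mean + math.sqrt(beta) * rng.gauss(0.0, 1.0)


def sample(target, num_steps=1000, seed=0):
    """Start from Gaussian noise and apply the reverse step for t = T-1, ..., 0."""
    rng = random.Random(seed)
    betas, alpha_bar = make_schedule(num_steps)
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(num_steps)):
        eps_pred = oracle_noise(x, t, alpha_bar, target)
        x = ddpm_step(x, eps_pred, t, betas, alpha_bar, rng)
    return x


if __name__ == "__main__":
    print(sample(target=2.5))
```

With a perfect noise predictor the chain lands on the data value regardless of the starting noise; a real model only approximates the noise, so sample quality depends on how well εθ was trained.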

Advantages:

  • High fidelity, flexible conditioning, relatively stable training, effective with classifier-free guidance for text conditioning.

Latent spaces and representation learning

Latent spaces are continuous vector spaces where information about images is represented compactly. Important properties:

  • Smoothness: nearby points map to similar images.
  • Interpolation: linear or spherical interpolation in latent space yields smooth transitions in image space.
  • Arithmetic: vector arithmetic can control semantic attributes (to some extent).
  • Disentanglement: ideal latent axes correspond to distinct factors of variation (a research goal).
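
The two interpolation schemes mentioned above can be sketched in a few lines of plain Python. The helper names are illustrative; the spherical variant falls back to linear interpolation when the two vectors are nearly parallel, and zero-length inputs are not handled.

```python
import math


def lerp(v0, v1, t):
    """Linear interpolation between two latent vectors."""
    return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]


def slerp(v0, v1, t, eps=1e-7):
    """Spherical interpolation: moves along the arc between v0 and v1."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Nearly parallel (or antiparallel) vectors: the arc is ill-defined.
        return lerp(v0, v1, t)
    w0 = math.sin((1.0 - t) * omega) / math.sin(omega)
    w1 = math.sin(t * omega) / math.sin(omega)
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


if __name__ == "__main__":
    z0, z1 = [1.0, 0.0], [0.0, 1.0]
    print("lerp midpoint norm: ", norm(lerp(z0, z1, 0.5)))
    print("slerp midpoint norm:", norm(slerp(z0, z1, 0.5)))
```

For Gaussian latents, samples concentrate near a shell of roughly constant norm, and a linear midpoint has a smaller norm than either endpoint. Spherical interpolation keeps the norm closer to that of the endpoints, which is one reason it is often preferred for such latents.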

Uses:

  • Latent vectors are used for sampling, image editing (e.g., adding attribute vectors), image similarity search, style transfer, and accelerating sampling (diffusion in latent space).

Representative techniques:

  • VAE latent spaces (probabilistic).
  • Conditional latents (concatenate label or text embeddings).
  • Latent diffusion: encode image via VAE encoder into latent z; run diffusion model in latent space; decode with VAE decoder.

Conditioning and guidance

Most practical generation tasks are conditional: text-to-image, image-to-image, inpainting, super-resolution, style transfer.

Key mechanisms:

  1. Text encoders:

    • CLIP: contrastive image-text model that maps text and images to a shared embedding space. CLIP is widely used both for scoring generated images and for providing text embeddings for conditioning.
    • Transformer text encoders (e.g., T5, BERT-like encoders) used in some pipelines.
  2. Conditioning mechanisms:

    • Concatenation: append conditioning vectors to the model input.
    • Cross-attention: in diffusion U-Nets, cross-attention layers allow the denoiser to attend to text embeddings at each denoising step; this is the core of Stable Diffusion’s text conditioning.
    • Classifier guidance: use gradients from a separately trained classifier to steer sampling toward a desired class.
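    • Classifier-free guidance: a widely used technique that trains a single model with the conditioning randomly dropped, so it learns both conditional and unconditional predictions; at sampling time the two scores are mixed: s_guided = s_uncond + w * (s_cond - s_uncond), where w is the guidance scale (w = 1 recovers the purely conditional model; w > 1 strengthens the conditioning). This boosts adherence to the conditioning signal, typically at some cost to sample diversity.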
  3. Additional controls and extensions:

    • ControlNet: add extra conditioning channels (e.g., depth maps, edge maps, sketches) via trainable control networks connected to the diffusion denoiser.
    • Attention maps, spatial conditioning, and masking for inpainting.
    • Prompt engineering: crafting text prompts to shape style, composition, and attributes. Prompts are often combined with special tokens and weights.
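
The cross-attention mechanism listed above can be sketched with plain Python lists. This is a single-head, unbatched illustration under stated assumptions: image features supply the queries, text embeddings supply keys and values, and the projection matrices (`w_q`, `w_k`, `w_v`) are hypothetical stand-ins for learned weights. Real implementations use multiple heads and tensor libraries.

```python
import math


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feats, text_embeds, w_q, w_k, w_v):
    """Each image position attends over all text tokens.

    Returns (output, weights): output has one row per image position,
    weights[i][j] is how much image position i attends to text token j.
    """
    q = matmul(image_feats, w_q)   # queries from the image (latent) features
    k = matmul(text_embeds, w_k)   # keys from the text embeddings
    v = matmul(text_embeds, w_v)   # values from the text embeddings
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    image_feats = [[1.0, 0.0], [0.0, 1.0]]               # 2 spatial positions
    text_embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 text tokens
    out, weights = cross_attention(image_feats, text_embeds, identity, identity, identity)
    print(weights)
    print(out)
```

Each row of `weights` sums to 1, so every spatial position receives a convex combination of the text value vectors; this is how prompt information reaches each location of the denoiser's feature map at every step.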

Classifier-free guidance equation (score space): score_guided = score_uncond + scale * (score_cond - score_uncond)

This mechanism is crucial for modern text-to-image systems to balance fidelity to prompt versus sample quality/diversity.
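
The guidance rule is a one-line extrapolation. The sketch below applies it to noise predictions represented as plain Python lists (the same linear combination is used whether the model outputs scores or noise estimates); the numbers are made up for illustration.

```python
def classifier_free_guidance(pred_uncond, pred_cond, scale):
    """Return pred_uncond + scale * (pred_cond - pred_uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


if __name__ == "__main__":
    pred_uncond = [0.0, 1.0]   # model output with the prompt dropped
    pred_cond = [1.0, 1.0]     # model output with the prompt
    for scale in (0.0, 1.0, 7.5):
        print(scale, classifier_free_guidance(pred_uncond, pred_cond, scale))
```

A scale of 0 ignores the prompt, 1 reproduces the conditional prediction, and larger values push further along the direction in which the prompt changes the prediction. Components where the two predictions agree are left unchanged.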


Training data, losses, and evaluation metrics

Training data:

  • Image datasets historically: CIFAR, CelebA, LSUN, ImageNet.
  • For text-to-image: Conceptual Captions, COCO Captions, and large web-scale scraped datasets like LAION-400M/LAION-5B (image-text pairs).
  • Dataset quality, filtering, and licensing issues are critical.

Losses:

  • VAEs: ELBO = reconstruction loss + KL regularizer.
  • GANs: adversarial loss (plus auxiliary reconstruction/perceptual losses).
  • Diffusion: denoising MSE (predict noise) or score matching losses; optionally perceptual or adversarial losses when seeking higher visual fidelity.

Evaluation metrics:

  • Fréchet Inception Distance (FID): compares statistics of feature activations of generated vs real images. Lower is better.
  • Inception Score (IS): measures both classifiability and diversity; has limitations.
  • Kernel Inception Distance (KID): alternative to FID.
  • Precision and recall for generative models: measure fidelity (precision) and coverage (recall).
  • Human evaluation and downstream task utility assessments.
  • Text–image alignment metrics (for conditioned models): CLIP similarity between prompt and generated images.
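
To show what "comparing feature statistics" means for FID, the sketch below computes the Fréchet distance between two Gaussians fitted to feature vectors. It is a deliberate simplification: covariances are assumed diagonal so that no matrix square root is needed, and the features are arbitrary small vectors. Actual FID uses Inception-network activations and full covariance matrices, so the numbers here are not comparable to published FID scores.

```python
import math


def mean_and_variance(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var


def frechet_distance_diagonal(feats_real, feats_gen):
    """||mu_r - mu_g||^2 + sum_d (sigma_r,d - sigma_g,d)^2 for diagonal covariances.

    This is the general formula ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    specialized to diagonal S_r and S_g (a simplifying assumption).
    """
    mu_r, var_r = mean_and_variance(feats_real)
    mu_g, var_g = mean_and_variance(feats_gen)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term


if __name__ == "__main__":
    real = [[0.0, 0.0], [2.0, 2.0]]
    shifted = [[3.0, 0.0], [5.0, 2.0]]   # same spread, mean shifted by 3 in dim 0
    print(frechet_distance_diagonal(real, real))
    print(frechet_distance_diagonal(real, shifted))
```

The distance is zero only when both the means and the spreads match, which is why FID penalizes both low fidelity (shifted statistics) and low diversity (collapsed variance).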

Limitations:

  • FID and IS can be gamed and are sensitive to dataset and preprocessing choices; human evaluation remains important.

Sampling, speed-quality trade-offs, and practical engineering

Diffusion models historically required many denoising steps (e.g., 1000), which produced high-quality samples but made sampling slow. Engineering work improved this:

  • DDIM (Denoising Diffusion Implicit Models): deterministic sampling with fewer steps while preserving quality.
  • Noise schedules: the choice of schedule (linear, cosine, quadratic) determines how quickly signal is destroyed across timesteps and affects sample quality.
  • Distillation: distill many-step samplers into fewer-step samplers (e.g., progressive distillation).
  • Latent diffusion: run diffusion in compressed latent space — large speed and memory gains.
  • Model quantization, pruning, and mixed precision for deployment.
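
The difference between noise schedules is easiest to see by comparing how much signal (ᾱ_t) remains at each timestep. The sketch below contrasts a linear β schedule (assumed range 1e-4 to 0.02) with the cosine schedule of Nichol & Dhariwal (offset s = 0.008). The clipping of β_t that the paper applies near the final step is omitted, and the function names are illustrative.

```python
import math


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for betas increasing linearly from beta_start to beta_end."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + i * (beta_end - beta_start) / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def cosine_alpha_bar(num_steps, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos((t / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(1, num_steps + 1)]


if __name__ == "__main__":
    T = 1000
    linear, cosine = linear_alpha_bar(T), cosine_alpha_bar(T)
    for t in (100, 250, 500, 750, 1000):
        print(t, round(linear[t - 1], 4), round(cosine[t - 1], 4))
```

Under these settings the linear schedule has destroyed most of the signal by the midpoint, whereas the cosine schedule still retains roughly half of it, so more of the timesteps are spent at informative noise levels.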

Sampling hyperparameters:

  • Guidance scale (classifier-free guidance): increases fidelity to conditioning but can reduce diversity or create artifacts if too large.
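  • Temperature/eta: parameters controlling sampler randomness. In DDIM, eta = 0 gives fully deterministic sampling and eta = 1 recovers DDPM-style ancestral (stochastic) sampling.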
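
Few-step sampling and the eta parameter can be illustrated with the DDIM update rule on scalar data. As assumptions for this sketch: the schedule is linear (1e-4 to 0.02 over 1000 training steps), sampling visits an evenly strided subset of those steps, and the trained network is replaced by an "oracle" that returns the exact noise for a toy dataset concentrated on a single value. All names are illustrative.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + i * (beta_end - beta_start) / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def ddim_step(x_t, eps_pred, a_t, a_prev, eta, rng):
    """One DDIM update from noise level a_t to the (less noisy) level a_prev."""
    x0_pred = (x_t - math.sqrt(1.0 - a_t) * eps_pred) / math.sqrt(a_t)
    sigma = eta * math.sqrt((1.0 - a_prev) / (1.0 - a_t)) * math.sqrt(1.0 - a_t / a_prev)
    direction = math.sqrt(max(1.0 - a_prev - sigma ** 2, 0.0)) * eps_pred
    noise = sigma * rng.gauss(0.0, 1.0) if sigma > 0.0 else 0.0
    return math.sqrt(a_prev) * x0_pred + direction + noise


def sample(target, num_sampling_steps=10, eta=0.0, seed=0):
    """Run DDIM over a strided subset of the training timesteps."""
    rng = random.Random(seed)
    alpha_bar = alpha_bar_schedule()
    total = len(alpha_bar)
    stride = max(1, total // num_sampling_steps)
    timesteps = list(range(total - 1, -1, -stride))
    x = rng.gauss(0.0, 1.0)
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the last visited step, jump to the clean data level (alpha_bar = 1).
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        # Oracle noise estimate: stands in for model(x, t) in this toy setting.
        eps_pred = (x - math.sqrt(a_t) * target) / math.sqrt(1.0 - a_t)
        x = ddim_step(x, eps_pred, a_t, a_prev, eta, rng)
    return x


if __name__ == "__main__":
    print("eta=0:", sample(2.5, eta=0.0))
    print("eta=1:", sample(2.5, eta=1.0))
```

With eta = 0 the only randomness is the initial noise, so a fixed seed fully determines the output; with eta > 0 fresh noise is injected at every step. The oracle makes both settings land on the data value here, whereas with a learned model the choice of eta and step count trades off speed, diversity, and quality.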

Practical tips for image generation:

  • Use classifier-free guidance with moderate scale (e.g., 5–15) for text prompts.
  • High-resolution generation often uses multi-stage pipelines (generate at lower resolution then super-resolve).
  • Conditioning and negative prompts: specify undesired attributes to reduce artifacts.
  • Seed control yields deterministic reproducible outputs.

Representative architectures and pipelines

We summarize the conceptual architecture for modern text-to-image pipelines, focusing on Stable Diffusion (Latent Diffusion) as a widely used, transparent example.

Stable Diffusion (conceptual components)

  • VAE Encoder/Decoder: learn to compress/decompress images to/from a latent z (much smaller spatial dimensions).
  • U-Net Denoiser: a U-Net conditioned on timestep t and text embeddings; operates in latent space z_t and predicts noise.
  • Text Encoder: CLIP text encoder producing embeddings; cross-attention layers in U-Net attend to these embeddings.
  • Scheduler/Sampling module: iteratively denoise from Gaussian latent z_T to z_0, then decode with VAE decoder to pixel space.

Pipeline (high-level):

  1. Encode text prompt with text encoder -> embeddings.
  2. Sample z_T ∼ N(0, I).
  3. For t = T..1: z_{t-1} = denoise_step(U-Net(z_t, t, embeddings), scheduler).
  4. Decode z_0 with VAE decoder to obtain image.

Latent diffusion dramatically lowers memory and flops vs pixel-space diffusion, enabling generation on consumer GPUs.
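
A quick back-of-the-envelope calculation shows where the savings come from. The numbers below are assumptions chosen to match the commonly documented Stable Diffusion v1 configuration (a VAE that downsamples each spatial dimension by 8 and uses 4 latent channels); other latent diffusion models use different factors.

```python
def tensor_elements(height, width, channels):
    """Number of scalar values in an image or latent tensor."""
    return height * width * channels


def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    """Latent (height, width, channels) for an assumed VAE configuration."""
    return height // downsample_factor, width // downsample_factor, latent_channels


if __name__ == "__main__":
    image = (512, 512, 3)
    latent = latent_shape(512, 512)
    pixel_values = tensor_elements(*image)
    latent_values = tensor_elements(*latent)
    print("image values: ", pixel_values)
    print("latent shape: ", latent, "values:", latent_values)
    print("compression ratio:", pixel_values / latent_values)
    # Self-attention cost grows with the square of the number of spatial positions.
    positions_ratio = (image[0] * image[1]) / (latent[0] * latent[1])
    print("spatial positions ratio:", positions_ratio)
```

Under these assumptions a 512×512 RGB image (786,432 values) becomes a 64×64×4 latent (16,384 values), a 48-fold reduction, and the denoiser works over 64 times fewer spatial positions, which matters most for attention layers.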

DALL·E 2 / Imagen overview (conceptual)

  • DALL·E 2: two-stage. A prior first generates a CLIP image embedding conditioned on the text (the paper's preferred prior is a diffusion model in embedding space); a diffusion decoder then generates the image conditioned on that embedding. CLIP's shared text-image embedding space is central to the design.
  • Imagen: a large frozen text encoder (T5) feeding a pixel-space diffusion model, followed by super-resolution diffusion stages; showed strong text alignment.

Applications and examples

Applications across domains:

  • Creative content:
    • Digital art, illustrations, concept art, storyboarding.
    • Style transfer and remixing.
  • Content generation for media:
    • Advertising, marketing visuals, product mockups.
  • Design and prototyping:
    • UI mockups, fashion, interior design concepts.
  • Game and film production:
    • Asset generation, textures, background scenes.
  • Data augmentation and simulation:
    • Synthetic training images for machine learning (careful validation required).
  • Image editing:
    • Inpainting (fill masked regions), super-resolution, colorization.
  • Scientific and medical imaging:
    • Simulation and augmentation (requires strict validation and regulatory care).

Example prompt (text-to-image): "Photorealistic portrait of a young scientist in a lab, warm cinematic lighting, shallow depth of field, ultra-detailed, 35mm"

Practical workflow:

  • Iterate prompts, use negative prompts to avoid unwanted features (e.g., "low quality, watermark").
  • Use seed for reproducibility; use higher guidance scales to increase prompt adherence.
  • Apply inpainting or ControlNet for precise composition control.

Current state-of-the-art and benchmarks

As of 2024, diffusion-based models implemented with powerful text encoders dominate practical text-to-image generation:

  • Stable Diffusion family: accessible, fast (latent diffusion), extensible with ControlNet and fine-tuning.
  • Imagen and DALL·E 2 (and 3): show state-of-the-art alignment and high fidelity; DALL·E 3 improved prompt adherence via deeper integration with language models.
  • Midjourney: proprietary, stylistically distinct outputs.
  • Evaluation: model performance judged by human preference studies, CLIP alignment scores, FID; no single metric fully captures quality.

Scalability trends:

  • Improved results via larger models, higher-capacity text encoders, better dataset curation.
  • Improved techniques for efficiency: distillation, latent modeling, sparse attention, efficient U-Nets.

Ethical, legal, and societal considerations

Generative image models pose important challenges:

  1. Copyright and training data:

    • Models trained on scraped images may reproduce copyrighted content or styles; legal frameworks are evolving.
    • Artists and creators have raised concerns about unauthorized use of their works.
  2. Misinformation and deepfakes:

    • Photorealistic generation enables convincing forgeries; detection and provenance are active areas (watermarking, forensic models).
  3. Bias and harms:

    • Datasets reflect societal biases (demographics, stereotypes); outputs can perpetuate or amplify biases.
    • Content policy and safety filters are needed but imperfect.
  4. Attribution and transparency:

    • Methods for model transparency, watermarking/generated content detection, and dataset reporting are strongly recommended.
  5. Socioeconomic impacts:

    • Disruption to creative jobs, but also new tools for creators and productivity gains.

Mitigations:

  • Model cards, dataset documentation (datasheets), provenance metadata embedding or watermarking, opt-out mechanisms for artists, licensing frameworks, improved dataset curation and filtering.

Future directions and open problems

Research and engineering directions likely to shape the next years:

  • Faster sampling: near real-time, single-digit step samplers via distillation.
  • Multimodal generation and reasoning: tighter integration with LLMs for layout, composition, and multimodal planning (image + text + video).
  • 3D-aware image generation: generating consistent multi-view images, neural radiance fields (NeRF)-like outputs, and direct 3D asset generation.
  • Personalization and controllability: user-specific models, style-locked generation, fine-grained attribute control.
  • Explainability and interpretability of latent and attention mechanisms to better understand and control outputs.
  • Robustness and safety: watermarking, detectable fingerprints, bias mitigation, adversarial robustness.
  • Legal and economic structures: licensing models and artist compensation mechanisms.

Open technical problems:

  • Compositional generalization: reliably composing novel concepts not seen jointly during training.
  • Faithful rendering of complex instructions (multi-object spatial relations).
  • Reducing hallucinations and unwanted artifacts in generated images.

Summary

Modern generative image systems are built on probabilistic modeling frameworks (VAEs, GANs, autoregressive, diffusion models) combined with powerful conditioning mechanisms (text encoders, cross-attention, classifier-free guidance). Diffusion-based approaches — particularly latent diffusion — have become dominant thanks to their combination of fidelity, stability, and flexible conditioning. Practical systems depend heavily on large, curated datasets, high compute, and careful engineering of training objectives, sampling algorithms, and conditioning mechanisms.

These technologies enable broad and beneficial applications but raise nontrivial ethical, legal, and societal challenges. Research continues on improving fidelity, efficiency, controllability, and safety, while policy and design choices will determine how generative image AI impacts society.


Selected references

  • Goodfellow et al., "Generative Adversarial Nets", 2014.
  • Kingma & Welling, "Auto-Encoding Variational Bayes", 2013.
  • Oord et al., "Pixel Recurrent Neural Networks", 2016; "Conditional Image Generation with PixelCNN Decoders", 2016.
  • Sohl-Dickstein et al., "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", 2015.
  • Ho et al., "Denoising Diffusion Probabilistic Models", 2020.
  • Song & Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution", 2019.
  • Nichol & Dhariwal, "Improved Denoising Diffusion Probabilistic Models", 2021.
  • Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models", 2022 (Stable Diffusion).
  • Radford et al., "Learning Transferable Visual Models from Natural Language Supervision (CLIP)", 2021.
  • Ramesh et al., "Zero-Shot Text-to-Image Generation" (DALL·E), 2021; "Hierarchical Text-Conditional Image Generation with CLIP Latents" (DALL·E 2), 2022.
  • Saharia et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding" (Imagen), 2022.

Appendix: Minimal pseudocode examples

GAN training (very simplified):

    for iter in training_steps:
        x_real = sample_real_images()
        z = sample_noise()
        x_fake = G(z).detach()
        loss_D = BCE(D(x_real), 1) + BCE(D(x_fake), 0)
        optimize(D, loss_D)

        z = sample_noise()
        x_fake = G(z)
        loss_G = BCE(D(x_fake), 1)
        optimize(G, loss_G)

DDPM (training and sampling, highly simplified):

    # Training
    for each x0 in dataset:
        t = random_timestep()
        epsilon = normal_sample()
        xt = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * epsilon
        loss = ||epsilon - model(xt, t)||^2
        optimize(model, loss)

    # Sampling
    xt = normal_sample()
    for t = T downto 1:
        predicted_eps = model(xt, t)
        xt = denoise_update(xt, predicted_eps, t)  # per scheduler; returns x_{t-1}
    return xt  # after the final step this is x_0

Classifier-free guidance (sampling step):

    eps_cond = model(xt, t, cond=prompt)
    eps_uncond = model(xt, t, cond=None)
    eps_guided = eps_uncond + scale * (eps_cond - eps_uncond)
    # use eps_guided in denoising update
