How does generative AI work?
Table of contents
- Introduction and definition
- Short history and milestones
- Main families of generative models
- Autoregressive models
- Variational Autoencoders (VAEs)
- Generative Adversarial Networks (GANs)
- Flow-based models
- Score-based / Diffusion models
- Core theoretical foundations and math
- Probability factorization and likelihood
- Maximum likelihood and cross-entropy
- Latent-variable modelling and ELBO
- Adversarial objectives
- Score matching and diffusion objective
- Self-attention and transformer basics
- Architectures and building blocks
- Encoders/decoders, CNNs, RNNs, Transformers
- Attention mechanism and positional encoding
- Conditioning, control, and guidance
- Training and sampling procedures
- Training objectives and optimization
- Sampling algorithms (ancestral, beam, nucleus/top-p, temperature)
- Diffusion sampling (DDPM, deterministic samplers, classifier-free guidance)
- Evaluation metrics and limitations
- Practical applications and examples
- Text, images, audio, video, molecules, code
- Representative models and systems
- Implementation examples (short code & pseudocode)
- Challenges, risks, and safety mitigations
- Current state of the field (as of mid-2024)
- Future directions and open problems
- Recommended readings and resources
- Summary
Introduction and definition
Generative AI refers to machine learning systems that learn a model of data and use it to generate new, previously unseen examples that resemble the training distribution. These models can produce images, text, audio, video, 3D shapes, molecules, code, and more. “Generative” contrasts with “discriminative” models: discriminative models predict labels given inputs, while generative models attempt to model the data distribution itself p(x) (or conditional p(x|c)).
Short history and milestones
- Pre-deep-learning era: probabilistic models, mixture models, HMMs (1990s–2000s).
- 2013–2014: The reparameterization trick and VAEs (Kingma & Welling) enabled scalable variational learning for deep latent-variable models.
- 2014: Generative Adversarial Networks (GANs; Goodfellow et al.) introduced adversarial training to generate realistic images.
- Mid-2010s: Autoregressive models such as PixelRNN/PixelCNN modeled images pixel by pixel, and the Transformer (Vaswani et al., 2017) for sequence modeling led to large language models (the GPT series).
- 2019–2022: Diffusion models and score-based models regained prominence for high-quality image generation (e.g., DDPM, 2020; Score-based models, Song & Ermon).
- 2021–2023: Multimodal models (CLIP, ALIGN) linked text and images in a shared representation, and latent diffusion (Stable Diffusion) made high-resolution image synthesis efficient.
- 2020–2024: Scaling Transformers produced dramatic gains in text generation, code generation, and few-shot learning (GPT-3, PaLM, LLaMA, etc.). Diffusion and autoregressive approaches are both dominant paradigms for different media.
Main families of generative models
Generative models can be grouped roughly by how they model the data and how they generate samples.
1) Autoregressive models
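- Principle: Factorize the joint probability p(x) as a product of conditionals using the chain rule: p(x) = ∏_i p(x_i | x_{<i}).
2) Variational Autoencoders (VAEs)
- Principle: An encoder defines q(z|x) (the approximate posterior over latents z); a decoder p(x|z) generates data from z.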
- Key trick: reparameterization allows backprop through sampling.
- Pros: Principled probabilistic framework, latent representations useful for manipulation.
- Cons: Often produce blurry images (likelihood-driven training with a simple decoder tends to average over plausible outputs); trade-off between reconstruction quality and latent regularization.
ELBO: log p(x) ≥ E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z))
3) Generative Adversarial Networks (GANs)
- Principle: Two networks: generator G(z) that maps noise z to data space, discriminator D(x) that tries to distinguish real vs fake. Train via a minimax game.
- Original objective (Goodfellow et al., 2014): min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]
- Pros: Very sharp, realistic samples (especially for images).
- Cons: Training instability, mode collapse, lack of explicit likelihood (harder to evaluate probability).
4) Flow-based models (Normalizing Flows)
- Principle: Construct invertible transformations f that map data x to a latent z = f(x) with a tractable Jacobian determinant. The change-of-variables formula then gives an exact log-likelihood, and sampling applies the inverse f^{-1} to z drawn from the base density p_z.
- Examples: RealNVP, Glow.
- Pros: Exact likelihood, exact inversion between data and latent space, and efficient sampling for coupling-layer designs such as RealNVP and Glow.
- Cons: Architectural constraints (invertibility, Jacobian computation) can limit expressivity.
Change of variables: log p(x) = log p_z(f(x)) + log |det ∂f(x)/∂x|
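The change-of-variables formula can be checked numerically on a toy case. The sketch below assumes a one-dimensional affine flow f(x) = (x - mu) / sigma with a standard normal base density p_z; the function names (`flow_log_prob`, `flow_sample`, `gaussian_log_prob`) and the parameter values are illustrative choices, not taken from any flow library. Under these assumptions the flow density should agree with the N(mu, sigma^2) density computed directly.

```python
import numpy as np

def standard_normal_log_prob(z):
    # log density of the base distribution p_z = N(0, 1)
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

def flow_log_prob(x, mu, sigma):
    # z = f(x) for the assumed affine flow f(x) = (x - mu) / sigma
    z = (x - mu) / sigma
    # df/dx = 1 / sigma, so log |det df/dx| = -log(sigma)
    log_det = -np.log(sigma)
    return standard_normal_log_prob(z) + log_det

def flow_sample(n, mu, sigma, rng):
    # Sampling runs the inverse map: x = f^{-1}(z) = mu + sigma * z
    z = rng.standard_normal(n)
    return mu + sigma * z

def gaussian_log_prob(x, mu, sigma):
    # Reference: log density of N(mu, sigma^2) written out directly
    return (-0.5 * ((x - mu) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))

x = np.linspace(-3.0, 5.0, 9)
print(np.allclose(flow_log_prob(x, 1.5, 2.0), gaussian_log_prob(x, 1.5, 2.0)))
```

Real flows such as RealNVP and Glow stack many such invertible layers and sum the per-layer log-determinants; the single affine layer here only illustrates the bookkeeping.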
5) Score-based / Diffusion models
- Principle: Define a forward noising process q(x_t | x_{t-1}) that gradually adds Gaussian noise; learn a reverse denoising model p_θ(x_{t-1} | x_t) or directly learn the score ∇_x log p_t(x). Sampling runs the learned reverse process to map pure noise to data.
- Important works: Sohl-Dickstein et al. (2015), Song & Ermon (score-based), Ho et al. (DDPM, 2020), Nichol & Dhariwal, and Latent Diffusion (Rombach et al., 2022).
- Pros: State-of-the-art image quality, flexible conditioning methods, strong theoretical foundations via score matching and stochastic differential equations.
- Cons: Sampling requires many steps (though efficient samplers and distillation reduce steps), computational cost.
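The forward noising process has a closed-form marginal, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε, which can be sketched in a few lines. The example assumes a linear β schedule from 1e-4 to 0.02 over 1000 steps (the values reported for DDPM by Ho et al.); the function names and the constant toy data are illustrative assumptions.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed schedule: per-step noise variances beta_t, linearly spaced
    return np.linspace(beta_start, beta_end, num_steps)

def alpha_bar(betas):
    # alpha_bar_t = prod_{s <= t} (1 - beta_s): fraction of signal variance kept
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bars, rng):
    # Draw x_t ~ q(x_t | x_0) in one shot instead of iterating t noising steps
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

betas = linear_beta_schedule()
alpha_bars = alpha_bar(betas)
rng = np.random.default_rng(0)

x0 = np.full(10_000, 3.0)  # toy "data": every sample equals 3.0
x_early, _ = q_sample(x0, 0, alpha_bars, rng)
x_late, _ = q_sample(x0, len(betas) - 1, alpha_bars, rng)
print("early step: mean %.3f, std %.3f" % (x_early.mean(), x_early.std()))
print("final step: mean %.3f, std %.3f" % (x_late.mean(), x_late.std()))
```

At the first step the samples stay close to the data, while at the final step they are nearly indistinguishable from standard Gaussian noise. That is why sampling can start from pure noise and run the learned reverse process.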
Core theoretical foundations and math
Probability factorization and likelihood
- Maximum likelihood estimation (MLE) is a central principle: choose model parameters θ to maximize ∑_i log p_θ(x^(i)).
- For autoregressive models, exact likelihood is tractable because of factorization.
Cross-entropy and perplexity
- For discrete sequences, the negative log-likelihood (cross-entropy) is the training loss. Perplexity, defined as exp(average negative log-likelihood per token), is the standard evaluation metric for language models.
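A small worked example may make the link between the chain-rule factorization, cross-entropy, and perplexity concrete. The sketch assumes we already have the conditional probabilities p(x_i | x_{<i}) that a model assigned to each observed token; the array `probs` and the function names are hypothetical.

```python
import numpy as np

def sequence_log_prob(cond_probs):
    # Chain rule: log p(x) = sum_i log p(x_i | x_<i)
    return float(np.sum(np.log(cond_probs)))

def cross_entropy(cond_probs):
    # Average negative log-likelihood per token, in nats
    return float(-np.mean(np.log(cond_probs)))

def perplexity(cond_probs):
    # Perplexity = exp(average negative log-likelihood)
    return float(np.exp(cross_entropy(cond_probs)))

# Hypothetical probabilities a model gave to the four tokens of one sequence
probs = np.array([0.5, 0.25, 0.125, 0.25])
print(sequence_log_prob(probs), cross_entropy(probs), perplexity(probs))
```

Here the product of the conditionals is 1/256, so the perplexity is 256^(1/4) = 4. The model is as uncertain, on average, as a uniform choice among four tokens.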
Latent-variable modelling and ELBO (VAE)
- Goal: maximize log p(x). Because p(x) = ∫ p(x|z)p(z) dz is intractable, introduce q(z|x) and optimize ELBO:
ELBO(x) = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)) ≤ log p(x).
- Reparameterization: z = μ + σ * ε, ε ∼ N(0, I) to allow backprop.
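The ELBO and the reparameterization step can be sketched with NumPy. The sketch makes several assumptions: a diagonal Gaussian q(z|x) parameterized by `mu` and `log_var`, a standard normal prior p(z), and a unit-variance Gaussian decoder whose mean is given by a hypothetical `decode` function. The KL term uses the closed form for these Gaussians; gradients are not computed here.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I); randomness is isolated in eps
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)

def gaussian_log_likelihood(x, x_mean):
    # log p(x|z) for an assumed unit-variance Gaussian decoder
    return -0.5 * np.sum((x - x_mean) ** 2 + np.log(2.0 * np.pi))

def elbo_estimate(x, mu, log_var, decode, rng, num_samples=1):
    # Monte Carlo estimate of E_q[log p(x|z)] minus the analytic KL term
    recon = np.mean([
        gaussian_log_likelihood(x, decode(reparameterize(mu, log_var, rng)))
        for _ in range(num_samples)
    ])
    return float(recon - kl_to_standard_normal(mu, log_var))

# Toy model: decode(z) = z, so p(x) = N(0, 2I) and the true posterior is
# N(x / 2, 0.5 I). With that q the bound is tight; with q = p(z) it is loose.
rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])
tight = elbo_estimate(x, x / 2, np.log(np.full(2, 0.5)), lambda z: z, rng, 5000)
loose = elbo_estimate(x, np.zeros(2), np.zeros(2), lambda z: z, rng, 5000)
print(tight, loose)
```

In a real VAE, `mu` and `log_var` come from an encoder network and `decode` is a neural network; an autodiff framework then backpropagates through `reparameterize` because the sample is a deterministic function of the parameters and the noise.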
Adversarial objectives (GANs)
- Minimax game with generator G and discriminator D:
min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]
- Practical variants: non-saturating loss, least-squares GAN, Wasserstein GAN (WGAN with gradient penalty) to stabilize training.
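The value function and the non-saturating variant can be written directly in terms of discriminator outputs. The sketch assumes `d_real` and `d_fake` are arrays of probabilities D(x) and D(G(z)) in (0, 1) for one batch each; the function names are illustrative, and no networks or gradient steps are included.

```python
import numpy as np

def gan_value(d_real, d_fake):
    # V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], estimated on a batch
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def discriminator_loss(d_real, d_fake):
    # D maximizes V, i.e. minimizes -V
    return -gan_value(d_real, d_fake)

def generator_loss_saturating(d_fake):
    # Original minimax generator loss: minimize E[log(1 - D(G(z)))]
    return float(np.mean(np.log(1.0 - d_fake)))

def generator_loss_non_saturating(d_fake):
    # Common practical variant: minimize -E[log D(G(z))]
    return float(-np.mean(np.log(d_fake)))

# A discriminator that cannot tell real from fake outputs 0.5 everywhere
half = np.full(8, 0.5)
print(gan_value(half, half), -np.log(4.0))

# Early in training D confidently rejects fakes, e.g. D(G(z)) = 0.01
confident_fake = np.full(8, 0.01)
print(generator_loss_saturating(confident_fake),
      generator_loss_non_saturating(confident_fake))
```

When D outputs 0.5 everywhere, V equals -log 4, the value at the theoretical equilibrium of the original game. When D(G(z)) is close to 0, the saturating loss is nearly flat with respect to the discriminator's logit, whereas the non-saturating loss is large and still provides a useful gradient, which is the usual motivation for the variant.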
Score matching and diffusion objective
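- Score-based methods approximate the score ∇_x log p_t(x) of the noised data distribution at each noise level t. Denoising score matching trains a model s_θ(x, t) to match this score. DDPM uses an equivalent parameterization that predicts the added noise with ε_θ(x_t, t), related to the score by s_θ(x_t, t) = -ε_θ(x_t, t) / sqrt(1 - ᾱ_t). The simplified DDPM loss is:
L = E_{x,ε,t}[|| ε - ε_θ(x_t, t) ||^2], where x_t = sqrt(ᾱ_t) x + sqrt(1 - ᾱ_t) ε, ε ∼ N(0, I), and ᾱ_t = ∏_{s≤t} (1 - β_s) for a noise schedule β_s.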
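The simplified loss can be sketched as a single Monte Carlo estimate over a batch. The sketch assumes data of shape (batch, dim), a linear β schedule from 1e-4 to 0.02 over 1000 steps, and a hypothetical callable `eps_model(x_t, t)` standing in for the noise-prediction network ε_θ; no training loop or network is included.

```python
import numpy as np

# Assumed noise schedule; alpha_bars[t] is the cumulative product of (1 - beta)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def ddpm_simple_loss(x0, eps_model, alpha_bars, rng):
    batch = x0.shape[0]
    # Sample one timestep and one noise vector per example
    t = rng.integers(0, len(alpha_bars), size=batch)
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[t][:, None]  # broadcast over the feature dimension
    # Closed-form forward process: x_t = sqrt(a) x0 + sqrt(1 - a) eps
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    eps_hat = eps_model(x_t, t)
    # Squared error between true and predicted noise, averaged over the batch
    return float(np.mean(np.sum((eps - eps_hat) ** 2, axis=1)))

rng = np.random.default_rng(0)
x0 = np.zeros((4096, 4))  # toy data concentrated at the origin

# Baseline predictor that always outputs zero noise
zero_model = lambda x_t, t: np.zeros_like(x_t)
# For this toy data x_t = sqrt(1 - a) eps, so the noise is exactly recoverable
oracle_model = lambda x_t, t: x_t / np.sqrt(1.0 - alpha_bars[t])[:, None]

print(ddpm_simple_loss(x0, zero_model, alpha_bars, rng))
print(ddpm_simple_loss(x0, oracle_model, alpha_bars, rng))
```

The zero predictor incurs a loss close to the data dimension (the expected squared norm of standard Gaussian noise), while the oracle for this degenerate dataset drives the loss to zero. A trained network sits between these extremes on real data.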
Self-attention and transformer basics
- Attention computes weighted sums of values V, where weights come from similarities between queries ...