How does generative AI work?

Table of contents

  • Introduction and definition
  • Short history and milestones
  • Main families of generative models
    • Autoregressive models
    • Variational Autoencoders (VAEs)
    • Generative Adversarial Networks (GANs)
    • Flow-based models
    • Score-based / Diffusion models
  • Core theoretical foundations and math
    • Probability factorization and likelihood
    • Maximum likelihood and cross-entropy
    • Latent-variable modelling and ELBO
    • Adversarial objectives
    • Score matching and diffusion objective
    • Self-attention and transformer basics
  • Architectures and building blocks
    • Encoders/decoders, CNNs, RNNs, Transformers
    • Attention mechanism and positional encoding
    • Conditioning, control, and guidance
  • Training and sampling procedures
    • Training objectives and optimization
    • Sampling algorithms (ancestral, beam, nucleus/top-p, temperature)
    • Diffusion sampling (DDPM, deterministic samplers, classifier-free guidance)
  • Evaluation metrics and limitations
  • Practical applications and examples
    • Text, images, audio, video, molecules, code
    • Representative models and systems
  • Implementation examples (short code & pseudocode)
  • Challenges, risks, and safety mitigations
  • Current state of the field (as of mid-2024)
  • Future directions and open problems
  • Recommended readings and resources
  • Summary

Introduction and definition

Generative AI refers to machine learning systems that learn a model of data and use it to generate new, previously unseen examples that resemble the training distribution. These models can produce images, text, audio, video, 3D shapes, molecules, code, and more. “Generative” contrasts with “discriminative” models: discriminative models predict labels given inputs, while generative models attempt to model the data distribution itself, p(x), or a conditional distribution p(x|c).

Short history and milestones

  • Pre-deep-learning era: probabilistic models, mixture models, HMMs (1990s–2000s).
  • 2013–2014: Reparameterization trick and VAEs (Kingma & Welling) enabled scalable variational learning for deep latent-variable models.
  • 2014: Generative Adversarial Networks (GANs): Goodfellow et al. introduced adversarial training to generate realistic images.
  • Mid-2010s: Autoregressive models such as PixelRNN/PixelCNN for images, followed by the Transformer (Vaswani et al., 2017) for sequence modeling, which led to large language models (the GPT series).
  • 2019–2022: Diffusion and score-based models rose to prominence for high-quality image generation (e.g., score-based models, Song & Ermon; DDPM, 2020).
  • 2021–2023: Vision-language models (CLIP, ALIGN) provided text-image representations used for conditioning, and latent diffusion (Stable Diffusion) made high-resolution image synthesis efficient.
  • 2020–2024: Scaling Transformers produced dramatic gains in text generation, code generation, and few-shot learning (GPT-3, PaLM, LLaMA, etc.). Diffusion and autoregressive approaches are both dominant paradigms for different media.

Main families of generative models

Generative models can be grouped roughly by how they model the data and how they generate samples.

  1. Autoregressive models
  • Principle: Factorize joint probability p(x) as product of conditionals using chain rule: p(x) = ∏ p(x_i | x_<i).
  • Examples: Recurrent LMs, Transformer-based models (GPT), PixelRNN/PixelCNN (images), WaveNet (audio).
  • Pros: Training is likelihood-based (stable), strong sample quality, exact factorization for likelihood evaluation.
  • Cons: Slow generation (sequential), can be expensive for high-dimensional data.

Mathematical form: p(x) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ...
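As a minimal sketch of the chain-rule factorization, the toy model below scores a sequence by summing log-conditionals. The tables `START` and `COND` and both helper functions are illustrative assumptions, and a bigram model conditions only on the previous token, which simplifies the full p(x_i | x_<i).

```python
import math

# Hypothetical toy model: p(first token) and p(next token | previous token).
START = {"a": 0.6, "b": 0.4}
COND = {
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.5, "b": 0.5},
}

def sequence_log_prob(tokens):
    """log p(x) = log p(x_1) + sum_i log p(x_i | x_{i-1})."""
    log_p = math.log(START[tokens[0]])
    for prev, cur in zip(tokens, tokens[1:]):
        log_p += math.log(COND[prev][cur])
    return log_p

def total_probability(length):
    """Sum p(x) over every sequence of the given length."""
    seqs = [[t] for t in START]
    for _ in range(length - 1):
        seqs = [s + [t] for s in seqs for t in START]
    return sum(math.exp(sequence_log_prob(s)) for s in seqs)
```

Because every conditional is a normalized distribution, the probabilities of all sequences of a fixed length sum to 1, which is why exact likelihoods come for free with this factorization.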

  2. Variational Autoencoders (VAEs)
  • Principle: Introduce a latent variable z and model p(x) = ∫ p(x|z)p(z) dz. Use variational inference to maximize the evidence lower bound (ELBO).
  • Encoder maps x -> q(z|x) (approximate posterior), decoder p(x|z) generates.
  • Key trick: reparameterization allows backprop through sampling.
  • Pros: Principled probabilistic framework, latent representations useful for manipulation.
  • Cons: Often produce blurry images (a simple per-pixel likelihood tends to average over plausible outputs), and there is a trade-off between reconstruction quality and latent regularization.

ELBO: log p(x) ≥ E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z))

  3. Generative Adversarial Networks (GANs)
  • Principle: Two networks: generator G(z) that maps noise z to data space, discriminator D(x) that tries to distinguish real vs fake. Train via a minimax game.
  • Original objective (Goodfellow et al., 2014): min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]
  • Pros: Very sharp, realistic samples (especially for images).
  • Cons: Training instability, mode collapse, lack of explicit likelihood (harder to evaluate probability).

  4. Flow-based models (Normalizing Flows)
  • Principle: Construct an invertible transformation f that maps data x to a latent z and has a tractable Jacobian determinant, then apply the change-of-variables formula. This allows exact log-likelihood evaluation and sampling.
  • Examples: RealNVP, Glow.
  • Pros: Exact likelihood, exact inversion between data and latent, single-pass sampling.
  • Cons: Architectural constraints (invertibility, Jacobian computation) can limit expressivity.

Change of variables: log p(x) = log p_z(f(x)) + log |det ∂f(x)/∂x|
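A one-dimensional affine map is probably the smallest worked instance of this formula. In the sketch below, the function names and the choice of a standard normal base density are assumptions for illustration; the result is checked against the closed-form Gaussian log-density.

```python
import math

def standard_normal_logpdf(z):
    """log-density of N(0, 1), the assumed base distribution p_z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_logpdf(x, scale, shift):
    """log p(x) when x = scale * z + shift and z ~ N(0, 1).

    f maps data to latent: z = f(x) = (x - shift) / scale,
    so df/dx = 1 / scale and log|det df/dx| = -log|scale|.
    """
    z = (x - shift) / scale
    return standard_normal_logpdf(z) - math.log(abs(scale))

def normal_logpdf(x, mean, std):
    """Reference: closed-form log-density of N(mean, std^2)."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))
```

For positive `scale`, `affine_flow_logpdf(x, scale, shift)` agrees with `normal_logpdf(x, shift, scale)`; practical flows stack many such invertible layers with learned, input-dependent scales and shifts.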

  5. Score-based / Diffusion models
  • Principle: Define a forward noising process q(x_t | x_{t-1}) that gradually adds Gaussian noise; learn a reverse denoising model p_θ(x_{t-1} | x_t) or directly learn the score ∇_x log p_t(x). Sampling runs the learned reverse process to map pure noise to data.
  • Important works: Sohl-Dickstein et al. (2015), Song & Ermon (score-based), Ho et al. (DDPM, 2020), Nichol & Dhariwal, and Latent Diffusion (Rombach et al., 2022).
  • Pros: State-of-the-art image quality, flexible conditioning methods, strong theoretical foundations via score matching and stochastic differential equations.
  • Cons: Sampling requires many steps (though efficient samplers and distillation reduce steps), computational cost.

Core theoretical foundations and math

Probability factorization and likelihood

  • Maximum likelihood estimation (MLE) is a central principle: choose model parameters θ to maximize ∑_i log p_θ(x^(i)).
  • For autoregressive models, exact likelihood is tractable because of factorization.

Cross-entropy and perplexity

  • For discrete sequences, the negative log-likelihood (cross-entropy) is the training loss. Perplexity, defined as exp(average per-token negative log-likelihood), is commonly reported for language models.
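To make the relation between cross-entropy and perplexity concrete, here is a small sketch. It assumes `token_probs` holds the probability the model assigned to each observed token; the function name is illustrative.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))
```

A model that assigns probability 1/4 to every observed token has perplexity 4, so perplexity can be read as an effective number of equally likely choices per token.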

Latent-variable modelling and ELBO (VAE)

  • Goal: maximize log p(x). Because p(x) = ∫ p(x|z)p(z) dz is intractable, introduce q(z|x) and optimize ELBO: ELBO(x) = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)) ≤ log p(x).
  • Reparameterization: z = μ + σ * ε, ε ∼ N(0, I) to allow backprop.
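The bound can be checked by hand on a toy model where everything is Gaussian. The sketch below assumes p(z) = N(0, 1) and p(x|z) = N(z, 1), so that p(x) = N(0, 2) is known exactly; the function names are illustrative.

```python
import math

def log_marginal(x):
    """Exact log p(x) for p(z) = N(0, 1), p(x|z) = N(z, 1), i.e. p(x) = N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, s):
    """ELBO for q(z|x) = N(m, s^2), in closed form for this toy model."""
    # E_q[log p(x|z)] with p(x|z) = N(x; z, 1)
    expected_loglik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s * s)
    # KL(N(m, s^2) || N(0, 1))
    kl = 0.5 * (m * m + s * s - 1.0 - math.log(s * s))
    return expected_loglik - kl
```

With the true posterior q(z|x) = N(x/2, 1/2) the ELBO equals log p(x); any other choice, such as q = N(0, 1), gives a strictly smaller value. In a real VAE the expectation is not available in closed form and is estimated with reparameterized samples.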

Adversarial objectives (GANs)

  • Minimax game with generator G and discriminator D: min_G max_D V(D, G) = E_{x∼p_data} [log D(x)] + E_{z∼p(z)} [log(1 - D(G(z)))]
  • Practical variants: non-saturating loss, least-squares GAN, Wasserstein GAN (WGAN with gradient penalty) to stabilize training.
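For discrete distributions the minimax value can be evaluated directly. The sketch below uses the optimal discriminator from Goodfellow et al. (2014), D*(x) = p_data(x) / (p_data(x) + p_g(x)), and writes the second expectation over the generator's output distribution p_g instead of over z. The function names are illustrative, and every outcome is assumed to have positive probability under at least one distribution.

```python
import math

def optimal_discriminator(p_data, p_gen):
    """D*(x) = p_data(x) / (p_data(x) + p_gen(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_gen)]

def value(p_data, p_gen, d):
    """V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_gen}[log(1 - D(x))]."""
    real_term = sum(pd * math.log(dx) for pd, dx in zip(p_data, d) if pd > 0)
    fake_term = sum(pg * math.log(1.0 - dx) for pg, dx in zip(p_gen, d) if pg > 0)
    return real_term + fake_term
```

When the generator matches the data distribution, D* is 1/2 everywhere and the value is -log 4; any mismatch lets the discriminator push the value higher, which is the signal the generator trains against.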

Score matching and diffusion objective

  • Score-based methods approximate the score ∇_x log p_t(x) of the noised data distribution. Denoising score matching trains a model s_θ(x, t) to estimate this score, which for Gaussian noise is equivalent (up to scaling) to predicting the noise added at time t. DDPM uses the simplified loss L = E_{x,ε,t} [|| ε - ε_θ(x_t, t) ||^2], where x_t = sqrt(ᾱ_t) x + sqrt(1 - ᾱ_t) ε.
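The closed-form noising step and the simplified loss can be sketched in NumPy as follows. The linear beta schedule and its endpoints are assumptions chosen for illustration, the function names are hypothetical, and the "data" is random numbers standing in for a real batch.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, eps):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def simple_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.normal(size=(4, 8))        # stand-in "data" batch
eps = rng.normal(size=x0.shape)     # the noise the model should predict
x_t = q_sample(x0, 500, alpha_bar, eps)
```

Since ᾱ_t decreases toward 0, late timesteps are almost pure noise. A perfect noise predictor would return `eps` exactly, giving zero loss.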

Self-attention and transformer basics

  • Attention computes weighted sums of values V, where weights come from similarities between queries Q and keys K: Attention(Q,K,V) = softmax(Q K^T / sqrt(d_k)) V
  • Transformer layer = multi-head attention + feed-forward blocks, with residual connections and layer norm. Attention connects any two positions directly, which helps with long-range dependencies, and all positions can be processed in parallel during training.
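The attention formula above translates almost line for line into NumPy. This is a single-head sketch without masking or learned projections; the function names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for q: [n, d_k], k: [m, d_k], v: [m, d_v]."""
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))   # [n, m], each row sums to 1
    return weights @ v, weights
```

Each output row is a convex combination of the rows of V. If all keys are identical, the weights are uniform and every output is simply the mean of the values.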

Architectures and building blocks

Encoders, decoders, CNNs, RNNs, Transformers

  • Images: convolutional architectures or U-Nets (common in diffusion models).
  • Text/sequences: Transformers are dominant.
  • Audio: WaveNet-like autoregressive, GANs, diffusion for spectrograms.
  • 3D & molecules: Graph neural networks, equivariant networks.

Conditioning, control, and guidance

  • Conditional generation uses conditioning c (text prompt, class label, sketch). Many methods:
    • Conditional autoregressive: include prompt tokens.
    • Conditional GAN: condition both G and D.
    • Classifier guidance (diffusion): use classifier gradient to bias sampling toward a class.
    • Classifier-free guidance: train model to handle unconditional and conditional inputs and combine predictions to amplify conditioning at sample time.
    • Prompt engineering and controlled sampling (temperature, top-k, nucleus).
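For classifier-free guidance, the "combine predictions" step is a one-line extrapolation. The sketch below shows one common parameterization; conventions differ between papers and libraries on how the scale is defined, so the function name and the meaning of `guidance_scale` here are assumptions.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In this convention a scale of 0 ignores the condition, 1 reproduces the plain conditional prediction, and values above 1 amplify the conditioning, typically at some cost in sample diversity.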

Training and sampling procedures

Training objectives and optimization

  • Typically use stochastic gradient descent-based optimizers (Adam, AdamW).
  • Large models trained over many steps on large datasets; regularization, data augmentation, and mixed precision are common.

Sampling algorithms

  • Autoregressive: sample tokens sequentially from p(x_t | x_<t) using:
    • Temperature scaling: p_i ∝ exp(logits_i / T); T < 1 sharpens the distribution and T > 1 flattens it.
    • Top-k: restrict to top k tokens.
    • Nucleus (top-p): restrict to smallest set with cumulative mass ≥ p.
    • Beam search: keep top-B hypotheses (used in structured prediction; can harm diversity for open-ended generation).
  • GANs/Flows/VAEs: sample z from the latent prior, then map it to data deterministically (GAN generator, inverse flow) or stochastically (sampling from the VAE decoder distribution).
  • Diffusion: start from noise x_T ∼ N(0,I) and run the reverse process with the learned denoiser over many steps; samplers trade off speed and quality (ancestral DDPM, deterministic DDIM, improved SDE solvers).

Diffusion sampling (high-level pseudocode)

Plain Text
x_T ∼ N(0, I)
for t = T,...,1:
    predict ε_θ(x_t, t)
    compute mean μ_θ(x_t, t)
    sample x_{t-1} ∼ N(μ_θ, Σ_t)   # or deterministic update for DDIM
return x_0
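The loop above can be made concrete with the DDPM mean from Ho et al. (2020), μ_θ = (x_t - β_t / sqrt(1 - ᾱ_t) · ε_θ) / sqrt(α_t), and the variance choice Σ_t = β_t I. In this NumPy sketch the function names are illustrative and the "network" is a hypothetical oracle that returns the exact noise for a dataset consisting of a single point; a trained model would take its place. Array index t = 0 corresponds to the first diffusion step.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling with variance choice Sigma_t = beta_t * I."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.normal(size=shape)                      # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps_hat = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.normal(size=shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x

def make_point_mass_oracle(target, betas):
    """Hypothetical stand-in for a trained network: exact noise when all data equals `target`."""
    alpha_bar = np.cumprod(1.0 - betas)

    def eps_model(x_t, t):
        return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

    return eps_model
```

With the oracle, the sampler maps pure noise back to `target`, which is a quick sanity check that the update rule and the index conventions are consistent before plugging in a learned ε_θ.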

Evaluation metrics and limitations

  • Images: FID (Fréchet Inception Distance), IS (Inception Score), precision/recall for distributions, human evaluations.
  • Text: Perplexity (likelihood), BLEU/ROUGE (for specific tasks), but open-ended generation requires human eval for coherence, factuality, creativity.
  • Audio: MOS (Mean Opinion Score), spectrogram reconstruction metrics.
  • General limitations: metrics can be gamed, don't fully capture diversity/fidelity tradeoffs, and human judgment remains important.
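As an illustration of what a metric like FID computes, the sketch below implements the Fréchet distance between two Gaussians fitted to feature vectors. The real metric extracts features with a pretrained Inception network; here the features are assumed to be given, and the function names are illustrative.

```python
import numpy as np

def gaussian_stats(features):
    """Mean and covariance of a [num_samples, dim] feature matrix."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, cov1, mu2, cov2):
    """||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^{1/2})."""
    # Tr((cov1 cov2)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov1 @ cov2, which are real and non-negative for positive
    # semi-definite inputs (up to rounding, hence the clip).
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * trace_sqrt)
```

Identical statistics give a distance of 0, and shifting only the mean adds the squared shift. Because the score depends only on means and covariances of features, two visibly different sample sets can score similarly, which is one reason human evaluation remains important.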

Practical applications and examples

  • Text: GPT-family (text generation, summarization, chat), code generation (Codex, AlphaCode).
  • Images: DALL·E, Imagen, Stable Diffusion, Midjourney (text-to-image synthesis).
  • Audio & Music: WaveNet, Jukebox, MusicLM.
  • Video: emerging diffusion-based and autoregressive models for short clips; heavy compute and alignment challenges.
  • Molecules & drug discovery: generative models for molecular graphs, diffusion and graph VAEs for inverse design.
  • 3D & CAD: generative models for shapes and meshes, helpful in design automation.
  • Data augmentation, simulation, compression, content creation, personalization, and interactive assistants.

Representative systems (non-exhaustive)

  • Autoregressive text: GPT-3 / GPT-4 (OpenAI), PaLM (Google), LLaMA (Meta).
  • Diffusion images: DDPM, GLIDE, Stable Diffusion (latent diffusion), Imagen.
  • Conditional multimodal: CLIP (contrastive vision-language representation model, not itself generative), Flamingo (few-shot multimodal), DALL·E 2 (diffusion conditioned on CLIP embeddings).
  • Flows & VAEs: RealNVP, Glow, VQ-VAE (vector quantized VAE used in discrete latent models).

Implementation examples (short code & pseudocode)

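  1. Minimal autoregressive sampling (PyTorch)
Python
import torch

def sample_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id from a 1-D logits tensor of shape [vocab_size]."""
    probs = torch.softmax(logits / temperature, dim=-1)
    if top_k is not None:
        # Zero out everything below the k-th largest probability.
        kth = torch.topk(probs, top_k).values[-1]
        probs = torch.where(probs < kth, torch.zeros_like(probs), probs)
        probs = probs / probs.sum()
    if top_p is not None:
        # Nucleus: keep the smallest set with cumulative mass >= top_p.
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # Drop a token if the mass before it already reaches top_p.
        drop = (cumulative - sorted_probs) >= top_p
        probs[sorted_idx[drop]] = 0.0
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()
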
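  2. VAE reparameterization trick (PyTorch-like)
Python
import torch

# Assumed to exist: encoder(x) -> (mu, logvar), decoder(z) -> reconstruction,
# reconstruction_loss(recon, x) summed over the batch (e.g. squared error or
# binary cross-entropy), a data batch x, and a scalar weight beta.
mu, logvar = encoder(x)                  # parameters of q(z|x)
std = (0.5 * logvar).exp()
eps = torch.randn_like(std)
z = mu + eps * std                       # reparameterization
recon = decoder(z)
recon_loss = reconstruction_loss(recon, x)
# KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian
kl = 0.5 * torch.sum(mu**2 + std**2 - 1 - logvar)
loss = recon_loss + beta * kl            # negative ELBO when beta = 1
loss.backward()
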
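  3. Simplified GAN training loop (PyTorch-like)
Python
import torch
import torch.nn.functional as F

# Assumed to exist: generator G, discriminator D (outputs probabilities in (0, 1)),
# optimizers d_optimizer and g_optimizer, a dataloader, and sample_noise(n).
for real in dataloader:
    batch_size = real.size(0)

    # Update discriminator: minimize -E[log D(real)] - E[log(1 - D(fake))]
    fake = G(sample_noise(batch_size)).detach()
    d_real, d_fake = D(real), D(fake)
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # Update generator (non-saturating loss): minimize -E[log D(fake)]
    d_fake = D(G(sample_noise(batch_size)))
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()
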
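  4. Diffusion training objective (simplified, PyTorch-like)
Python
import torch
import torch.nn.functional as F

# Assumed to exist: model(x_t, t) predicting noise, a data batch x of shape
# [B, ...], and alpha_bar, a 1-D tensor of cumulative products of (1 - beta_t).
t = torch.randint(0, len(alpha_bar), (x.size(0),))   # one timestep per example
eps = torch.randn_like(x)
a = alpha_bar[t].view(-1, *([1] * (x.dim() - 1)))    # broadcast over data dims
x_t = a.sqrt() * x + (1 - a).sqrt() * eps
pred_eps = model(x_t, t)
loss = F.mse_loss(pred_eps, eps)
loss.backward()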

Challenges, risks, and safety mitigations

  • Hallucinations and factual errors: LLMs can produce plausible-sounding but incorrect statements.
  • Bias and fairness: Models can reproduce or amplify societal biases in training data.
  • Deepfakes and misuse: Realistic synthetic media can be used maliciously.
  • Intellectual property and attribution: Generated content may replicate or derive from copyrighted works.
  • Privacy: Models may memorize and regurgitate sensitive training data.
  • Environmental/compute cost: Large models require enormous compute and energy.

Mitigations

  • Alignment techniques: RLHF (reinforcement learning from human feedback) to align outputs with preferences.
  • Safety filters: classifiers and rule-based blocking for harmful content.
  • Watermarking and provenance: embed identifiable signatures in generated outputs.
  • Differential privacy and dataset curation: reduce memorization of sensitive data.
  • Model audits, red teaming, and transparency on limitations.

Current state of the field (as of mid-2024)

  • Large autoregressive LMs remain dominant in text; retrieval-augmented generation and grounding (tools, search, external knowledge) are common for factuality.
  • Diffusion models lead for high-fidelity image synthesis; latent diffusion brings efficiency for high-res imagery and conditioning (text-to-image).
  • Multimodal models that combine text, vision, audio are advancing rapidly; foundation models power many downstream applications.
  • Efforts focus on controllability (style, structure, constraints), sample efficiency, faster sampling (few-step diffusion via distillation and improved samplers), and safety (alignment pipelines, RLHF).
  • Democratization: open-source models (Stable Diffusion, LLaMA variants) enable broad experimentation, while commercial models emphasize safety and scale.

Future directions and open problems

  • Efficiency and scaling: sparse models, Mixture-of-Experts, quantization, and hardware co-design to reduce compute.
  • Controllability and compositionality: better ways to precisely instruct models and combine them modularly.
  • Robust evaluation: improved metrics that correlate with human judgment for open-ended generation.
  • Long-form coherent multimodal content (long videos, interactive narratives).
  • Scientific discovery: generative models for molecules, materials, and complex simulations.
  • Regulation and governance: legal frameworks, standards for disclosure/watermarking, and responsibilities for deployment.
  • Alignment and reliable truthfulness: reducing hallucinations and making models verifiable and accountable.

Recommended readings and resources

  • Original and influential papers:
    • Goodfellow et al., 2014 — Generative Adversarial Nets (GANs)
    • Kingma & Welling, 2013 — Auto-Encoding Variational Bayes (VAEs)
    • Vaswani et al., 2017 — Attention Is All You Need (Transformers)
    • Oord et al., 2016 — WaveNet / PixelCNN papers
    • Sohl-Dickstein et al., 2015; Ho et al., 2020 (DDPM); Song & Ermon (score-based) — Diffusion models
    • Rombach et al., 2022 — Latent Diffusion (Stable Diffusion)
  • Tutorials and textbooks:
    • “Deep Learning” by Goodfellow, Bengio, Courville (background)
    • Stanford CS courses on deep generative models and modern blogs/tutorials (Hugging Face, Distill, OpenAI technical blog)

Summary

Generative AI encompasses a variety of probabilistic and neural approaches to model and synthesize complex data distributions. The key paradigms (autoregressive models, VAEs, GANs, flow-based models, and diffusion/score-based models) each trade off tractability, sample quality, and computational cost. Transformers and attention mechanisms revolutionized sequence modeling, leading to powerful language models; diffusion methods currently drive state-of-the-art image synthesis. Training combines probabilistic objectives, adversarial games, and denoising/score matching. Sampling strategies (temperature, top-k/p, classifier guidance, etc.) enable control over creativity and fidelity. The technology has wide application but raises significant ethical and safety concerns, so research now balances capability improvement with alignment, robustness, transparency, and efficient deployment.
