How does generative AI work?

Definition

Generative AI refers to models that learn a data distribution p(x) (or a conditional p(x|c)) and synthesize new examples resembling the training data: images, text, audio, video, 3D shapes, molecules, code, etc. They contrast with discriminative models, which predict labels given inputs.

Short history & milestones

  • Pre-deep-learning: probabilistic models, mixtures, HMMs.
  • 2013–2014: VAEs (reparameterization) and early autoregressive models.
  • 2014: GANs introduced adversarial training for realistic image generation.
  • 2017: Transformers enabled large-scale sequence models (GPT family).
  • 2019–2022: Score-based & diffusion models (DDPM, score matching) regained prominence for image synthesis.
  • 2020–2024: Scaling Transformers and latent diffusion (Stable Diffusion) drove major capability gains and multimodal systems.

Main families of generative models

  • Autoregressive: factorize p(x) = ∏_i p(x_i | x_{<i}).
  • Variational Autoencoders (VAEs): latent-variable models maximizing the ELBO with encoder q(z|x) and decoder p(x|z). Principled but can produce blurrier outputs.
  • GANs: generator vs discriminator minimax game. Produce sharp images but can be unstable and lack a tractable likelihood.
  • Flow-based models (normalizing flows): invertible transforms with exact likelihood via change of variables (RealNVP, Glow). Exact likelihood but architectural constraints.
  • Score-based / diffusion models: learn a denoiser or the score ∇_x log p_t(x) to reverse a noise process. State-of-the-art image quality; sampling typically requires many steps (DDPM, DDIM, latent diffusion).

Core theoretical foundations

  • Maximum likelihood and cross-entropy/perplexity for discrete sequences.
  • Latent-variable ELBO: ELBO = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z)) for VAEs; reparameterization enables gradient-based learning.
  • Adversarial objectives for GANs: minimax game and practical variants (non-saturating, WGAN) to improve stability.
  • Score matching / diffusion objective: train a denoiser or score network (e.g., L = E[||ε − ε_θ(x_t, t)||^2] in DDPM).
  • Transformers & attention: attention(Q, K, V) = softmax(QK^T/√d_k)V; multi-head attention plus feed-forward layers are central to modern sequence models.

Architectures & building blocks

  • Encoders/decoders, CNNs and U-Nets (images), RNNs (less common now), Transformers (dominant for text and many multimodal tasks).
  • Attention mechanisms, positional encodings, residual connections, normalization layers.
  • Graph and equivariant networks for molecules/3D data.
  • Conditioning & guidance: prompt tokens, classifier guidance, classifier-free guidance, fine-grained control methods.

Training and sampling procedures

  • Optimization: SGD variants (Adam/AdamW), large-scale pretraining, mixed precision, data augmentation.
  • Autoregressive sampling: sequential token sampling with temperature, top-k, nucleus (top-p), or beam search.
  • VAE/Flow/GAN sampling: draw a latent z and decode (flows are invertible deterministic transforms).
  • Diffusion sampling: start from noise and run the learned reverse process (ancestral DDPM, deterministic DDIM, improved SDE solvers); classifier-free guidance amplifies conditioning.

Evaluation metrics & limitations

  • Images: FID, Inception Score, precision/recall, human evaluation.
  • Text: perplexity and task metrics (BLEU/ROUGE), but open-ended quality needs human judgment (coherence, factuality).
  • Audio: MOS and reconstruction measures.
  • Limitations: metrics can be gamed and may not capture diversity/fidelity trade-offs; human evaluation remains essential.

Practical applications & representative systems

  • Text: GPT family, Codex for code, chat assistants.
  • Images: DALL·E, Imagen, Stable Diffusion, Midjourney (text-to-image).
  • Audio/music: WaveNet, Jukebox, MusicLM.
  • Video: early diffusion/autoregressive video models (computationally heavy).
  • Science & design: molecular generation, materials discovery, 3D shape synthesis.
  • Representative technologies: CLIP (contrastive vision-language), Latent Diffusion, VQ-VAE, RealNVP/Glow.

Implementation patterns (high level)

  • Autoregressive: sample from the logits with temperature/top-k/top-p.
  • VAE: encoder → (μ, σ) → reparameterize z = μ + σ⊙ε → decode; optimize reconstruction + KL.
  • GAN: alternate discriminator and generator updates with adversarial losses.
  • Diffusion: add noise at a random t, train the model to predict the noise (or the denoised sample); sample by reversing the noise schedule.

Challenges, risks & mitigations

  • Risks: hallucinations, bias amplification, deepfakes, IP/privacy leakage, environmental cost.
  • Mitigations: RLHF/alignment, safety filters, provenance/watermarking, differential privacy, dataset curation, audits and red-teaming.

Current state (mid-2024)

  • Autoregressive LMs dominate text; retrieval and grounding improve factuality.
  • Diffusion models lead high-fidelity image synthesis; latent diffusion enables efficient high-resolution generation.
  • Multimodal foundation models and open-source releases (Stable Diffusion, LLaMA variants) widened access and innovation.
  • Active research on faster sampling, controllability, safety, and alignment.

Future directions & open problems

  • Compute efficiency: sparsity, MoE, quantization, hardware/software co-design.
  • Better controllability, compositionality, and modular combination of models.
  • Robust evaluation metrics that correlate with human judgment.
  • Long-form multimodal coherence (long videos, narratives) and scientific discovery applications.
  • Governance: watermarking, legal frameworks, accountability mechanisms.

Recommended readings & resources

  • Foundational papers: Goodfellow et al. (GANs), Kingma & Welling (VAEs), Vaswani et al. (Transformers), Ho et al. (DDPM), Song & Ermon (score-based), Rombach et al. (Latent Diffusion).
  • Textbooks/tutorials: Goodfellow/Bengio/Courville "Deep Learning", Hugging Face and OpenAI technical blogs, Stanford deep generative model courses.
Concise takeaway

Generative AI comprises complementary paradigms (autoregressive, VAE, GAN, flow, diffusion) that trade off likelihood tractability, sample quality, and compute. Transformers currently lead in text and diffusion methods in images. Progress focuses on efficiency, control, alignment, and safer deployment, while applications span creative, scientific, and practical domains.
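
The autoregressive sampling controls named in this overview (temperature, top-k, nucleus/top-p) can be illustrated with a small sketch. It uses only the Python standard library; the function names and the example logits are hypothetical, and production libraries may order or combine the filters differently.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Zero out tokens outside the top-k set and/or the nucleus, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
                break
        order = kept
    mass = sum(probs[i] for i in order)
    filtered = [0.0] * len(probs)
    for i in order:
        filtered[i] = probs[i] / mass
    return filtered

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token scores
probs = softmax(logits)
nucleus = filter_top_k_top_p(probs, top_p=0.8)  # keeps only the two likeliest tokens
token = sample_token(logits, temperature=0.7, top_k=2, rng=random.Random(0))
```

Lower temperature and smaller top-k or top-p make outputs more predictable; larger values trade predictability for diversity.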

Deep Article

How does generative AI work?

Table of contents

  • Introduction and definition
  • Short history and milestones
  • Main families of generative models
  • Autoregressive models
  • Variational Autoencoders (VAEs)
  • Generative Adversarial Networks (GANs)
  • Flow-based models
  • Score-based / Diffusion models
  • Core theoretical foundations and math
  • Probability factorization and likelihood
  • Maximum likelihood and cross-entropy
  • Latent-variable modelling and ELBO
  • Adversarial objectives
  • Score matching and diffusion objective
  • Self-attention and transformer basics
  • Architectures and building blocks
  • Encoders/decoders, CNNs, RNNs, Transformers
  • Attention mechanism and positional encoding
  • Conditioning, control, and guidance
  • Training and sampling procedures
  • Training objectives and optimization
  • Sampling algorithms (ancestral, beam, nucleus/top-p, temperature)
  • Diffusion sampling (DDPM, deterministic samplers, classifier-free guidance)
  • Evaluation metrics and limitations
  • Practical applications and examples
  • Text, images, audio, video, molecules, code
  • Representative models and systems
  • Implementation examples (short code & pseudocode)
  • Challenges, risks, and safety mitigations
  • Current state of the field (as of mid-2024)
  • Future directions and open problems
  • Recommended readings and resources
  • Summary

Introduction and definition

Generative AI refers to machine learning systems that learn a model of data and use it to generate new, previously unseen examples that resemble the training distribution. These models can produce images, text, audio, video, 3D shapes, molecules, code, and more. “Generative” contrasts with “discriminative”: discriminative models predict labels given inputs, while generative models attempt to model the data distribution itself, p(x) (or a conditional p(x|c)).

Short history and milestones

  • Pre-deep-learning era: probabilistic models, mixture models, HMMs (1990s–2000s).
  • 2013–2014: Reparameterization trick and VAEs (Kingma & Welling) enabled scalable variational learning for deep latent-variable models.
  • 2014: Generative Adversarial Networks (GANs; Goodfellow et al.) introduced adversarial training to generate realistic images.
  • Mid-2010s: Autoregressive models like PixelRNN/PixelCNN for images and Transformer (Vaswani et al., 2017) for sequence modeling led to large language models (GPT series).
  • 2019–2022: Diffusion models and score-based models regained prominence for high-quality image generation (e.g., DDPM, 2020; Score-based models, Song & Ermon).
  • 2021–2023: Multimodal models (CLIP, ALIGN) and latent diffusion (Stable Diffusion) made high-resolution image synthesis efficient.
  • 2020–2024: Scaling Transformers produced dramatic gains in text generation, code generation, and few-shot learning (GPT-3, PaLM, LLaMA, etc.). Diffusion and autoregressive approaches are both dominant paradigms for different media.

Main families of generative models

Generative models can be grouped roughly by how they model the data and how they generate samples.

1) Autoregressive models

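  • Principle: Factorize the joint probability p(x) as a product of conditionals using the chain rule: p(x) = ∏_i p(x_i | x_{<i}).

2) Variational Autoencoders (VAEs)

  • Principle: Latent-variable models trained by maximizing the ELBO; an encoder q(z|x) (approximate posterior) maps data to latents, and a decoder p(x|z) generates.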
  • Key trick: reparameterization allows backprop through sampling.
  • Pros: Principled probabilistic framework, latent representations useful for manipulation.
  • Cons: Often produce blurry images (likelihood-driven models can average), trade-offs between reconstruction and regularization.

ELBO: log p(x) ≥ E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z))

3) Generative Adversarial Networks (GANs)

  • Principle: Two networks: generator G(z) that maps noise z to data space, discriminator D(x) that tries to distinguish real vs fake. Train via a minimax game.
  • Original objective (Goodfellow et al., 2014): min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]
  • Pros: Very sharp, realistic samples (especially for images).
  • Cons: Training instability, mode collapse, lack of explicit likelihood (harder to evaluate probability).

4) Flow-based models (Normalizing Flows)

  • Principle: Construct invertible transformations f that map data x to latent z with tractable Jacobian determinant, using change-of-variables formula. Allows exact log-likelihood and sampling.
  • Examples: RealNVP, Glow.
  • Pros: Exact likelihood, invertible sampling, efficient.
  • Cons: Architectural constraints (invertibility, Jacobian computation) can limit expressivity.

Change of variables: log p(x) = log p_z(f(x)) + log |det ∂f(x)/∂x|
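
As a worked check of the change-of-variables formula, the sketch below uses the simplest possible flow: a one-dimensional affine map z = f(x) = (x − shift) / scale with a standard normal base density. The function names are hypothetical, and real flows such as RealNVP or Glow stack many learned invertible layers rather than one fixed map.

```python
import math

def standard_normal_logpdf(z):
    """Base density log p_z(z) for z ~ N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def affine_flow_logpdf(x, scale, shift):
    """log p(x) = log p_z(f(x)) + log |df/dx| for z = f(x) = (x - shift) / scale."""
    z = (x - shift) / scale
    log_abs_det = -math.log(abs(scale))  # df/dx = 1 / scale
    return standard_normal_logpdf(z) + log_abs_det

def gaussian_logpdf(x, mean, std):
    """Reference density N(mean, std^2), used only to check the flow result."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def flow_sample(z, scale, shift):
    """Sampling inverts the flow: x = f^{-1}(z) = scale * z + shift."""
    return scale * z + shift

x = 1.3
flow_lp = affine_flow_logpdf(x, scale=2.0, shift=0.5)
ref_lp = gaussian_logpdf(x, mean=0.5, std=2.0)  # agrees with flow_lp up to rounding
```

The two log-densities agree because pushing N(0, 1) through x = 2z + 0.5 yields N(0.5, 2²); the Jacobian term is what accounts for the stretched volume.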

5) Score-based / Diffusion models

  • Principle: Define a forward noising process q(x_t | x_{t-1}) that gradually adds Gaussian noise; learn a reverse denoising model p_θ(x_{t-1} | x_t) or directly learn the score ∇_x log p_t(x). Sampling runs the learned reverse process to map pure noise to data.
  • Important works: Sohl-Dickstein et al. (2015), Song & Ermon (score-based), Ho et al. (DDPM, 2020), Nichol & Dhariwal, and Latent Diffusion (Rombach et al., 2022).
  • Pros: State-of-the-art image quality, flexible conditioning methods, strong theoretical foundations via score matching and stochastic differential equations.
  • Cons: Sampling requires many steps (though efficient samplers and distillation reduce steps), computational cost.

Core theoretical foundations and math

Probability factorization and likelihood

  • Maximum likelihood estimation (MLE) is a central principle: choose model parameters θ to maximize ∑_i log p_θ(x^(i)).
  • For autoregressive models, exact likelihood is tractable because of factorization.
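
To make the tractability claim concrete, here is a toy sketch of an autoregressive log-likelihood. It assumes a hypothetical binary vocabulary and, for brevity, lets each conditional depend only on the previous token; a real model conditions on the full prefix, but the sum-of-log-conditionals structure is the same.

```python
import itertools
import math

# Hypothetical conditionals over a binary vocabulary {0, 1}.
P_FIRST = {0: 0.6, 1: 0.4}              # p(x_1)
P_NEXT = {0: {0: 0.7, 1: 0.3},          # p(x_i | x_{i-1} = 0)
          1: {0: 0.2, 1: 0.8}}          # p(x_i | x_{i-1} = 1)

def log_likelihood(seq):
    """log p(x) as a sum of log-conditionals (context truncated to one token here)."""
    total = math.log(P_FIRST[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        total += math.log(P_NEXT[prev][cur])
    return total

seq = (0, 0, 1)
ll = log_likelihood(seq)  # log(0.6) + log(0.7) + log(0.3)

# Each conditional is normalized, so the joint sums to 1 over all sequences.
total_mass = sum(math.exp(log_likelihood(s))
                 for s in itertools.product((0, 1), repeat=3))
```

No integral or normalizing constant is needed at any point, which is why exact maximum-likelihood training is straightforward for this family.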

Cross-entropy and perplexity

  • For discrete sequences, the training loss is the negative log-likelihood (cross-entropy). Perplexity, defined as exp(average negative log-likelihood per token), is commonly reported for language models.
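
A small worked example may help here. The sketch below computes cross-entropy and perplexity from the probabilities a model assigned to the correct tokens; the probability values are made up for illustration.

```python
import math

def cross_entropy(token_probs):
    """Average negative log-likelihood (in nats) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """exp of the average negative log-likelihood."""
    return math.exp(cross_entropy(token_probs))

# Hypothetical probabilities assigned to each correct next token.
token_probs = [0.5, 0.25, 0.125, 0.5]
nll = cross_entropy(token_probs)
ppl = perplexity(token_probs)

# A model that always spreads its mass uniformly over 4 tokens has perplexity 4.
uniform_ppl = perplexity([0.25] * 10)
```

Perplexity can be read as an effective branching factor: the uniform case shows that a perplexity of 4 corresponds to being as uncertain as a fair choice among four tokens.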

Latent-variable modelling and ELBO (VAE)

  • Goal: maximize log p(x). Because p(x) = ∫ p(x|z)p(z) dz is intractable, introduce q(z|x) and optimize ELBO:

ELBO(x) = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)) ≤ log p(x).

  • Reparameterization: z = μ + σ ⊙ ε, ε ∼ N(0, I), so that gradients can flow through the sampling step.
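
The ELBO and the reparameterization step can be sketched numerically. The example below assumes a diagonal Gaussian q(z|x), a standard normal prior, and a unit-variance Gaussian likelihood; the fixed linear `decode` function and the values of μ and σ are hypothetical stand-ins for what encoder and decoder networks would produce.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); the randomness lives only in eps."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def decode(z):
    """Hypothetical decoder: a fixed linear map from 2-D latents to 3-D data."""
    return [z[0], z[1], z[0] + z[1]]

def log_p_x_given_z(x, z):
    """Unit-variance Gaussian likelihood log N(x; decode(z), I)."""
    sq_err = sum((a - b) ** 2 for a, b in zip(x, decode(z)))
    return -0.5 * sq_err - 0.5 * len(x) * math.log(2 * math.pi)

def elbo_estimate(x, mu, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_q[log p(x|z)] minus the analytic KL term."""
    rng = random.Random(seed)
    recon = sum(log_p_x_given_z(x, reparameterize(mu, sigma, rng))
                for _ in range(n_samples)) / n_samples
    return recon - kl_to_standard_normal(mu, sigma)

x = [1.0, -0.5, 0.5]
elbo = elbo_estimate(x, mu=[0.8, -0.4], sigma=[0.5, 0.5])
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])  # q equals the prior
```

In a trained VAE, an automatic-differentiation framework would backpropagate through `reparameterize` to update the encoder that outputs μ and σ; this sketch only evaluates the objective.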

Adversarial objectives (GANs)

  • Minimax game with generator G and discriminator D:

min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]

  • Practical variants: non-saturating loss, least-squares GAN, Wasserstein GAN (WGAN with gradient penalty) to stabilize training.
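
The difference between the original and non-saturating generator losses can be seen numerically. This sketch assumes the discriminator outputs D = sigmoid(a) for a logit a and compares the gradient each loss sends back through that logit; the function names and the sample value 0.01 are illustrative only.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of V(D, G): D wants D(x) near 1 on real and D(G(z)) near 0 on fake."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss_saturating(d_fake):
    """Original minimax generator loss: minimize E[log(1 - D(G(z)))]."""
    return sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)

def generator_loss_non_saturating(d_fake):
    """Practical variant: minimize -E[log D(G(z))]."""
    return -sum(math.log(d) for d in d_fake) / len(d_fake)

def grad_logit_saturating(d):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a), written in terms of d = sigmoid(a)."""
    return -d

def grad_logit_non_saturating(d):
    """d/da [-log sigmoid(a)] = -(1 - sigmoid(a))."""
    return -(1.0 - d)

# Early in training the discriminator easily rejects fakes, e.g. D(G(z)) = 0.01.
d = 0.01
g_sat = grad_logit_saturating(d)          # tiny gradient: the loss has saturated
g_nonsat = grad_logit_non_saturating(d)   # gradient magnitude close to 1

# With D stuck at 0.5 everywhere, the discriminator loss equals log 4.
loss_at_half = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

Both generator losses push D(G(z)) upward, but only the non-saturating form keeps a useful gradient when the discriminator is confidently rejecting samples.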

Score matching and diffusion objective

  • Score-based methods approximate the score ∇_x log p_t(x) with a network s_θ(x, t). Denoising score matching trains it on noised data, which is equivalent, up to a time-dependent scaling, to predicting the noise added at time t. DDPM uses the simplified loss:

L = E_{x,ε,t}[ || ε − ε_θ(x_t, t) ||^2 ], where x_t = sqrt(ᾱ_t) x + sqrt(1 − ᾱ_t) ε.
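
One training step of this objective can be sketched without a neural network. The example assumes the linear β schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps) and replaces the learned ε_θ with an oracle that recovers the noise from the known clean sample, purely to check the algebra; a real model never sees the clean sample at this point.

```python
import math
import random

def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars

def q_sample(x0, eps, alpha_bar_t):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def simple_loss(eps, eps_pred):
    """Squared L2 distance between the true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
bars = alpha_bars()
x0 = [0.5, -1.0, 0.25]               # hypothetical data point
t = rng.randrange(len(bars))         # uniformly sampled timestep
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = q_sample(x0, eps, bars[t])

# Oracle stand-in for eps_theta(x_t, t): solve the forward equation for eps.
a, b = math.sqrt(bars[t]), math.sqrt(1.0 - bars[t])
eps_oracle = [(xt - a * x) / b for xt, x in zip(x_t, x0)]
loss_oracle = simple_loss(eps, eps_oracle)        # approximately zero
loss_zero_guess = simple_loss(eps, [0.0] * len(eps))
```

The schedule drives ᾱ_t from nearly 1 toward nearly 0, so x_t moves from almost clean data to almost pure noise as t grows.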

Self-attention and transformer basics

  • Attention computes weighted sums of values V, where weights come from similarities between queries ...
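
The weighted-sum description corresponds to scaled dot-product attention, softmax(QK^T/√d_k)V. A dependency-free sketch follows, with small hypothetical Q, K, and V matrices given as lists of row vectors; real implementations use batched tensor operations and multiple heads.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V; returns (outputs, attention weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors, weighted toward the keys most similar to that query.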
