
How does generative AI work?

Generative AI: Definition

Generative AI models learn a data distribution p(x) (or a conditional distribution p(x|c)) and synthesize new examples that resemble the training data: images, text, audio, video, 3D shapes, molecules, code, and so on. They contrast with discriminative models, which predict labels given inputs.

Short history & milestones

Pre-deep-learning: probabilistic models, mixtures, HMMs.

2013–2014: VAEs (reparameterization) and early autoregressive models.

2014: GANs introduced adversarial training, which later enabled photorealistic images.

2017: Transformers enabled large-scale sequence models (GPT family).

2019–2022: Score-based and diffusion models (DDPM, score matching) rose to prominence for image synthesis.

2020–2024: Scaling Transformers and latent diffusion (Stable Diffusion) drove major capability gains and multimodal systems.

Main families of generative models

Autoregressive: factorize p(x) = ∏_i p(x_i | x_{<i}).

Variational Autoencoders (VAEs): latent-variable models that maximize the ELBO with an encoder q(z|x) and a decoder p(x|z). Principled, but can produce blurrier outputs.

GANs: a minimax game between a generator and a discriminator. Produces sharp images, but training can be unstable and there is no tractable likelihood.

Flow-based models (normalizing flows): invertible transforms with exact likelihood via the change-of-variables formula (RealNVP, Glow). Exact likelihood, but at the cost of architectural constraints.

Score-based / diffusion models: learn a denoiser or the score ∇_x log p_t(x) to reverse a noising process. State-of-the-art image quality; sampling typically requires many steps (DDPM, DDIM, latent diffusion).

Core theoretical foundations

Maximum likelihood, with cross-entropy and perplexity for discrete sequences.

Latent-variable ELBO for VAEs: ELBO = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z)); the reparameterization trick enables gradient-based learning.

Adversarial objectives for GANs: the minimax game and practical variants (non-saturating, WGAN) that improve stability.
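To make the ELBO above concrete, here is a minimal sketch for one common special case: a diagonal-Gaussian encoder q(z|x) = N(μ, diag(σ²)) and a standard-normal prior p(z) = N(0, I). Both choices are assumptions for illustration (the text does not fix them), and the function names are hypothetical. In this case the KL term has a closed form, and the reparameterized sample is z = μ + σ⊙ε.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def elbo(recon_log_lik, mu, log_var):
    """ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)).

    recon_log_lik is assumed to be a (single-sample) estimate of
    E_q[log p(x|z)] supplied by some decoder; no decoder is modeled here.
    """
    return recon_log_lik - kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -1.0], [0.0, -2.0]
    z = reparameterize(mu, log_var, rng)
    print("z sample:", z)
    print("KL term:", kl_to_standard_normal(mu, log_var))
```

Because the randomness enters only through ε, the sample z is a differentiable function of μ and log σ², which is what lets a framework with automatic differentiation push gradients through the sampling step.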
Score matching / diffusion objective: train a denoiser or score network (e.g., L = E[||ε − ε_θ(x_t, t)||^2] in DDPM).

Transformers & attention: attention(Q, K, V) = softmax(QK^T / √d_k) V; multi-head attention plus feed-forward layers are central to modern sequence models.

Architectures & building blocks

Encoders/decoders, CNNs and U-Nets (images), RNNs (less common now), Transformers (dominant for text and many multimodal tasks).

Attention mechanisms, positional encodings, residual connections, normalization layers.

Graph and equivariant networks for molecules and 3D data.

Conditioning & guidance: prompt tokens, classifier guidance, classifier-free guidance, fine-grained control methods.

Training and sampling procedures

Optimization: SGD variants (Adam/AdamW), large-scale pretraining, mixed precision, data augmentation.

Autoregressive sampling: sequential token sampling with temperature, top-k, nucleus (top-p), or beam search.

VAE/Flow/GAN sampling: draw a latent z and decode (flows are invertible deterministic transforms).

Diffusion sampling: start from noise and run the learned reverse process (ancestral DDPM, deterministic DDIM, improved SDE solvers); classifier-free guidance amplifies conditioning.

Evaluation metrics & limitations

Images: FID, Inception Score, precision/recall, human evaluation.

Text: perplexity and task metrics (BLEU/ROUGE), but open-ended quality needs human judgment (coherence, factuality).

Audio: MOS and reconstruction measures.

Limitations: metrics can be gamed and may not capture diversity/fidelity trade-offs; human evaluation remains essential.

Practical applications & representative systems

Text: GPT family, Codex for code, chat assistants.

Images: DALL·E, Imagen, Stable Diffusion, Midjourney (text-to-image).

Audio/music: WaveNet, Jukebox, MusicLM.

Video: early diffusion/autoregressive video models (computationally heavy).

Science & design: molecular generation, materials discovery, 3D shape synthesis.
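The autoregressive sampling controls listed under training and sampling procedures (temperature, top-k, nucleus/top-p) can be sketched for a single decoding step as follows. This is a minimal illustration over a plain list of logits, not any particular library's implementation; the function names and the toy logits are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Zero out tokens outside the top-k set and/or the smallest nucleus
    whose cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    filtered = [0.0] * len(probs)
    for i in order:
        filtered[i] = probs[i] / total
    return filtered


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token index from the filtered, renormalized distribution."""
    probs = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical 4-token vocabulary
    rng = random.Random(0)
    print([sample_token(toy_logits, temperature=0.8, top_p=0.9, rng=rng)
           for _ in range(10)])
```

Setting top_k=1 reduces this to greedy decoding; beam search, also mentioned above, is a different procedure that tracks several candidate sequences and is not covered by this sketch.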
Representative technologies: CLIP (contrastive vision-language), Latent Diffusion, VQ-VAE, RealNVP/Glow.

Implementation patterns (high level)

Autoregressive: turn the model's logits into a sampling distribution with temperature, top-k, or top-p, then sample the next token.

VAE: encoder → (μ, σ) → reparameterize z = μ + σ⊙ε → decode; optimize reconstruction loss plus KL.

GAN: alternate discriminator and generator updates with adversarial losses.

Diffusion: add noise at a random timestep t, train the model to predict the added noise ε or the clean sample x_0 from x_t; sample by reversing the noise schedule.

Challenges, risks & mitigations

Risks: hallucinations, bias amplification, deepfakes, IP/privacy leakage, environmental cost.

Mitigations: RLHF/alignment, safety filters, provenance/watermarking, differential privacy, dataset curation, audits and red-teaming.

Current state (mid-2024)

Autoregressive LMs dominate text; retrieval and grounding improve factuality.

Diffusion models lead high-fidelity image synthesis; latent diffusion enables efficient high-resolution generation.

Multimodal foundation models and open-source releases (Stable Diffusion, LLaMA variants) have widened access and innovation.

Active research continues on faster sampling, controllability, safety, and alignment.

Future directions & open problems

Compute efficiency: sparsity, MoE, quantization, hardware/software co-design.

Better controllability, compositionality, and modular combination of models.

Robust evaluation metrics that correlate with human judgment.

Long-form multimodal coherence (long videos, narratives) and scientific discovery applications.

Governance: watermarking, legal frameworks, accountability mechanisms.

Recommended readings & resources

Foundational papers: Goodfellow et al. (GANs), Kingma & Welling (VAEs), Vaswani et al. (Transformers), Ho et al. (DDPM), Song & Ermon (score-based), Rombach et al. (Latent Diffusion).

Textbooks/tutorials: Goodfellow/Bengio/Courville "Deep Learning", Hugging Face and OpenAI technical blogs, Stanford deep generative model courses.
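The diffusion pattern described under implementation patterns (add noise at a random t, train the model to predict the noise) can be sketched as below. The sketch assumes the closed-form DDPM forward process x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε with a linear β schedule; the schedule endpoints, the function names, and the `predict_noise` callable (a stand-in for a trained network ε_θ) are illustrative assumptions, and no actual network or optimizer is included.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (endpoints are assumptions)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0; returns (x_t, eps) so eps can be the target."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


def noise_prediction_loss(eps, eps_pred):
    """Squared error ||eps - eps_pred||^2 for one example."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def training_loss(x0, predict_noise, alpha_bars, rng):
    """One training example: random timestep, noised input, noise-prediction loss.

    predict_noise(x_t, t) is a hypothetical stand-in for the network eps_theta.
    """
    t = rng.randrange(len(alpha_bars))
    x_t, eps = add_noise(x0, t, alpha_bars, rng)
    return noise_prediction_loss(eps, predict_noise(x_t, t))


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    untrained = lambda x_t, t: [0.0] * len(x_t)  # placeholder "model"
    print(training_loss([1.0, -1.0, 0.5], untrained, alpha_bars, rng))
```

In a real training loop the loss would be averaged over a batch and minimized with respect to the network parameters; sampling then runs the learned reverse process from pure noise, which this sketch does not cover.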
Concise takeaway

Generative AI comprises complementary paradigms (autoregressive, VAE, GAN, flow, diffusion) that trade off likelihood tractability, sample quality, and compute. Transformers currently lead in text and diffusion methods in images. Progress focuses on efficiency, control, alignment, and safer deployment, while applications span creative, scientific, and practical domains.


