How does generative AI work?

May 10, 2026

12 min read

Table of contents

Introduction and definition
Short history and milestones
Main families of generative models
- Autoregressive models
- Variational Autoencoders (VAEs)
- Generative Adversarial Networks (GANs)
- Flow-based models
- Score-based / Diffusion models
Core theoretical foundations and math
- Probability factorization and likelihood
- Maximum likelihood and cross-entropy
- Latent-variable modelling and ELBO
- Adversarial objectives
- Score matching and diffusion objective
- Self-attention and transformer basics
Architectures and building blocks
- Encoders/decoders, CNNs, RNNs, Transformers
- Attention mechanism and positional encoding
- Conditioning, control, and guidance
Training and sampling procedures
- Training objectives and optimization
- Sampling algorithms (ancestral, beam, nucleus/top-p, temperature)
- Diffusion sampling (DDPM, deterministic samplers, classifier-free guidance)
Evaluation metrics and limitations
Practical applications and examples
- Text, images, audio, video, molecules, code
- Representative models and systems
Implementation examples (short code & pseudocode)
Challenges, risks, and safety mitigations
Current state of the field (as of mid-2024)
Future directions and open problems
Recommended readings and resources
Summary

Introduction and definition

Generative AI refers to machine learning systems that learn a model of data and use it to generate new, previously unseen examples that resemble the training distribution. These models can produce images, text, audio, video, 3D shapes, molecules, code, and more. “Generative” contrasts with “discriminative”: discriminative models predict labels given inputs, while generative models attempt to model the data distribution itself, p(x), or a conditional distribution p(x|c).

Short history and milestones

Pre-deep-learning era: probabilistic models, mixture models, HMMs (1990s–2000s).
2013–2014: Reparameterization trick and VAEs (Kingma & Welling) enabled scalable variational learning for deep latent-variable models.
2014: Generative Adversarial Networks (GANs): Goodfellow et al. introduced adversarial training to generate realistic images.
Mid-2010s: Autoregressive models such as PixelRNN/PixelCNN for images, and the Transformer (Vaswani et al., 2017) for sequence modeling, which led to large language models (GPT series).
2019–2022: Diffusion models and score-based models regained prominence for high-quality image generation (e.g., DDPM, 2020; Score-based models, Song & Ermon).
2021–2023: Multimodal models (CLIP, ALIGN) and latent diffusion (Stable Diffusion) made high-resolution image synthesis efficient.
2020–2024: Scaling Transformers produced dramatic gains in text generation, code generation, and few-shot learning (GPT-3, PaLM, LLaMA, etc.). Diffusion and autoregressive approaches are both dominant paradigms for different media.

Main families of generative models

Generative models can be grouped roughly by how they model the data and how they generate samples.

Autoregressive models

Principle: Factorize joint probability p(x) as product of conditionals using chain rule: p(x) = ∏ p(x_i | x_<i).
Examples: Recurrent LMs, Transformer-based models (GPT), PixelRNN/PixelCNN (images), WaveNet (audio).
Pros: Training is likelihood-based (stable), strong sample quality, exact factorization for likelihood evaluation.
Cons: Slow generation (sequential), can be expensive for high-dimensional data.

Mathematical form: p(x) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ...
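The chain-rule factorization can be checked on a toy example. The sketch below assumes a hypothetical first-order (bigram) model over a two-symbol vocabulary; the probability tables are made up for illustration, and a real autoregressive model would condition on the full prefix rather than one previous symbol.

```python
import math

# Hypothetical toy model: p(x_1) and p(x_i | x_{i-1}) over the vocabulary {"a", "b"}.
# A bigram model is a special case of p(x_i | x_<i) that only looks one step back.
P_FIRST = {"a": 0.6, "b": 0.4}
P_NEXT = {
    "a": {"a": 0.7, "b": 0.3},
    "b": {"a": 0.2, "b": 0.8},
}

def sequence_log_prob(seq):
    """log p(x) = log p(x_1) + sum over i of log p(x_i | x_{i-1})."""
    log_p = math.log(P_FIRST[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(P_NEXT[prev][cur])
    return log_p

def all_sequences(length):
    """Enumerate every sequence of the given length over the vocabulary."""
    seqs = [""]
    for _ in range(length):
        seqs = [s + ch for s in seqs for ch in "ab"]
    return seqs

# Because each conditional is normalized, the joint sums to 1 over all sequences.
total = sum(math.exp(sequence_log_prob(s)) for s in all_sequences(3))
```

Summing the joint over all length-3 sequences gives 1 up to floating-point error, which is what makes exact likelihood evaluation possible for this model family.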

Variational Autoencoders (VAEs)

Principle: Introduce latent variable z and model p(x) = ∫ p(x|z)p(z) dz. Use variational inference to maximize Evidence Lower BOund (ELBO).
Encoder maps x -> q(z|x) (approximate posterior), decoder p(x|z) generates.
Key trick: reparameterization allows backprop through sampling.
Pros: Principled probabilistic framework, latent representations useful for manipulation.
Cons: Often produce blurry images (a simple Gaussian decoder trained on likelihood tends to average over plausible outputs), trade-offs between reconstruction and regularization.

ELBO: log p(x) ≥ E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z))
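One way to see why the ELBO is a lower bound is to write log p(x) as the ELBO plus a non-negative KL term. The derivation below is a standard sketch; it only assumes that q(z|x) is a valid density with the same support as the true posterior p(z|x).

```latex
\begin{aligned}
\log p(x)
  &= \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{p(z|x)}\right] \\
  &= \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{q(z|x)}\right]
   + \mathbb{E}_{q(z|x)}\!\left[\log \frac{q(z|x)}{p(z|x)}\right] \\
  &= \underbrace{\mathbb{E}_{q(z|x)}[\log p(x|z)]
     - \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)}_{\text{ELBO}}
   + \mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big)
\end{aligned}
```

Since the last KL term is non-negative, log p(x) ≥ ELBO, with equality exactly when q(z|x) matches the true posterior p(z|x).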

Generative Adversarial Networks (GANs)

Principle: Two networks: generator G(z) that maps noise z to data space, discriminator D(x) that tries to distinguish real vs fake. Train via a minimax game.
Original objective (Goodfellow et al., 2014): min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]
Pros: Very sharp, realistic samples (especially for images).
Cons: Training instability, mode collapse, lack of explicit likelihood (harder to evaluate probability).

Flow-based models (Normalizing Flows)

Principle: Construct invertible transformations f that map data x to latent z with tractable Jacobian determinant, using change-of-variables formula. Allows exact log-likelihood and sampling.
Examples: RealNVP, Glow.
Pros: Exact likelihood, exact inversion between data and latent space, sampling in a single pass through the inverse map.
Cons: Architectural constraints (invertibility, Jacobian computation) can limit expressivity.

Change of variable: log p(x) = log p_z(f(x)) + log |det ∂f(x)/∂x|
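The change-of-variables formula can be verified on the simplest possible flow. The sketch below assumes a one-dimensional affine map z = f(x) = (x - shift) / scale with a standard normal base density; the function names are illustrative and not taken from any flow library.

```python
import math

def standard_normal_log_pdf(z):
    """log density of N(0, 1)."""
    return -0.5 * (z * z + math.log(2 * math.pi))

def flow_log_prob(x, shift, scale):
    """log p(x) for the affine flow z = f(x) = (x - shift) / scale, base p_z = N(0, 1)."""
    z = (x - shift) / scale
    log_det = -math.log(abs(scale))      # log |df/dx| = log |1 / scale|
    return standard_normal_log_pdf(z) + log_det

def normal_log_pdf(x, mean, std):
    """Reference: closed-form log density of N(mean, std^2)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
```

For this flow the formula reproduces the density of N(shift, scale^2), so `flow_log_prob(x, shift, scale)` and `normal_log_pdf(x, shift, scale)` agree. Models such as RealNVP and Glow stack many invertible layers and sum their log-determinants in the same way.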

Score-based / Diffusion models

Principle: Define a forward noising process q(x_t | x_{t-1}) that gradually adds Gaussian noise; learn a reverse denoising model p_θ(x_{t-1} | x_t) or directly learn the score ∇_x log p_t(x). Sampling runs the learned reverse process to map pure noise to data.
Important works: Sohl-Dickstein et al. (2015), Song & Ermon (score-based), Ho et al. (DDPM, 2020), Nichol & Dhariwal, and Latent Diffusion (Rombach et al., 2022).
Pros: State-of-the-art image quality, flexible conditioning methods, strong theoretical foundations via score matching and stochastic differential equations.
Cons: Sampling requires many steps (though efficient samplers and distillation reduce steps), computational cost.

Core theoretical foundations and math

Probability factorization and likelihood

Maximum likelihood estimation (MLE) is a central principle: choose model parameters θ to maximize ∑_i log p_θ(x^(i)).
For autoregressive models, exact likelihood is tractable because of factorization.

Cross-entropy and perplexity

For discrete sequences, the training loss is the negative log-likelihood (cross-entropy). Perplexity, the exponential of the average per-token negative log-likelihood, is the usual summary metric for language models.
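As a small worked example of the relationship between cross-entropy and perplexity, the sketch below takes the probabilities a model assigned to each correct next token (the numbers are hypothetical) and computes both quantities in nats.

```python
import math

def cross_entropy(token_probs):
    """Average negative log-likelihood (in nats) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """exp of the average negative log-likelihood."""
    return math.exp(cross_entropy(token_probs))

# Hypothetical probabilities assigned to each correct next token.
probs = [0.5, 0.25, 0.125, 0.25]
ppl = perplexity(probs)
```

A model that spreads its probability uniformly over N tokens has perplexity N, so perplexity can be read as an effective branching factor.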

Latent-variable modelling and ELBO (VAE)

Goal: maximize log p(x). Because p(x) = ∫ p(x|z)p(z) dz is intractable, introduce q(z|x) and optimize ELBO: ELBO(x) = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)) ≤ log p(x).
Reparameterization: z = μ + σ * ε, ε ∼ N(0, I) to allow backprop.
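The reparameterized sample z = μ + σ ε can also be used to sanity-check the KL term of the ELBO. The sketch below assumes a one-dimensional Gaussian posterior q(z|x) = N(μ, σ²) and a standard normal prior, and compares the closed-form KL with a Monte Carlo estimate; the helper names are illustrative.

```python
import math
import random

def kl_closed_form(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) in closed form."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))

def kl_monte_carlo(mu, sigma, num_samples, rng):
    """Estimate E_q[log q(z) - log p(z)] with reparameterized samples."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps                        # reparameterized sample from q
        log_q = -0.5 * eps ** 2 - math.log(sigma)   # log N(z; mu, sigma^2), constant dropped
        log_p = -0.5 * z ** 2                       # log N(z; 0, 1), same constant dropped
        total += log_q - log_p
    return total / num_samples

exact = kl_closed_form(1.0, 0.5)
estimate = kl_monte_carlo(1.0, 0.5, 20000, random.Random(0))
```

The two values agree up to sampling noise. In a VAE the closed form is normally used for the KL term, while the reconstruction term relies on the reparameterized sample so that gradients flow to μ and σ.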

Adversarial objectives (GANs)

Minimax game with generator G and discriminator D: min_G max_D V(D, G) = E_{x∼p_data} [log D(x)] + E_{z∼p(z)} [log(1 - D(G(z)))]
Practical variants: non-saturating loss, least-squares GAN, Wasserstein GAN (WGAN with gradient penalty) to stabilize training.
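The motivation for the non-saturating variant can be seen from the gradients of the two generator losses with respect to the discriminator logit a, where D(G(z)) = sigmoid(a). The sketch below is a scalar illustration with a hypothetical logit value, not a training loop.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    """d/da of log(1 - sigmoid(a)): the original minimax generator loss."""
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da of -log(sigmoid(a)): the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))

# Early in training the discriminator confidently rejects fakes, so D(G(z)) is near 0.
logit = -6.0   # hypothetical discriminator logit, D(G(z)) is roughly 0.0025
sat = saturating_grad(logit)
non_sat = non_saturating_grad(logit)
```

When the discriminator is confident, the minimax loss gives the generator almost no gradient, while the non-saturating loss still provides a gradient of magnitude close to 1.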

Score matching and diffusion objective

Score-based methods approximate the score ∇_x log p_t(x) with a model s_θ(x, t) trained by denoising score matching. Predicting the noise added at time t is equivalent up to scaling, since s_θ(x_t, t) = -ε_θ(x_t, t) / sqrt(1 - ᾱ_t). DDPM uses the simplified loss L = E_{x,ε,t} [|| ε - ε_θ(x_t, t) ||^2], where x_t = sqrt(ᾱ_t) x + sqrt(1 - ᾱ_t) ε.
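The closed-form noising step x_t = sqrt(ᾱ_t) x + sqrt(1 - ᾱ_t) ε can be sketched in a few lines. The example below assumes a linear beta schedule from 1e-4 to 0.02 over 1000 steps (a commonly used default, chosen here as an assumption) and uses plain Python lists in place of tensors.

```python
import math
import random

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def noise_sample(x, t, alpha_bar, rng):
    """Return x_t = sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * eps and the eps used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    a = alpha_bar[t]
    x_t = [math.sqrt(a) * xi + math.sqrt(1.0 - a) * ei for xi, ei in zip(x, eps)]
    return x_t, eps

alpha_bar = make_alpha_bar(1000)
x_t, eps = noise_sample([1.0, -2.0, 0.5], t=999, alpha_bar=alpha_bar, rng=random.Random(0))
```

With this schedule ᾱ_t falls from just under 1 to nearly 0, so early steps barely perturb the data and the final step is almost pure Gaussian noise. The pair (x_t, eps) is exactly the input and regression target of the simplified loss.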

Self-attention and transformer basics

Attention computes weighted sums of values V, where weights come from similarities between queries Q and keys K: Attention(Q,K,V) = softmax(Q K^T / sqrt(d_k)) V
Transformer layer = Multi-head attention + feed-forward blocks, with residual connections and layer norm. Scales well for long-range dependencies and parallel computation.
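The attention formula above can be written out directly. The sketch below is a single-head, unbatched, unmasked version on plain Python lists, meant only to make the computation concrete; real implementations use batched tensor operations and multiple heads, and the toy Q, K, V values are made up.

```python
import math

def softmax(row):
    m = max(row)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors (no batching, no masking)."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, and each output row is a convex combination of the value vectors. The first query matches keys 0 and 2 equally well, so those two keys receive equal weight.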

Architectures and building blocks

Encoders, decoders, CNNs, RNNs, Transformers

Images: convolutional architectures or U-Nets (common in diffusion models).
Text/sequences: Transformers are dominant.
Audio: WaveNet-like autoregressive, GANs, diffusion for spectrograms.
3D & molecules: Graph neural networks, equivariant networks.

Conditioning, control, and guidance

Conditional generation uses conditioning c (text prompt, class label, sketch). Many methods:
- Conditional autoregressive: include prompt tokens.
- Conditional GAN: condition both G and D.
- Classifier guidance (diffusion): use classifier gradient to bias sampling toward a class.
- Classifier-free guidance: train model to handle unconditional and conditional inputs and combine predictions to amplify conditioning at sample time.
- Prompt engineering and controlled sampling (temperature, top-k, nucleus).
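The way classifier-free guidance combines predictions can be sketched as a linear extrapolation. The example below assumes the convention ε_guided = ε_uncond + w (ε_cond - ε_uncond), where w = 1 recovers the conditional prediction; some papers write the same idea with a scale of 1 + w, so the exact parameterization here is an assumption.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Hypothetical noise predictions from the same model, without and with the prompt.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.0)
```

Scales above 1 push the sample further in the direction the conditioning suggests, which typically improves prompt adherence at some cost in diversity.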

Training and sampling procedures

Training objectives and optimization

Typically use stochastic gradient descent-based optimizers (Adam, AdamW).
Large models trained over many steps on large datasets; regularization, data augmentation, and mixed precision are common.

Sampling algorithms

Autoregressive: sample tokens sequentially from p(x_t | x_<t) using:
- Temperature scaling: p_i ∝ exp(logits_i / T); T > 1 flattens the distribution, T < 1 sharpens it.
- Top-k: restrict to top k tokens.
- Nucleus (top-p): restrict to smallest set with cumulative mass ≥ p.
- Beam search: keep top-B hypotheses (used in structured prediction; can harm diversity for open-ended generation).
GANs/Flows/VAEs: sample z from the latent prior, then map it to data deterministically (GAN generator, inverse flow) or by sampling from the decoder distribution p(x|z) (VAE).
Diffusion: start from noise x_T ∼ N(0,I), run reverse process using learned denoiser via many steps; various samplers trade off speed/quality (ancestral DDPM, deterministic DDIM, improved SDE solvers).
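The effect of temperature on an autoregressive sampling distribution can be made concrete by measuring entropy. The sketch below uses hypothetical next-token logits and compares T = 0.5, 1.0, and 2.0.

```python
import math

def softmax_with_temperature(logits, temperature):
    """p_i proportional to exp(logits_i / T)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0]          # hypothetical next-token logits
sharp = softmax_with_temperature(logits, 0.5)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

Entropy increases with temperature: the T = 0.5 distribution concentrates on the top token, while T = 2.0 moves mass toward the less likely tokens.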

Diffusion sampling (high-level pseudocode)

Plain Text

x_T ∼ N(0, I)
for t = T,...,1:
    predict ε_θ(x_t, t)
    compute mean μ_θ(x_t, t)
    sample x_{t-1} ∼ N(μ_θ, Σ_t)  # or deterministic update for DDIM
return x_0
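The pseudocode above can be turned into a runnable toy. The sketch below makes several assumptions that are not in the original: the data is one-dimensional Gaussian N(2, 0.5²), the trained network is replaced by the exact noise predictor E[ε | x_t] (available in closed form for Gaussian data), the schedule is linear with 200 steps, and Σ_t = β_t.

```python
import math
import random

T = 200
# Assumed linear beta schedule, chosen so alpha_bar is close to 0 after only 200 steps.
betas = [1e-4 + (0.1 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bar, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bar.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5   # toy 1-D data distribution N(2, 0.5^2)

def eps_theta(x_t, t):
    """Stand-in for a trained network: the exact E[eps | x_t] for Gaussian data."""
    ab = alpha_bar[t]
    var_t = ab * DATA_STD ** 2 + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * DATA_MEAN) / var_t

def ddpm_sample(rng):
    x = rng.gauss(0.0, 1.0)                            # x_T ~ N(0, 1)
    for t in reversed(range(T)):                       # 0-based index for steps T..1
        coef = betas[t] / math.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps_theta(x, t)) / math.sqrt(alphas[t])
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the final step
        x = mean + math.sqrt(betas[t]) * noise         # Sigma_t = beta_t (one common choice)
    return x

rng = random.Random(0)
samples = [ddpm_sample(rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

Starting from pure noise, the samples end up with mean and standard deviation close to those of the toy data distribution. In practice `eps_theta` would be a neural network (often a U-Net) trained with the simplified noise-prediction loss.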

Evaluation metrics and limitations

Images: FID (Fréchet Inception Distance), IS (Inception Score), precision/recall for distributions, human evaluations.
Text: Perplexity (likelihood), BLEU/ROUGE (for specific tasks), but open-ended generation requires human eval for coherence, factuality, creativity.
Audio: MOS (Mean Opinion Score), spectrogram reconstruction metrics.
General limitations: metrics can be gamed, don't fully capture diversity/fidelity tradeoffs, and human judgment remains important.
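To make the Fréchet distance behind FID concrete, the sketch below computes it between two Gaussians under a simplifying assumption: diagonal covariances, for which the matrix square root reduces to elementwise square roots. Actual FID fits full-covariance Gaussians to Inception network features of real and generated images, which this example does not do.

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between Gaussians with diagonal covariances.

    General form: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).
    With diagonal S1, S2 the trace term becomes sum(v1 + v2 - 2 sqrt(v1 v2)).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

# Hypothetical feature statistics for "real" and "generated" sets.
d_same = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
d_shift = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
```

The distance is zero for identical statistics and grows with both mean shift and covariance mismatch, which is why lower FID is read as generated features being closer to real ones.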

Practical applications and examples

Text: GPT-family (text generation, summarization, chat), code generation (Codex, AlphaCode).
Images: DALL·E, Imagen, Stable Diffusion, Midjourney (text-to-image synthesis).
Audio & Music: WaveNet, Jukebox, MusicLM.
Video: emerging diffusion-based and autoregressive models for short clips; heavy compute and alignment challenges.
Molecules & drug discovery: generative models for molecular graphs, diffusion and graph VAEs for inverse design.
3D & CAD: generative models for shapes and meshes, helpful in design automation.
Data augmentation, simulation, compression, content creation, personalization, and interactive assistants.

Representative systems (non-exhaustive)

Autoregressive text: GPT-3 / GPT-4 (OpenAI), PaLM (Google), LLaMA (Meta).
Diffusion images: DDPM, Glide, Stable Diffusion (latent diffusion), Imagen.
Conditional multimodal: CLIP (contrastive vision-language), Flamingo (few-shot multimodal), DALL·E 2 (diffusion conditioned on CLIP embeddings).
Flows & VAEs: RealNVP, Glow, VQ-VAE (vector quantized VAE used in discrete latent models).

Implementation examples (short code & pseudocode)

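Minimal autoregressive sampling (PyTorch)

Python

import torch

# logits: 1-D tensor of shape [vocab_size]
def sample_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id after optional top-k and nucleus (top-p) filtering."""
    probs = torch.softmax(logits / temperature, dim=-1)
    if top_k is not None:
        # Zero out everything below the k-th largest probability.
        kth = torch.topk(probs, top_k).values[-1]
        probs = torch.where(probs < kth, torch.zeros_like(probs), probs)
    if top_p is not None:
        # Nucleus: keep the smallest set with cumulative mass >= top_p.
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
        drop = mass_before >= top_p          # tokens not needed to reach top_p
        probs = probs.index_fill(0, sorted_idx[drop], 0.0)
    probs = probs / probs.sum()              # renormalize after filtering
    return torch.multinomial(probs, num_samples=1).item()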

VAE reparameterization trick (PyTorch-like)

Python

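import torch
import torch.nn.functional as F

# Toy stand-ins (assumptions): linear encoder/decoder, 784-dim inputs, 16-dim latent.
encoder = torch.nn.Linear(784, 2 * 16)
decoder = torch.nn.Linear(16, 784)
x = torch.rand(8, 784)
beta = 1.0                                     # KL weight (beta = 1 is the standard ELBO)

mu, logvar = encoder(x).chunk(2, dim=-1)       # encoder outputs posterior params
std = (0.5 * logvar).exp()
eps = torch.randn_like(std)
z = mu + eps * std                             # reparameterization
recon = decoder(z)
recon_loss = F.mse_loss(recon, x, reduction="sum")
kl = 0.5 * torch.sum(mu**2 + std**2 - 1 - logvar)
loss = recon_loss + beta * kl                  # negative ELBO (up to constants)
loss.backward()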

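Simplified GAN training loop (PyTorch, toy setup)

Python

import torch
import torch.nn as nn

# Toy stand-ins (assumptions): 2-D data, 8-D noise, tiny MLPs, random "real" batches.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
g_optimizer = torch.optim.Adam(G.parameters(), lr=2e-4)
d_optimizer = torch.optim.Adam(D.parameters(), lr=2e-4)
dataloader = [torch.randn(64, 2) for _ in range(10)]
tiny = 1e-8                                    # avoids log(0)

for real in dataloader:
    # Update discriminator: minimize -E[log D(real)] - E[log(1 - D(fake))]
    fake = G(torch.randn(real.size(0), 8)).detach()
    d_loss = -(torch.log(D(real) + tiny).mean() + torch.log(1 - D(fake) + tiny).mean())
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()
    # Update generator with the non-saturating loss: minimize -E[log D(fake)]
    fake = G(torch.randn(real.size(0), 8))
    g_loss = -torch.log(D(fake) + tiny).mean()
    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()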

Diffusion training objective (simplified)

Python

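import torch

# Toy stand-ins (assumptions): linear beta schedule, tiny model that ignores t.
T = 1000
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)
model = torch.nn.Linear(2, 2)                  # a real denoiser would also take t
x = torch.randn(64, 2)                         # x: real data

t = torch.randint(0, T, (x.size(0),))          # one random timestep per example
eps = torch.randn_like(x)
a = alpha_bar[t].unsqueeze(-1)                 # shape [batch, 1] for broadcasting
x_t = a.sqrt() * x + (1 - a).sqrt() * eps
pred_eps = model(x_t)
loss = torch.nn.functional.mse_loss(pred_eps, eps)
loss.backward()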

Challenges, risks, and safety mitigations

Hallucinations and factual errors: LLMs can produce plausible-sounding but incorrect statements.
Bias and fairness: Models can reproduce or amplify societal biases in training data.
Deepfakes and misuse: Realistic synthetic media can be used maliciously.
Intellectual property and attribution: Generated content may replicate or derive from copyrighted works.
Privacy: Models may memorize and regurgitate sensitive training data.
Environmental/compute cost: Large models require enormous compute and energy.

Mitigations:
Alignment techniques: RLHF (reinforcement learning from human feedback) to align outputs with preferences.
Safety filters: classifiers and rule-based blocking for harmful content.
Watermarking and provenance: embed identifiable signatures in generated outputs.
Differential privacy and dataset curation: reduce memorization of sensitive data.
Model audits, red teaming, and transparency on limitations.

Current state of the field (as of mid-2024)

Large autoregressive LMs remain dominant in text; retrieval-augmented generation and grounding (tools, search, external knowledge) are common for factuality.
Diffusion models lead for high-fidelity image synthesis; latent diffusion brings efficiency for high-res imagery and conditioning (text-to-image).
Multimodal models that combine text, vision, audio are advancing rapidly; foundation models power many downstream applications.
Efforts focus on controllability (style, structure, constraints), sample efficiency, faster sampling (few-step diffusion via distillation and improved samplers), and safety (alignment pipelines, RLHF).
Democratization: open-source models (Stable Diffusion, LLaMA variants) enable broad experimentation, while commercial models emphasize safety and scale.

Future directions and open problems

Efficiency and scaling: sparse models, Mixture-of-Experts, quantization, and hardware co-design to reduce compute.
Controllability and compositionality: better ways to precisely instruct models and combine them modularly.
Robust evaluation: improved metrics that correlate with human judgment for open-ended generation.
Long-form coherent multimodal content (long videos, interactive narratives).
Scientific discovery: generative models for molecules, materials, and complex simulations.
Regulation and governance: legal frameworks, standards for disclosure/watermarking, and responsibilities for deployment.
Alignment and reliable truthfulness: reducing hallucinations and making models verifiable and accountable.