Generative AI explained for beginners
Table of contents
- Introduction — what is generative AI?
- Short history and milestones
- Key concepts and terminology
- Core model families (intuitions and differences)
  - Autoregressive / Transformer language models
  - Variational Autoencoders (VAEs)
  - Generative Adversarial Networks (GANs)
  - Diffusion and score-based models
  - Normalizing flows and other families
- Theoretical foundations (simple, non-technical)
  - Probability, likelihood, and sampling
  - Optimization objectives: MLE, adversarial loss, variational lower bound, score matching
  - Latent spaces and representation learning
- How generative models are trained (practical view)
  - Datasets and preprocessing
  - Compute and hardware considerations
  - Fine-tuning, few-shot and in-context learning
  - Safety, alignment, and RLHF (overview)
- How generative AI is used today (applications, examples)
  - Text: writing, summarization, code generation, chatbots
  - Images and video: image synthesis, editing, inpainting, video generation
  - Audio and music: speech synthesis, music generation
  - Science and design: molecules, materials, drug discovery, CAD
  - Business: personalization, marketing assets, automation
- Example workflows & short code examples
  - Simple text generation with Transformers
  - Simple image generation with Diffusers (Stable Diffusion)
- Evaluation: how we measure quality
  - Automated metrics and human evaluation
- Strengths, limitations and risks
  - Hallucinations, bias, copyright, misuse
  - Environmental and compute costs
- Best practices for beginners and practitioners
  - Responsible usage checklist
  - Prompt engineering basics
  - When to fine-tune vs use prompting vs RAG
- The future: trends and likely directions
- Glossary (short)
- Key references and resources to learn more
- Frequently asked questions (brief)
Introduction — what is generative AI?
Generative AI refers to algorithms and models that create new content: text, images, audio, video, 3D shapes, molecular structures and more. Instead of just predicting a label, these models learn to produce samples that resemble the data they were trained on. The goal is to model the underlying distribution of real-world data and draw samples from it. In practice this looks like writing stories, producing photorealistic images, composing music, generating software code, or designing molecules.
Short history and milestones
- Pre-2010s: basic probabilistic models (n-grams), early neural generative models.
- 2013: Variational Autoencoder (Kingma & Welling) formalizes latent-variable generative modeling.
- 2014: Generative Adversarial Networks (GANs, Goodfellow et al.) introduce adversarial training and achieve sharp image samples.
- 2015–2019: Progress in GAN stabilization, conditional GANs, and autoregressive models (PixelRNN/PixelCNN).
- 2017: The Transformer architecture (Vaswani et al.) revolutionizes sequence modeling; it later becomes the basis for the GPT family.
- 2018–2021: Large pretrained language models (BERT, the GPT series) scale up; GPT-3 (2020) shows strong zero- and few-shot abilities.
- 2020: Denoising diffusion models (Ho et al.) revive diffusion-based generation, which is then scaled to state-of-the-art image quality.
- 2021–2023: Multimodal text-to-image models (DALL·E, Imagen, Stable Diffusion) appear, and RLHF is used to align LLM behavior.
- 2022 onward: Widespread public use, APIs, rapid democratization of tools.
Key concepts and terminology
- Sample: a single output from a generative model (e.g., an image or a sentence).
- Latent space: a lower-dimensional representation learned by a model (central to VAEs and to latent diffusion models).
- Conditioning: providing extra inputs to steer generation (e.g., a caption for an image, a prompt for text).
- Token: the discrete unit of text that a language model reads and produces (a subword, character, or word).
- Prompt: user-provided text that tells a model what to generate.
- Fine-tuning vs prompting: fine-tuning adjusts model weights on new data; prompting gives instructions at inference time (possibly with examples).
- RLHF (Reinforcement Learning from Human Feedback): technique used to make model outputs align better with human preferences.
Core model families (intuitions and differences)
Autoregressive / Transformer language models
- Idea: model the probability of a sequence by predicting each element conditioned on the previous ones: $p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$.
- Architectures: transformer decoders (GPT-style) are the dominant approach.
- Strengths: simple sampling, strong text coherence, and strong few-shot abilities at large scale.
- Use-cases: text generation, code generation, chatbots, summarization (often with encoder-decoder versions like T5).
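The chain-rule factorization above can be illustrated with a toy model. The following is a minimal sketch, not a real language model: a hand-written bigram table stands in for a Transformer, so each word is conditioned only on the previous word rather than on the full history. The table `BIGRAM` and the functions `sequence_log_prob` and `sample_sequence` are hypothetical names invented for this illustration.

```python
import math
import random

# Hypothetical toy "model": p(next word | previous word).
# A real Transformer conditions on the whole prefix; a bigram table
# only looks one step back, which keeps the example small.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.7, "dog": 0.3},
    "cat": {"sleeps": 0.8, "</s>": 0.2},
    "dog": {"sleeps": 0.6, "</s>": 0.4},
    "sleeps": {"</s>": 1.0},
}


def sequence_log_prob(words):
    """Sum log p(x_t | x_{t-1}) over the sequence (chain rule)."""
    prev, total = "<s>", 0.0
    for word in words + ["</s>"]:
        total += math.log(BIGRAM[prev][word])
        prev = word
    return total


def sample_sequence(rng, max_len=10):
    """Generate one word at a time, each conditioned on the previous one."""
    prev, out = "<s>", []
    for _ in range(max_len):
        choices = BIGRAM[prev]
        word = rng.choices(list(choices), weights=list(choices.values()))[0]
        if word == "</s>":  # end-of-sequence marker stops generation
            break
        out.append(word)
        prev = word
    return out
```

Calling `sample_sequence(random.Random(0))` draws one short sentence, and `sequence_log_prob(["the", "cat", "sleeps"])` returns the log of the product of the per-step probabilities, which is exactly the factorized p(x).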
Variational Autoencoders (VAEs)
- Idea: learn a mapping from observed data to a latent distribution and back; train by maximizing a lower bound on likelihood (ELBO).
- Strengths: explicit latent space, smooth interpolation, efficient sampling.
- Limitations: image samples are often blurrier than those of GANs or diffusion models unless the basic VAE is enhanced.
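Two ingredients commonly used when training a VAE with a Gaussian encoder can be sketched in a few lines: drawing a latent sample in a way that keeps it a simple function of the encoder outputs (the reparameterization trick), and the closed-form KL term of the ELBO against a standard normal prior. This is a sketch under those Gaussian assumptions; the function names are hypothetical, and the encoder and decoder networks themselves are not shown.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return sum(
        0.5 * (m * m + math.exp(lv) - 1.0 - lv)
        for m, lv in zip(mu, log_var)
    )
```

The KL term is zero when the encoder output already matches the prior (`mu = 0`, `log_var = 0`) and grows as the encoder moves away from it, which is what regularizes the latent space.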
Generative Adversarial Networks (GANs)
- Idea: two networks — a generator makes fake samples and a discriminator tries to tell fake from real; they train adversarially.
- Strengths: sharp, high-quality images; powerful for unconditional and conditional image generation.
- Limitations: training instability, mode collapse (the generator produces limited diversity), and difficulty scaling to other domains.
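The adversarial setup can be made concrete by looking only at the two losses. The sketch below assumes the discriminator outputs a probability that a sample is real, and it uses the common non-saturating generator loss; the networks are omitted and the function names are hypothetical.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples labelled 1, generated samples labelled 0.

    d_real and d_fake are lists of discriminator outputs in (0, 1).
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating loss: low when the discriminator rates fakes as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)
```

When the discriminator is confident (for example `d_real = [0.9, 0.8]`, `d_fake = [0.1, 0.2]`), its own loss is low and the generator loss is high; as the generator improves, the balance shifts the other way.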
Diffusion and score-based models
- Idea: gradually corrupt data with noise, then learn to reverse this process. Sampling runs the learned reverse diffusion to produce clean samples from noise.
- Strengths: state-of-the-art image generation quality, stable training, flexible conditioning (text-to-image), good likelihoods in some formulations.
- Notable: Latent Diffusion (Stable Diffusion) moves diffusion into a lower-dimensional latent space for computational efficiency.
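The "gradually corrupt data with noise" half of the idea can be sketched directly. The example below assumes a DDPM-style forward process with a linear noise schedule (the default values follow the range reported by Ho et al.); the function names are hypothetical, and the learned reverse (denoising) network is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_sample(x0, t, alpha_bars, rng):
    """Jump to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
        for x, e in zip(x0, eps)
    ]
    return x_t, eps  # eps is what a denoising network is typically trained to predict
```

With 1000 steps, `alpha_bars` starts near 1 (almost no noise) and decays close to 0 (almost pure noise), which is why sampling can start from random Gaussian noise and run the learned reverse process.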
Normalizing flows and autoregressive flows
- Idea: transform a simple distribution (e.g., Gaussian) into a complex one using invertible mappings with tractable Jacobians.
- Strengths: exact likelihoods, reversible mapping between data and latent.
- Limitations: architectural constraints needed for tractability; sample quality often lags behind diffusion models and GANs in high dimensions.
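The exact-likelihood property follows from the change-of-variables formula. A minimal sketch, assuming the simplest possible flow (a one-dimensional affine map applied to a standard normal latent), is shown below; real flows stack many learned invertible layers, and the function names here are hypothetical.

```python
import math


def standard_normal_log_pdf(z):
    """Log-density of N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def forward(z, scale, shift):
    """Latent -> data: x = scale * z + shift (invertible when scale != 0)."""
    return scale * z + shift


def inverse(x, scale, shift):
    """Data -> latent: z = (x - shift) / scale."""
    return (x - shift) / scale


def log_likelihood(x, scale, shift):
    """Change of variables: log p(x) = log p_z(z) + log |dz/dx|, with dz/dx = 1/scale."""
    z = inverse(x, scale, shift)
    return standard_normal_log_pdf(z) - math.log(abs(scale))
```

For this affine case the result coincides with the log-density of a Gaussian with mean `shift` and standard deviation `|scale|`, which gives a quick sanity check on the formula.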
Theoretical foundations (simple, non-technical)
Probability, likelihood, and sampling
- Generative models aim to approximate a data distribution p*(x).
- Once a model q(x) approximates p*(x), we sample from q(x) to generate new data.
- Training often means minimizing a divergence between p* and q; maximizing likelihood, for example, corresponds to minimizing the KL divergence KL(p* || q).
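The link between likelihood and divergence can be checked numerically on a tiny discrete example. The sketch below assumes a categorical model over three symbols and treats the empirical frequencies of the data as a stand-in for p*; all function names are hypothetical.

```python
import math
from collections import Counter


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


def fit_categorical(samples):
    """Maximum-likelihood estimate for a categorical model: relative frequencies."""
    counts = Counter(samples)
    return {x: c / len(samples) for x, c in counts.items()}


def average_nll(samples, q):
    """Average negative log-likelihood of the samples under model q."""
    return -sum(math.log(q[x]) for x in samples) / len(samples)


data = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
empirical = fit_categorical(data)             # plays the role of p*
uniform = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # a worse model q
```

Here the gap `average_nll(data, uniform) - average_nll(data, empirical)` equals `kl_divergence(empirical, uniform)` up to rounding, so lowering the negative log-likelihood and lowering the KL divergence are the same thing in this setting.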
Optimization objectives (intuitions)
- Maximum Likelihood Estimation (MLE): choose parameters to make training data as probable as possible under the model.
- Adversarial loss (GAN): train the generator to fool the discriminator, while the discriminator learns to distinguish real samples from generated ones.
- Variational inference (VAEs): use a tractable lower bound on the likelihood (the ELBO) to learn both the encoder and the decoder.
- Score matching / denoising score matching: learn the gradient of the log-density (the score) in order to reverse a noising process, as in diffusion models.
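For readers who want to see the four objectives above written out, one common textbook form of each is given below. The notation is an assumption chosen to match this guide: p* is the data distribution, q_θ is the model, p(z) is a fixed prior over latents, e_φ(z | x) is a VAE encoder, G and D are the GAN generator and discriminator, and s_θ is a score network. Other formulations exist.

```latex
% Maximum likelihood over training examples x_1, ..., x_N
\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log q_{\theta}(x_i)

% GAN minimax game between generator G and discriminator D
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p^{*}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]

% ELBO for a VAE with encoder e_{\phi}(z \mid x) and decoder q_{\theta}(x \mid z)
\log q_{\theta}(x) \;\ge\;
  \mathbb{E}_{z \sim e_{\phi}(z \mid x)}\left[\log q_{\theta}(x \mid z)\right]
  - \mathrm{KL}\left(e_{\phi}(z \mid x) \,\|\, p(z)\right)

% Denoising score matching with Gaussian noise of scale \sigma
\mathbb{E}_{x \sim p^{*},\; \tilde{x} \sim \mathcal{N}(x, \sigma^{2} I)}
  \left\| s_{\theta}(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^{2}} \right\|^{2}
```

In the last line, the term $-(\tilde{x} - x)/\sigma^{2}$ is the score of the Gaussian noise distribution, so the network is trained to point from the noisy sample back toward the clean one.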
Latent spaces and representation learning
- Many models learn a compact latent representation z of input x.
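- Latent spaces make manipulation easier: interpolating between samples, vector arithmetic on attributes (celebrated, but often noisy in practice), and attempts at disentangling factors of variation.
- Latent-based diffusion (e.g., Latent Diffusion) uses an encoder/decoder pair so that diffusion runs in a compact latent space rather than in pixel space, which reduces compute.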
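Interpolation in latent space is simple to sketch once latent vectors are available. The example below assumes two latent codes `z_a` and `z_b` have already been produced by some encoder; a trained decoder, which would turn each point on the path back into an image or sentence, is hypothetical and not included.

```python
def lerp(z_a, z_b, alpha):
    """Linear interpolation: alpha = 0 returns z_a, alpha = 1 returns z_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, num_points):
    """Evenly spaced latent vectors from z_a to z_b (inclusive); needs num_points >= 2."""
    return [lerp(z_a, z_b, i / (num_points - 1)) for i in range(num_points)]
```

Decoding every vector returned by `interpolation_path` is what produces the familiar smooth "morphing" sequences between two generated samples.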
How generative models are trained (practical view)
Datasets and preprocessing
- Large-scale, high-quality datasets improve realism but raise data governance concerns (copyright, privacy).
- Preprocessing includes tokenization for text, normalization for images, augmentation for robustness.
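Two of the preprocessing steps above can be shown in miniature. This is a deliberately simplified sketch: it tokenizes on whitespace, whereas production language models typically use subword tokenizers, and it rescales 8-bit pixel values to the range [-1, 1], which is one common convention among several. All names are hypothetical.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct lowercase word; id 0 is reserved."""
    vocab = {"<unk>": 0}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Turn text into token ids, mapping unseen words to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def normalize_pixels(pixels):
    """Map 8-bit values in [0, 255] to floats in [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]
```

For example, after `vocab = build_vocab(["The cat sat", "the dog sat"])`, encoding a sentence that contains a new word yields the `<unk>` id for that word, which is the situation subword tokenizers are designed to avoid.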
Compute and hardware considerations
- Training state-of-the-art generative models requires large GPU/TPU clusters and optimized software techniques such as mixed-precision arithmetic and distributed training.
- Inference can be optimized by quantization, batching, and model parallelism; model size and latency must be balanced for production.
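Quantization, mentioned above as an inference optimization, trades a little precision for smaller and faster models. The sketch below assumes the simplest scheme, symmetric per-tensor int8 quantization on a plain Python list; production libraries use more elaborate variants, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in quantized]
```

Each weight is stored in one byte instead of four, and the reconstruction error per weight is at most half a quantization step (`scale / 2`).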
Fine-tuning, few-shot and in-context learning
- Fine-tuning: update model weights on domain-specific data for specialized behavior.
- Few-shot/in-context: provide a few examples in the prompt so a large model can adapt without weight updates.
- Retrieval-Augmented Generation (RAG): combine a retriever (search over documents) and a generator to ground outputs in external knowledge.
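Few-shot prompting and RAG both come down to assembling the right text before the model is called. The sketch below assumes a toy retriever that ranks documents by word overlap with the query (real systems usually rely on embedding similarity or a search index) and stops at building the prompt string; the call to an actual language model is left out. All names and the prompt layout are hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace split, returned as a set of words."""
    return set(text.lower().split())


def retrieve(query, documents, top_k=1):
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents, examples=()):
    """Combine retrieved context and optional few-shot examples into one prompt."""
    parts = ["Context:"]
    parts += [f"- {doc}" for doc in retrieve(query, documents)]
    for question, answer in examples:  # few-shot demonstrations, no weight updates
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {query}\nA:")
    return "\n".join(parts)
```

The resulting string would be passed to a generator model, which then answers with the retrieved context and the worked examples in view.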
Safety, alignment, and RLHF (overview)
- RLHF uses human preferences to steer model behavior (e.g., avoid harmful content).
- Safety mitigations include content filters, detection models, red-teaming, and restricted APIs.
- Alignment remains an active research area — how to make models follow human values reliably.
How generative AI is used today (applications, ...