
How do large language models work?

Executive summary

  • Large language models (LLMs) are large neural networks (mostly Transformer-based) trained on massive text corpora with self-supervised objectives (e.g., next-token prediction or masked reconstruction).
  • They convert token sequences to continuous vectors, process them with stacked self-attention and feed-forward layers, and produce probability distributions over tokens.
  • Scaling model size, data, and compute yields broad linguistic, factual, and limited reasoning abilities.
  • Practical use adds fine-tuning, instruction tuning, retrieval, or RL from human feedback.
  • Key caveats: hallucinations, bias, high compute cost, and important safety/governance challenges.

Historical context (brief)

  • N-grams → neural LMs (embeddings, RNNs) → attention and seq2seq breakthroughs.
  • Transformer (2017) replaced recurrence with self-attention, enabling parallelism and long-range context.
  • BERT (masked) and GPT (autoregressive) families demonstrated fine-tuning and in-context/few-shot learning at scale.
  • Recent advances: empirical scaling laws, distributed training, RLHF, and instruction tuning.

Key concepts & building blocks

  • Token: discrete unit (subword/word/char).
  • Embedding: continuous vector per token; often tied to output logits.
  • Self-attention / Multi-head attention: content-dependent weighted aggregation across positions.
  • Positional encodings: add order information to attention.
  • FFN, residuals, layer norm: per-position transforms and stability mechanisms.
  • Objectives: autoregressive, masked/denoising, contrastive variants, etc.

Transformer architecture (core)

  • Layer pattern: Multi-Head Self-Attention → Add & Norm → Feed-Forward → Add & Norm.
  • Attention computes softmax(QK^T / sqrt(d_k))V to mix information from all positions.
  • Model size characterized by depth (layers), width (hidden dim), heads, and FFN size; parameters include embeddings + layer weights + final head.

Training regimes, data, and optimization

  • Pretraining on massive, diverse corpora (web, books, code, Wikipedia); data quality and deduplication matter.
  • Fine-tuning and instruction tuning adapt models to tasks; RLHF aligns outputs to human preferences.
  • Optimizers & tricks: Adam/AdamW, warmup schedules, mixed precision, gradient checkpointing, and distributed parallelism (tensor/pipeline/ZeRO).

Tokenization & input representation

  • Common methods: BPE, WordPiece, SentencePiece (Unigram).
  • Vocabulary sizes balance granularity and sequence length.
  • Special tokens (BOS/EOS/mask/pad) and weight tying are common.

Scaling laws, compute & infrastructure

  • Performance follows empirical power-law scaling with model size, dataset size, and compute (with diminishing returns).
  • Training requires large GPU/TPU fleets, high-bandwidth interconnects, and brings significant monetary and environmental costs.
  • Efficiency trends: MoE sparsity, distillation, LoRA and other parameter-efficient methods.

Inference: sampling, decoding & context

  • Fixed context window limits usable history; larger windows enable longer interactions.
  • Decoding strategies: greedy, beam search, sampling with temperature, top-k, top-p (nucleus).
  • In-context learning: conditioning on examples in the prompt to perform new tasks without weight updates.

Capabilities & applications

  • Strong at generation, summarization, translation, QA, code generation, and many NLP tasks; some emergent chain-of-thought reasoning in very large models.
  • Common applications: chatbots, content creation, search augmentation (RAG), programming assistants, education, and research drafting.

Limitations, failure modes & risks

  • Hallucinations and unverifiable outputs; softmax probabilities are poorly calibrated as confidence.
  • Bias, toxicity, privacy leaks via memorization, and vulnerability to adversarial prompts.
  • Resource centralization: capability concentration in well-funded organizations.

Interpretability & mechanistic understanding

  • Tools: attention analysis, probing classifiers, and circuit-level mechanistic interpretability.
  • Progress made on specific circuits (e.g., induction heads), but overall comprehension of high-dimensional models remains incomplete.

Alignment, safety & governance

  • Approaches: dataset curation, safe instruction datasets, RLHF, retrieval/grounding, safety filters, and red-teaming.
  • Governance: model cards, transparency, regulatory discussion, and deployment controls are active areas of work.

Current state (2024–2025) & trends

  • Multiple production LLMs at 100B+ scale; rapid innovation in multimodality and retrieval-augmented systems.
  • Greater democratization via open weights and parameter-efficient tuning, but compute barriers still matter.

Future directions & open research questions

  • Multimodal unified models, very long-context memory, efficient/green training, better reasoning and compositionality.
  • Scaled mechanistic interpretability, robust alignment methods, privacy-preserving personalization, and embodied agent integration.

Practical appendix & developer tips (high level)

  • Pseudocode: scaled dot-product attention and simple autoregressive generation loops illustrate core computations.
  • For applications: prefer pretrained APIs for fast prototyping; use retrieval/grounding for factual tasks; apply instruction tuning or RLHF for alignment; monitor for toxicity and privacy leaks.
  • Evaluate with both automatic metrics and human judgment; maintain human review for critical outputs.

Recommended readings

  • Vaswani et al., “Attention is All You Need” (2017)
  • Devlin et al., “BERT” (2018); Radford et al. and Brown et al., “GPT-3” (2020)
  • Kaplan et al., “Scaling Laws…” (2020); Ouyang et al., “RLHF/InstructGPT” (2022); Hoffmann et al., “Compute-Optimal LMs” (2022)

Conclusion

LLMs combine Transformers, massive data, and large compute to produce powerful, general language models. They enable many applications but bring technical, ethical, and societal challenges. Ongoing work focuses on efficiency, interpretability, robust alignment, and responsible governance.

Deep Article

How do large language models work?
=================================

Contents


  • Executive summary
  • Historical context and evolution
  • Key concepts and building blocks
  • Transformer architecture in detail
  • Training regimes and objectives
  • Tokenization and input representation
  • Scaling laws, compute, and infrastructure
  • Inference: sampling, decoding, and context
  • Capabilities, applications, and examples
  • Limitations, failure modes, and risks
  • Interpretability and mechanistic understanding
  • Alignment, safety, and governance
  • Current state of the field
  • Future directions and open research questions
  • Practical appendix: pseudocode and short examples
  • Recommended readings

Executive summary


Large language models (LLMs) are neural networks trained on very large text corpora to predict, generate, or otherwise model language. Modern LLMs are predominantly based on the Transformer architecture and are trained using self-supervised objectives (e.g., next-token prediction or masked token reconstruction). By scaling model size, data, and compute, these models acquire broad linguistic, factual, and some reasoning capabilities. They operate by transforming discrete token sequences into continuous vector representations, processing those with layers of self-attention and feed-forward networks, and producing probability distributions over next tokens. Practical usage combines the pretrained model with fine-tuning, prompting, or retrieval to perform downstream tasks. While powerful, LLMs have significant limitations: they hallucinate, can reflect biases in training data, are resource-intensive, and raise important societal and safety concerns.

Historical context and evolution


  • N-gram models (1950s–2000s): Statistical models using fixed-length histories to predict next words. Simple and interpretable, but limited in context and prone to data sparsity.
  • Neural language models (1990s–2010s): Feed-forward and recurrent neural networks (Elman 1990; Bengio et al. 2003) introduced distributed word representations (embeddings) and generalized better than n-grams.
  • Sequence-to-sequence and attention (2014–2017): Encoder-decoder RNNs with attention (Bahdanau et al., 2015; Luong et al.) enabled translation and mapping between variable-length sequences.
  • The Transformer (Vaswani et al., 2017): Replaced recurrence with self-attention, yielding better parallelism and longer-range context. This architecture became the basis for most LLMs.
  • BERT and masked models (2018): Bidirectional masked-language-model pretraining improved many NLP tasks via fine-tuning.
  • GPT family and decoder-only LMs (2018–present): Autoregressive pretraining (GPT, GPT-2, GPT-3) scaled up model size and data, showing emergent few-shot and in-context learning abilities.
  • Scaling laws and system engineering (2020s): Empirical scaling laws (Kaplan et al., 2020), optimization techniques, distributed training, and mixed precision enabled models with hundreds of billions to trillions of parameters.
  • Alignment and safety (late 2010s–2020s): RL from human feedback (RLHF), instruction tuning, and guardrails to reduce harmful outputs.

Key concepts and building blocks


  • Token: discrete atomic unit (word/subword/character) representing input text.
  • Embedding: continuous vector representation for tokens.
  • Self-attention: mechanism that computes pairwise interactions between tokens to produce context-aware representations.
  • Multi-head attention: parallel attention heads, each capturing different relational patterns.
  • Positional encoding: injects information about token order, since self-attention by itself is order-agnostic (permuting the input tokens just permutes the outputs).
  • Feed-forward network (FFN): per-position MLP that projects and transforms features.
  • Layer normalization and residual connections: stabilize training in deep stacks.
  • Pretraining vs fine-tuning: self-supervised learning on large corpora followed by task-specific adaptation.
  • Autoregressive (causal) vs masked/denoising objectives: different training targets yielding different model behaviors.
  • Softmax and logits: final linear layer maps hidden states to token logits; softmax converts logits to probabilities.
  • Sampling/decoding: methods like greedy, beam search, top-k, top-p (nucleus), and temperature to generate tokens.
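
The last two items can be made concrete with a small sketch. It assumes NumPy and a hand-written logits vector standing in for a model's output; the function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample_token`) are illustrative and do not come from any particular library.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = np.argsort(probs)[-k:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]                   # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # number of tokens to keep
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def sample_token(logits, temperature=1.0, k=None, p=None, rng=None):
    """Draw one token id after optional top-k and top-p (nucleus) filtering."""
    rng = np.random.default_rng() if rng is None else rng
    probs = softmax(logits, temperature)
    if k is not None:
        probs = top_k_filter(probs, k)
    if p is not None:
        probs = top_p_filter(probs, p)
    return int(rng.choice(len(probs), p=probs))

# Hypothetical logits over a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
greedy = int(np.argmax(logits))               # greedy decoding: always the top token
sampled = sample_token(logits, temperature=0.8, p=0.9)
```

Greedy decoding is deterministic, while the sampling path trades determinism for diversity. Real systems apply the same steps to a vocabulary of tens of thousands of tokens at every generation step.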

Transformer architecture in detail


The Transformer layer is the core of modern LLMs. A stacked sequence of these layers transforms an input token sequence into contextualized representations.

Main components (per layer):

  1. Multi-head self-attention (MHSA)
  2. Add & Norm (residual + layer norm)
  3. Position-wise feed-forward network (FFN)
  4. Add & Norm

Self-attention (single head) — formula

  • Given queries Q and keys K (each of size sequence_length × d_k) and values V (size sequence_length × d_v):

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

  • Q, K, V are linear projections of input features X: Q = X W_Q, K = X W_K, V = X W_V.
  • Multi-head: run h attention heads with lower dimensional projections, then concat and project back.
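
A minimal single-head sketch of the formula above, assuming NumPy and randomly initialized projection matrices in place of learned weights. The optional causal mask is an addition used by decoder-only (autoregressive) models so that a position cannot attend to later positions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_queries, n_keys)
    if causal:
        n_q, n_k = scores.shape
        future = np.triu(np.ones((n_q, n_k), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)     # hide later positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V, weights

# Toy input: 4 tokens with d_model = d_k = 8 (sizes chosen for illustration).
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V, causal=True)
```

Each row of `weights` sums to 1, and with the causal mask the first token can only attend to itself. A multi-head layer would run several such heads on lower-dimensional projections and concatenate their outputs.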

Why attention works

  • Attention computes a weighted sum of value vectors where weights reflect pairwise compatibility between tokens (via dot product of queries and keys). This allows each output token representation to integrate evidence from any input position with learned, content-dependent weights—capturing long-range dependencies more effectively than recurrence.

Positional encoding

  • Because self-attention by itself ignores token order, Transformers add position information via fixed sinusoidal encodings (Vaswani et al., 2017), learned positional embeddings, or relative positional encodings.
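
As an illustration of the fixed sinusoidal variant, the sketch below follows the definition in Vaswani et al. (2017): even dimensions use a sine and odd dimensions a cosine of pos / 10000^(2i / d_model). It assumes NumPy and an even `d_model`; the function name is illustrative.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of fixed sinusoidal position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the indices 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=16, d_model=8)
# The encoding is added to the token embeddings before the first layer:
# X = token_embeddings + pe[:sequence_length]
```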

Depth, width, and parameterization

  • "Depth" = number of layers; "width" = hidden dimension d_model; "heads" = number of attention heads; the FFN inner dimension is often 4 × d_model. Model parameters: token embedding table + layer parameters + final linear head.
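
These quantities give a rough back-of-the-envelope parameter count. The sketch below is an approximation under stated assumptions: it counts only the four attention projection matrices and the two FFN matrices per layer plus the embedding table, and it ignores biases, layer-norm parameters, and positional embeddings. The function name and the example configuration are illustrative.

```python
def approx_param_count(n_layers, d_model, vocab_size, d_ff=None, tied_head=True):
    """Approximate parameter count of a Transformer stack (weights only)."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    attention = 4 * d_model * d_model   # W_Q, W_K, W_V, and the output projection
    ffn = 2 * d_model * d_ff            # up-projection and down-projection
    embeddings = vocab_size * d_model   # token embedding table
    head = 0 if tied_head else vocab_size * d_model  # separate output head if untied
    return n_layers * (attention + ffn) + embeddings + head

# A small configuration: 12 layers, width 768, ~50k-token vocabulary.
total = approx_param_count(n_layers=12, d_model=768, vocab_size=50257)
print(total)  # 123532032
```

With the default 4 × d_model FFN, each layer contributes about 12 × d_model² weights, so layer parameters grow linearly with depth and quadratically with width.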

Training regimes and objectives


  • Autoregressive (causal) language modeling: maximize the likelihood of the next token given the previous tokens. Used in GPT-like models.
  • Masked language modeling (MLM): random tokens are masked and the model predicts them using both left and right context. Used in BERT-like models.
  • Denoising objectives: for example span corruption (T5), which masks spans and predicts them with a sequence-to-sequence mapping.
  • Other variants: contrastive objectives, permutation LM (XLNet), and ELECTRA’s replaced-token detection.
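
The first two objectives can be sketched in a few lines. This assumes NumPy, a logits array of shape (sequence_length, vocab_size) standing in for model output, and a hypothetical `mask_id`. The masking function is a simplification: it replaces each selected token with the mask token, using BERT's 15% selection rate, and omits BERT's additional random-replacement and keep-as-is cases.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from position t."""
    logits = np.asarray(logits, dtype=np.float64)[:-1]  # last position has no target
    targets = np.asarray(token_ids)[1:]                 # targets are shifted by one
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def mask_tokens(token_ids, mask_id, mask_prob=0.15, rng=None):
    """Randomly replace tokens with mask_id; return corrupted ids and the mask."""
    rng = np.random.default_rng() if rng is None else rng
    token_ids = np.asarray(token_ids)
    is_masked = rng.random(token_ids.shape) < mask_prob
    corrupted = np.where(is_masked, mask_id, token_ids)
    return corrupted, is_masked

# Toy example: 4 tokens, vocabulary of 5, uniform (all-zero) logits.
tokens = [0, 1, 2, 3]
uniform_loss = next_token_loss(np.zeros((4, 5)), tokens)  # equals log(5)
corrupted, is_masked = mask_tokens(tokens, mask_id=4, rng=np.random.default_rng(0))
```

A model that assigns equal probability to every token has loss log(vocab_size); training drives the loss below that baseline. In MLM the loss is computed only at positions where `is_masked` is true.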

Pretraining data

  • Massive, diverse text corpora: web crawls (Common Crawl), books, Wikipedia, code, forums, news, curated datasets. Data quality, deduplication, and filtering significantly affect outcomes.

Fine-tuning and instruction tuning

  • Fine-tuning: supervised training on labeled datasets to adapt to specific tasks.
  • Instruction tuning: training on many human-written instruction-response pairs to improve instruction following.
  • RLHF: reinforcement learning from human feedback. A reward model is trained on human preference comparisons, and the language model is then optimized against it to steer generation toward desired behavior (e.g., helpfulness + safety).

Optimization and practical training tricks

  • Optimizers: Adam/AdamW variants with weight decay.
  • Learning rate schedules: linear warmup + decay.
  • Regularization and stability: dropout, layer-wise learning rates, gradient clipping.
  • Mixed precision (FP16/BF16) and gradient checkpointing to reduce memory.
  • Distributed training: data parallelism, model parallelism (tensor + pipeline), ZeRO optimizations.
  • Checkpointing and stability: techniques to prevent divergence at large scale.
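
As one example from this list, a linear warmup followed by linear decay can be written as a pure function of the step number. This is a sketch in plain Python; the function name, argument names, and the example hyperparameters are illustrative, and real training runs often use other decay shapes such as cosine.

```python
def lr_at_step(step, peak_lr, warmup_steps, total_steps, final_lr=0.0):
    """Learning rate with linear warmup to peak_lr, then linear decay to final_lr."""
    if step < warmup_steps:
        # Ramp up from peak_lr / warmup_steps to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return peak_lr + (final_lr - peak_lr) * progress

# Hypothetical run: 100 steps, 10 of them warmup, peak learning rate 1e-3.
schedule = [lr_at_step(s, peak_lr=1e-3, warmup_steps=10, total_steps=100)
            for s in range(101)]
```

The warmup phase keeps early updates small while the optimizer's moment estimates are still noisy, which is one reason it is commonly paired with Adam-style optimizers.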

Tokenization and input representation


  • Byte-Pair Encoding (BPE): subword segmentation by frequency merges.
  • SentencePiece Unigram: probabilistic subword model.
  • WordPiece: a BPE-like subword method that chooses merges by likelihood gain rather than raw frequency; used in BERT.
  • Token vocabulary size ranges from ~30k to >100k—tradeoff between granularity and sequence length.
  • Special tokens: BOS/EOS, padding, mask, task-specific markers.
  • Weight tying: many models share a single matrix between the input token embeddings and the output projection that produces logits.
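
The frequency-merge idea behind BPE can be shown on a toy corpus. This is a character-level sketch in plain Python with illustrative function names; production tokenizers typically work on bytes, handle word boundaries and pre-tokenization, and are far more optimized.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

words = ["low", "low", "lower", "lowest", "newer", "newer"]
merges = learn_bpe_merges(words, num_merges=3)
pieces = encode("lower", merges)
```

On this corpus the three learned merges build "lo", then "low", then "er", so "lower" is segmented into "low" and "er". More merges yield a larger vocabulary and shorter token sequences, which is the tradeoff noted above.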

Scaling laws, compute, and infrastructure


  • Empirical scaling laws (Kaplan et al., 2020): model performance improves predictably as a power-law function of model size, dataset size, and compute, up to limits (and with diminishing returns).
  • Compute: training a modern LLM can require thousands to tens of thousands of GPU-months (or TPU pods).
  • Hardware: GPUs (A100), TPUs, ...
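
The power-law form of the scaling laws can be illustrated numerically. Kaplan et al. (2020) model loss as a function of parameter count N as L(N) = (N_c / N)^alpha when data and compute are not the bottleneck. The constants in the sketch below are hypothetical placeholders chosen only to show the shape of the curve, not fitted values.

```python
def power_law_loss(n_params, n_c, alpha):
    """Loss predicted by L(N) = (N_c / N) ** alpha for a model with N parameters."""
    return (n_c / n_params) ** alpha

# Hypothetical constants, for illustration only.
N_C, ALPHA = 1e14, 0.08

sizes = [1e8, 1e9, 1e10, 1e11]
losses = [power_law_loss(n, N_C, ALPHA) for n in sizes]

# Every 10x increase in parameters multiplies the loss by the same factor,
# 10 ** -ALPHA, so the absolute improvement shrinks at each step.
ratios = [b / a for a, b in zip(losses, losses[1:])]
gains = [a - b for a, b in zip(losses, losses[1:])]
```

A constant ratio per tenfold increase is what makes performance predictable across scales, and the shrinking entries in `gains` are the diminishing returns mentioned above.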
