How do large language models work?
==================================
Contents
- Executive summary
- Historical context and evolution
- Key concepts and building blocks
- Transformer architecture in detail
- Training regimes and objectives
- Tokenization and input representation
- Scaling laws, compute, and infrastructure
- Inference: sampling, decoding, and context
- Capabilities, applications, and examples
- Limitations, failure modes, and risks
- Interpretability and mechanistic understanding
- Alignment, safety, and governance
- Current state of the field
- Future directions and open research questions
- Practical appendix: pseudocode and short examples
- Recommended readings
Executive summary
Large language models (LLMs) are neural networks trained on very large text corpora to predict, generate, or otherwise model language. Modern LLMs are predominantly based on the Transformer architecture and are trained using self-supervised objectives (e.g., next-token prediction or masked token reconstruction). By scaling model size, data, and compute, these models acquire broad linguistic, factual, and some reasoning capabilities. They operate by transforming discrete token sequences into continuous vector representations, processing those with layers of self-attention and feed-forward networks, and producing probability distributions over next tokens. Practical usage combines the pretrained model with fine-tuning, prompting, or retrieval to perform downstream tasks. While powerful, LLMs have significant limitations: they hallucinate, can reflect biases in training data, are resource-intensive, and raise important societal and safety concerns.
Historical context and evolution
- N-gram models (1950s–2000s): Statistical models using fixed-length histories to predict next words. Simple, interpretable, but limited context and heavy sparsity problems.
- Neural language models (1990s–2010s): Feed-forward and recurrent neural networks (Elman 1990; Bengio et al. 2003) introduced distributed word representations (embeddings) and handled generalization better than n-grams.
- Sequence-to-sequence and attention (2014–2017): Encoder-decoder RNNs with attention (Bahdanau et al., 2015; Luong et al., 2015) enabled translation and other mappings between variable-length sequences.
- The Transformer (Vaswani et al., 2017): Replaced recurrence with self-attention, yielding better parallelism and longer-range context. This architecture became the basis for most LLMs.
- BERT and masked models (2018): Bidirectional masked-language-model pretraining improved many NLP tasks via fine-tuning.
- GPT family and decoder-only LMs (2018–present): Autoregressive pretraining (GPT, GPT-2, GPT-3) scaled up model size and data, showing emergent few-shot and in-context learning abilities.
- Scaling laws and system engineering (2020s): Empirical scaling laws (Kaplan et al., 2020), optimization techniques, distributed training, and mixed precision enabled models with hundreds of billions to trillions of parameters.
- Alignment and safety (late 2010s–2020s): RL from human feedback (RLHF), instruction tuning, and guardrails to reduce harmful outputs.
Key concepts and building blocks
- Token: discrete atomic unit (word/subword/character) representing input text.
- Embedding: continuous vector representation for tokens.
- Self-attention: mechanism that computes pairwise interactions between tokens to produce context-aware representations.
- Multi-head attention: several attention heads run in parallel, each able to capture a different relational pattern.
- Positional encoding: injects information about token order, since self-attention alone treats its input as an unordered set (it is permutation-equivariant).
- Feed-forward network (FFN): per-position MLP that projects and transforms features.
- Layer normalization and residual connections: stabilize training in deep stacks.
- Pretraining vs fine-tuning: self-supervised learning on large corpora followed by task-specific adaptation.
- Autoregressive (causal) vs masked/denoising objectives: different training targets yielding different model behaviors.
- Softmax and logits: final linear layer maps hidden states to token logits; softmax converts logits to probabilities.
- Sampling/decoding: methods like greedy, beam search, top-k, top-p (nucleus), and temperature to generate tokens.
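The last two items can be made concrete with a small sketch. It uses only the Python standard library and a made-up four-token vocabulary; the function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) and the logit values are illustrative assumptions, not part of any particular library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits over a 4-token vocabulary
probs = softmax(logits, temperature=0.8)
greedy = max(range(len(probs)), key=lambda i: probs[i])  # greedy decoding picks the argmax
rng = random.Random(0)
token = sample(top_p_filter(probs, p=0.9), rng)  # nucleus sampling
```

Greedy decoding always returns the most probable token, while the filtered samplers trade some likelihood for diversity. Beam search is omitted here because it needs a model to score continuations.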
Transformer architecture in detail
The Transformer layer is the core of modern LLMs. A stacked sequence of these layers transforms an input token sequence into contextualized representations.
Main components (per layer):
- Multi-head self-attention (MHSA)
- Add & Norm (residual + layer norm)
- Position-wise feed-forward network (FFN)
- Add & Norm
Self-attention (single head) — formula
- Given queries Q and keys K (each of size sequence_length × d_k) and values V (size sequence_length × d_v):
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
- Q, K, V are linear projections of the input features X: Q = X W_Q, K = X W_K, V = X W_V.
- Multi-head: run h attention heads with lower-dimensional projections (commonly d_k = d_v = d_model / h), then concatenate the head outputs and project back to d_model.
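As a rough illustration of the formula above, here is a single-head sketch in plain Python. The helper names (`matmul`, `transpose`, `softmax_rows`, `attention`), the toy input `x`, and the identity projection matrices are assumptions chosen for readability. The optional `causal` flag is likewise an addition, showing how decoder-only models stop a position from attending to later ones.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask."""
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    if causal:
        # Position i may only attend to positions j <= i.
        scores = [[s if j <= i else float("-inf") for j, s in enumerate(row)]
                  for i, row in enumerate(scores)]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights

# Toy input: 3 tokens with 2 features each; identity projections keep the numbers readable.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
q, k, v = matmul(x, identity), matmul(x, identity), matmul(x, identity)

out, weights = attention(q, k, v)
causal_out, causal_weights = attention(q, k, v, causal=True)
```

Each row of `weights` is a probability distribution over input positions. With the causal mask, the first token can only attend to itself, so its output equals its own value vector.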
Why attention works
- Attention computes a weighted sum of value vectors, where the weights reflect pairwise compatibility between tokens (the dot product of queries and keys). Each output representation can therefore integrate evidence from any input position with learned, content-dependent weights. Any two positions interact within a single layer, whereas a recurrent network must carry information through every intermediate step, which makes long-range dependencies easier to capture.
Positional encoding
- Self-attention by itself does not see token order (permuting the inputs simply permutes the outputs), so Transformers add position information via fixed sinusoidal encodings (Vaswani et al., 2017), learned positional embeddings, or relative positional encodings.
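The fixed sinusoidal variant can be sketched in a few lines. The function name `sinusoidal_positions` and the small sizes are illustrative. The formula is the one from Vaswani et al. (2017): even dimensions use a sine and odd dimensions a cosine, with wavelengths growing geometrically up to a base of 10000.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positions(seq_len=8, d_model=4)
# In the original Transformer these vectors are added to the token embeddings elementwise.
```

Learned positional embeddings would replace this table with trainable parameters of the same shape.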
Depth, width, and parameterization
- "Depth" = number of layers; "width" = hidden dimension d_model; "heads" = number of attention heads; "FFN inner dimension" is often 4× d_model. Model parameters: token embedding table + layer parameters + final linear head.
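A back-of-the-envelope parameter count follows from these quantities. The sketch below is an approximation under stated assumptions: it counts only the attention and FFN weight matrices plus the embedding table, and ignores biases, layer-norm parameters, and any learned positional embeddings. The function name and the example configuration are hypothetical.

```python
def approx_param_count(n_layers, d_model, vocab_size, ffn_mult=4, tied_embeddings=True):
    """Rough parameter count for a Transformer stack (weight matrices only)."""
    attn = 4 * d_model * d_model               # W_Q, W_K, W_V and the output projection
    ffn = 2 * d_model * (ffn_mult * d_model)   # up-projection and down-projection
    per_layer = attn + ffn                     # 12 * d_model^2 when ffn_mult == 4
    embed = vocab_size * d_model               # token embedding table
    head = 0 if tied_embeddings else vocab_size * d_model  # separate output head if untied
    return embed + n_layers * per_layer + head

# Hypothetical small configuration: 12 layers, d_model = 768, 50,257-token vocabulary.
small = approx_param_count(n_layers=12, d_model=768, vocab_size=50257)
print(f"{small:,}")  # 123,532,032
```

Because the per-layer term grows with the square of d_model, widening a model adds parameters much faster than deepening it by the same factor.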
Training regimes and objectives
- Autoregressive (causal) language modeling:
  - Objective: maximize the likelihood of each next token given the previous tokens.
  - Used in GPT-like models.
- Masked language modeling (MLM):
  - A random subset of tokens is masked; the model predicts the masked tokens using both left and right context.
  - Used in BERT-like models.
- Denoising objectives:
  - Span corruption (T5): mask spans and predict them using sequence-to-sequence mapping.
- Other variants: contrastive objectives, permutation LM (XLNet), and ELECTRA’s replaced-token detection.
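To make the first two objectives concrete, here is a minimal sketch of how training examples and the loss might be formed. All names (`causal_lm_pairs`, `next_token_loss`, `mask_tokens`, `MASK_ID`) and the toy token ids are assumptions. The masking routine is deliberately simplified and does not reproduce BERT's exact recipe.

```python
import math
import random

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def causal_lm_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the sequence shifted left by one."""
    return token_ids[:-1], token_ids[1:]

def next_token_loss(logits_per_position, targets):
    """Mean negative log-likelihood of the target tokens (cross-entropy)."""
    nll = [-log_softmax(logits)[t] for logits, t in zip(logits_per_position, targets)]
    return sum(nll) / len(nll)

def mask_tokens(token_ids, mask_id, mask_prob, rng):
    """Simplified MLM corruption: masked positions get a label, others get None."""
    inputs, labels = [], []
    for t in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)
            labels.append(t)
        else:
            inputs.append(t)
            labels.append(None)
    return inputs, labels

token_ids = [3, 0, 2, 1]                    # toy sequence over a 4-token vocabulary
inputs, targets = causal_lm_pairs(token_ids)
uniform_logits = [[0.0] * 4 for _ in targets]  # stand-in for an untrained model's output
loss = next_token_loss(uniform_logits, targets)

MASK_ID = 4                                 # hypothetical id reserved for the mask token
masked_inputs, labels = mask_tokens(token_ids, MASK_ID, mask_prob=0.5, rng=random.Random(0))
```

A model that assigns uniform probability over a 4-token vocabulary has a loss of ln 4 ≈ 1.386; training drives the loss below that baseline.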
Pretraining data
- Massive, diverse text corpora: web crawls (Common Crawl), books, Wikipedia, code, forums, news, curated datasets. Data quality, deduplication, and filtering significantly affect outcomes.
Fine-tuning and instruction tuning
- Fine-tuning: supervised training on labeled datasets to adapt to specific tasks.
- Instruction tuning: training on many human-written instruction-response pairs to improve instruction following.
- RLHF: reinforcement learning in which a reward model trained on human preference judgments steers generation toward desired behavior (e.g., helpfulness and safety).
Optimization and practical training tricks
- Optimizers: Adam/AdamW variants with weight decay.
- Learning rate schedules: linear warmup + decay.
- Regularization and stabilization: dropout, layer-wise learning rates, gradient clipping.
- Mixed precision (FP16/BF16) and gradient checkpointing to reduce memory.
- Distributed training: data parallelism, model parallelism (tensor + pipeline), ZeRO optimizations.
- Checkpointing and stability: periodic checkpoints, plus techniques to detect and recover from divergence (e.g., loss spikes) at large scale.
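Two of these tricks are simple enough to sketch directly: a warmup-then-decay learning rate schedule and gradient clipping by global norm. The function names and hyperparameter values below are illustrative assumptions. The linear decay shape is one common choice among several (cosine decay is another).

```python
import math

def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then linear decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(min_lr, peak_lr - (peak_lr - min_lr) * min(1.0, progress))

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

schedule = [lr_at_step(s, peak_lr=1e-3, warmup_steps=10, total_steps=100) for s in range(101)]
clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Warmup avoids large, noisy updates while the optimizer's moment estimates are still unreliable, and clipping bounds the size of any single update when a batch produces an unusually large gradient.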
Tokenization and input representation
- Byte-Pair Encoding (BPE): builds a subword vocabulary by repeatedly merging the most frequent pair of adjacent symbols.
- SentencePiece Unigram: probabilistic subword model.
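- WordPiece: BPE-like subword algorithm used in BERT.
- Token vocabulary size ranges from ~30k to >100k, a tradeoff between token granularity and sequence length (larger vocabularies give shorter sequences but a larger embedding table).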
- Special tokens: BOS/EOS, padding, mask, task-specific markers.
- Weight tying: many models share one token embedding matrix between the input embedding layer and the output projection that produces logits.
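The BPE merge procedure can be illustrated with a toy trainer. This is a simplified sketch: it works on characters rather than bytes, omits end-of-word markers, and breaks ties between equally frequent pairs by an arbitrary but deterministic rule. The function names and the four-word corpus are assumptions made for the example.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge rules from a {word: frequency} mapping."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # toy word frequencies
merges, segmented = learn_bpe(corpus, num_merges=3)
print(merges)  # [('s', 't'), ('e', 'st'), ('o', 'w')]
```

After three merges the frequent suffix "est" has become a single symbol, so "newest" is segmented as n, e, w, est. The more merges are learned, the larger the vocabulary and the shorter the resulting token sequences.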
Scaling laws, compute, and infrastructure
- Empirical scaling laws (Kaplan et al., 2020): test loss falls predictably, approximately as a power law in model size, dataset size, and compute, provided none of the other factors is the bottleneck; each further doubling therefore buys a smaller absolute improvement.
- Compute: training a modern LLM can require thousands to tens of thousands of GPU-months (or TPU pods).
- Hardware: GPUs (A100), TPUs, ...