How do large language models work?
Contents
- Executive summary
- Historical context and evolution
- Key concepts and building blocks
- Transformer architecture in detail
- Training regimes and objectives
- Tokenization and input representation
- Scaling laws, compute, and infrastructure
- Inference: sampling, decoding, and context
- Capabilities, applications, and examples
- Limitations, failure modes, and risks
- Interpretability and mechanistic understanding
- Alignment, safety, and governance
- Current state of the field
- Future directions and open research questions
- Practical appendix: pseudocode and short examples
- Recommended readings
Executive summary
Large language models (LLMs) are neural networks trained on very large text corpora to predict, generate, or otherwise model language. Modern LLMs are predominantly based on the Transformer architecture and are trained using self-supervised objectives (e.g., next-token prediction or masked token reconstruction). By scaling model size, data, and compute, these models acquire broad linguistic, factual, and some reasoning capabilities. They operate by transforming discrete token sequences into continuous vector representations, processing those with layers of self-attention and feed-forward networks, and producing probability distributions over next tokens. Practical usage combines the pretrained model with fine-tuning, prompting, or retrieval to perform downstream tasks. While powerful, LLMs have significant limitations: they hallucinate, can reflect biases in training data, are resource-intensive, and raise important societal and safety concerns.
Historical context and evolution
- N-gram models (1950s–2000s): Statistical models that predict the next word from a fixed-length history. Simple and interpretable, but limited to short contexts and hampered by data sparsity.
- Neural language models (1990s–2010s): Feed-forward and recurrent neural networks (Elman 1990; Bengio et al. 2003) introduced distributed word representations (embeddings) and handled generalization better than n-grams.
- Sequence-to-sequence and attention (2014–2017): Encoder-decoder RNNs with attention (Bahdanau et al., 2015; Luong et al., 2015) enabled translation and other mappings between variable-length sequences.
- The Transformer (Vaswani et al., 2017): Replaced recurrence with self-attention, yielding better parallelism and longer-range context. This architecture became the basis for most LLMs.
- BERT and masked models (2018): Bidirectional masked-language-model pretraining improved many NLP tasks via fine-tuning.
- GPT family and decoder-only LMs (2018–present): Autoregressive pretraining (GPT, GPT-2, GPT-3) scaled up model size and data, showing emergent few-shot and in-context learning abilities.
- Scaling laws and system engineering (2020s): Empirical scaling laws (Kaplan et al., 2020), optimization techniques, distributed training, and mixed precision enabled models with hundreds of billions to trillions of parameters.
- Alignment and safety (late 2010s–2020s): RL from human feedback (RLHF), instruction tuning, and guardrails to reduce harmful outputs.
Key concepts and building blocks
- Token: discrete atomic unit (word/subword/character) representing input text.
- Embedding: continuous vector representation for tokens.
- Self-attention: mechanism that computes pairwise interactions between tokens to produce context-aware representations.
- Multi-head attention: several attention heads run in parallel, each capturing different relational patterns.
- Positional encoding: injects information about token order, since self-attention alone treats its input as an unordered set (permuting the input tokens simply permutes the outputs).
- Feed-forward network (FFN): per-position MLP that projects and transforms features.
- Layer normalization and residual connections: stabilize training in deep stacks.
- Pretraining vs fine-tuning: self-supervised learning on large corpora followed by task-specific adaptation.
- Autoregressive (causal) vs masked/denoising objectives: different training targets yielding different model behaviors.
- Softmax and logits: final linear layer maps hidden states to token logits; softmax converts logits to probabilities.
- Sampling/decoding: methods like greedy, beam search, top-k, top-p (nucleus), and temperature to generate tokens.
Transformer architecture in detail
The Transformer layer is the core of modern LLMs. A stacked sequence of these layers transforms an input token sequence into contextualized representations.
Main components (per layer):
- Multi-head self-attention (MHSA)
- Add & Norm (residual + layer norm)
- Position-wise feed-forward network (FFN)
- Add & Norm
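The four components above can be read as two residual sub-blocks. The following is a minimal NumPy sketch, not a reference implementation. It assumes the post-layer-norm ordering implied by "Add & Norm", a ReLU feed-forward network, and no learned layer-norm gain or bias. `attention_fn` and `mean_attention` are hypothetical placeholders standing in for a real self-attention implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean and unit variance.
    # Learned gain and bias are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise MLP: expand to the inner dimension, apply ReLU, project back.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_layer(x, attention_fn, W1, b1, W2, b2):
    # x: [seq_len, d_model]; attention_fn maps [seq_len, d_model] -> [seq_len, d_model].
    x = layer_norm(x + attention_fn(x))                  # self-attention, then Add & Norm
    x = layer_norm(x + feed_forward(x, W1, b1, W2, b2))  # FFN, then Add & Norm
    return x

def mean_attention(h):
    # Stand-in for attention: every position receives the mean of all positions.
    return np.repeat(h.mean(axis=0, keepdims=True), len(h), axis=0)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
out = transformer_layer(x, mean_attention, W1, b1, W2, b2)
print(out.shape)
```

The output has the same shape as the input, which is what allows many such layers to be stacked.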
Self-attention (single head) — formula
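- Given queries Q and keys K (each of size sequence_length × d_k) and values V (size sequence_length × d_v): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with the softmax applied row-wise.
- Q, K, V are linear projections of the input features X (size sequence_length × d_model): Q = X W_Q, K = X W_K, V = X W_V.
- Multi-head: run h attention heads with lower-dimensional projections (typically d_k = d_v = d_model / h), then concatenate the head outputs and project back to d_model.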
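The formula and the multi-head recipe above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: square projection matrices with d_k = d_v = d_model / n_heads, no biases, and an optional causal mask of the kind used in autoregressive models. The function name `multi_head_attention` and its signature are hypothetical and not taken from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads, causal=True):
    # X: [seq_len, d_model]; Wq, Wk, Wv, Wo: [d_model, d_model].
    seq_len, d_model = X.shape
    d_k = d_model // n_heads  # per-head dimension

    def split_heads(M):
        # [seq_len, d_model] -> [n_heads, seq_len, d_k]
        return M.reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # [n_heads, seq_len, seq_len]
    if causal:
        # Block attention to future positions (strictly above the diagonal).
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    heads = weights @ V                 # [n_heads, seq_len, d_k]
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
```

With `causal=True`, the first position can only attend to itself, so its attention weights on all later positions are exactly zero.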
Why attention works
- Attention computes a weighted sum of value vectors, where the weights reflect pairwise compatibility between tokens (via the dot product of queries and keys). Each output representation can therefore integrate evidence from any input position with learned, content-dependent weights. Distant tokens interact in a single step rather than through a long chain of recurrent updates, which makes long-range dependencies easier to capture.
Positional encoding
- Since self-attention has no built-in notion of order (permuting the input tokens simply permutes the outputs), Transformers add position information via fixed sinusoidal encodings (Vaswani et al., 2017), learned positional embeddings, or relative positional encodings.
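As an illustration of the fixed sinusoidal option, here is a small pure-Python sketch of the encoding from Vaswani et al. (2017), where columns 2i and 2i+1 hold the sine and cosine of pos / 10000^(2i / d_model). The function name `sinusoidal_positions` is a hypothetical helper chosen for this example.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    # Returns a seq_len x d_model table; even columns use sine, odd columns cosine.
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            # Columns 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positions(seq_len=4, d_model=6)
```

Each row of the table is added to the token embedding at that position before the first layer.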
Depth, width, and parameterization
- "Depth" = number of layers; "width" = hidden dimension d_model; "heads" = number of attention heads; "FFN inner dimension" often 4× d_model. Model parameters: token embedding table + layer parameters + final linear head.
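The parameter breakdown above can be turned into a rough estimate. The sketch below assumes a standard dense layer with four d_model × d_model attention matrices and a two-matrix FFN. It ignores biases, layer-norm parameters, and positional embeddings. The function name and the example configuration are hypothetical.

```python
def approx_param_count(n_layers, d_model, vocab_size, ffn_mult=4, tied_embeddings=True):
    # Rough count that ignores biases, layer-norm parameters, and positional embeddings.
    attention = 4 * d_model * d_model         # W_Q, W_K, W_V, and the output projection
    ffn = 2 * d_model * (ffn_mult * d_model)  # up-projection and down-projection
    embedding = vocab_size * d_model
    # With weight tying, the output head reuses the embedding table.
    head = 0 if tied_embeddings else vocab_size * d_model
    return n_layers * (attention + ffn) + embedding + head

# Hypothetical configuration chosen only for illustration.
total = approx_param_count(n_layers=12, d_model=768, vocab_size=50000)
print(f"{total:,} parameters")
```

With the default 4× FFN, each layer contributes about 12 × d_model² parameters, so the layer stack dominates the total once the model is deep or wide enough.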
Training regimes and objectives
- Autoregressive (causal) language modeling:
  - Objective: maximize likelihood of next token given previous tokens.
  - Used in GPT-like models.
- Masked language modeling (MLM):
  - Random tokens masked; model predicts masked tokens using both left and right context.
  - Used in BERT-like models.
- Denoising objectives:
  - Span corruption (T5): mask spans and predict them using sequence-to-sequence mapping.
- Contrastive objectives, permutation LM (XLNet), and ELECTRA’s replaced-token detection are other variants.
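To make the autoregressive objective above concrete, the sketch below computes the average next-token cross-entropy for one sequence in pure Python. Maximizing likelihood is the same as minimizing this loss. The function name and the toy logits are assumptions for illustration, and a real training loop would compute this on batched tensors.

```python
import math

def next_token_loss(logits_per_position, token_ids):
    # logits_per_position[t] holds the vocabulary logits predicted after seeing
    # token_ids[: t + 1]; the target at position t is the next token, token_ids[t + 1].
    total = 0.0
    n_predictions = len(token_ids) - 1
    for t in range(n_predictions):
        logits = logits_per_position[t]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))  # stable log-sum-exp
        total += log_z - logits[token_ids[t + 1]]                   # -log p(target)
    return total / n_predictions

# Toy vocabulary of 3 tokens. Uniform logits give a loss of log(3).
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = next_token_loss(uniform, token_ids=[0, 2, 1])

# Logits that favor the correct next tokens give a lower loss.
confident = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]
better_loss = next_token_loss(confident, token_ids=[0, 2, 1])
```

Masked language modeling uses the same cross-entropy, but only at the masked positions and with access to context on both sides.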
Pretraining data
- Massive, diverse text corpora: web crawls (Common Crawl), books, Wikipedia, code, forums, news, curated datasets. Data quality, deduplication, and filtering significantly affect outcomes.
Fine-tuning and instruction tuning
- Fine-tuning: supervised training on labeled datasets to adapt to specific tasks.
- Instruction tuning: training on many human-written instruction-response pairs to improve instruction following.
- RLHF: reinforcement learning where human preferences or reward models guide generation toward desired behavior (e.g., helpfulness + safety).
Optimization and practical training tricks
- Optimizers: Adam/AdamW variants with weight decay.
- Learning rate schedules: linear warmup + decay.
- Regularization and stabilization: dropout, gradient clipping, and layer-wise learning rates.
- Mixed precision (FP16/BF16) and gradient checkpointing to reduce memory.
- Distributed training: data parallelism, model parallelism (tensor + pipeline), ZeRO optimizations.
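- Checkpointing and stability: save model and optimizer state regularly so that a run can be resumed or rolled back after a loss spike or hardware failure, alongside other techniques that prevent divergence at large scale.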
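The "linear warmup + decay" schedule listed above can be sketched in a few lines. The source does not say which decay shape is meant, so this example assumes linear decay to zero. Cosine decay is another common choice. The function name and the hyperparameter values are hypothetical.

```python
def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    # Linear warmup from 0 to peak_lr, then linear decay back to 0 at total_steps.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

steps = (0, 50, 100, 550, 1000)
schedule = [lr_at_step(s, peak_lr=1e-3, warmup_steps=100, total_steps=1000) for s in steps]
```

The learning rate peaks exactly at the end of warmup (step 100 here) and reaches zero at the final step.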
Tokenization and input representation
- Byte-Pair Encoding (BPE): subword segmentation by frequency merges.
- Unigram language model (as implemented in SentencePiece): probabilistic subword model that starts from a large candidate vocabulary and prunes it.
- WordPiece: variant used in BERT.
- Token vocabulary size typically ranges from ~30k to >100k. A larger vocabulary shortens token sequences but enlarges the embedding table; a smaller one does the reverse.
- Special tokens: BOS/EOS, padding, mask, task-specific markers.
- Weight tying: many models reuse the token embedding matrix as the output projection that produces logits, which reduces the parameter count. Not all models do this.
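To illustrate "subword segmentation by frequency merges", here is a toy BPE learner in pure Python. It is a simplified sketch: it works on a small word-count dictionary, starts from single characters, and omits end-of-word markers and byte-level handling. The function name and the toy corpus are assumptions for illustration.

```python
from collections import Counter

def learn_bpe_merges(word_counts, n_merges):
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(n_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        merged_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # fuse the pair into one symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] = merged_vocab.get(tuple(out), 0) + count
        vocab = merged_vocab
    return merges, vocab

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe_merges(corpus, n_merges=3)
```

On this toy corpus the frequent suffix "est" is built up over the first two merges, so "newest" and "widest" end up sharing a single subword unit.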
Scaling laws, compute, and infrastructure
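- Empirical scaling laws (Kaplan et al., 2020): test loss falls predictably as a power law in model size, dataset size, and training compute, with diminishing returns as each is scaled. Hoffmann et al. (2022) later found that, for a fixed compute budget, model size and the number of training tokens should grow in roughly equal proportion.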
- Compute: training a modern LLM can require thousands to tens of thousands of GPU-months (or TPU pods).
- Hardware: GPUs (A100), TPUs, specialized accelerators; high-bandwidth interconnects for model/data parallelism.
- Energy and cost: significant monetary and environmental costs; trends toward more efficient training methods (sparsity, LoRA, distillation).
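A back-of-the-envelope estimate helps connect model size, data, and compute. The sketch below uses the approximation from the scaling-law literature that training costs about 6 FLOPs per parameter per token. The hardware peak throughput, the utilization figure, and the model and data sizes are hypothetical numbers chosen only for illustration.

```python
def training_flops(n_params, n_tokens):
    # Common approximation: about 6 floating-point operations per parameter per
    # training token (forward plus backward pass), ignoring attention-specific terms.
    return 6 * n_params * n_tokens

def accelerator_days(flops, peak_flops_per_sec, utilization):
    # Wall-clock time on one accelerator at a sustained fraction of peak throughput.
    seconds = flops / (peak_flops_per_sec * utilization)
    return seconds / 86400

# Hypothetical run: 1e9 parameters trained on 2e10 tokens, on hardware with an
# assumed peak of 1e14 FLOP/s running at 40% utilization.
flops = training_flops(1e9, 2e10)
days = accelerator_days(flops, peak_flops_per_sec=1e14, utilization=0.4)
print(f"{flops:.2e} FLOPs, about {days:.1f} accelerator-days")
```

Because the cost is proportional to the product of parameters and tokens, scaling both by 10× raises the compute requirement by 100×.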
Inference: sampling, decoding, and context
- Context window: model has a fixed maximum input length (context window). Large windows enable longer conversations and document-level tasks.
- Decoding strategies:
  - Greedy: choose the argmax at each step (fast, but prone to repetitive output).
  - Beam search: keeps the k best partial sequences at each step (common in seq2seq tasks).
  - Sampling with temperature: adjusts randomness. Temperature <1 sharpens the distribution; >1 flattens it.
  - Top-k and top-p (nucleus) sampling: top-k samples only from the k most probable tokens; top-p samples from the smallest set of tokens whose cumulative probability reaches p. Both cut off the low-probability tail.
- In-context learning: LLMs can be conditioned on few examples in the prompt to perform new tasks without gradient updates.
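The decoding strategies above differ only in how they turn logits into a chosen token. The following pure-Python sketch shows temperature scaling and nucleus (top-p) filtering. The helper names and the example logits are assumptions for illustration and are not tied to any particular library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    # Keep the smallest set of highest-probability tokens whose cumulative mass
    # reaches p, then renormalize. All other tokens get probability 0.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

def sample(probs, rng=random):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
sharp = softmax(logits, temperature=0.5)  # more mass on the top token
flat = softmax(logits, temperature=2.0)   # closer to uniform
nucleus = top_p_filter(softmax(logits), p=0.8)
token = sample(nucleus, rng=random.Random(0))
```

Greedy decoding corresponds to always taking the index of the largest probability, and top-k is the same filter with a fixed count in place of a cumulative threshold.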
Capabilities, applications, and examples
Capabilities:
- Language generation: coherent paragraphs, storytelling.
- Summarization: extractive/abstractive summaries.
- Translation: high-quality translation for many language pairs.
- Question answering and knowledge retrieval.
- Code generation and assistance (e.g., autocomplete, bug fixes).
- Reasoning and chain-of-thought (limited, emergent in large models).
- Classification, extraction, data-to-text.
Applications:
- Chatbots and virtual assistants.
- Content creation (articles, marketing copy).
- Customer support automation.
- Education and tutoring.
- Search augmentation and retrieval-augmented generation (RAG).
- Programming assistants (e.g., GitHub Copilot).
- Scientific literature search and drafting.
Example: In-context learning
- Prompt with examples (input → output) and a new input; the model continues by producing the output, effectively “learning” from the prompt.
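A few-shot prompt of this kind is just structured text. The sketch below shows one way to assemble it. The "Input:"/"Output:" labels, the function name, and the translation examples are arbitrary choices made for illustration, not a required format.

```python
def build_few_shot_prompt(examples, new_input, instruction=""):
    # examples: list of (input, output) pairs shown to the model as demonstrations.
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {new_input}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    examples=[("cheese", "fromage"), ("bread", "pain")],
    new_input="water",
    instruction="Translate English to French.",
)
print(prompt)
```

No weights change. The demonstrations influence the output only by being part of the context the model conditions on.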
Limitations, failure modes, and risks
- Hallucinations: generating plausible-sounding but false or unverifiable statements.
- Calibration and overconfidence: softmax probabilities are not reliable confidence scores.
- Biases and toxic language: reproduction/amplification of societal biases present in training data.
- Memorization and privacy: models can regurgitate verbatim training text, including personal data.
- Adversarial prompting: prompts engineered to elicit harmful outputs.
- Context window constraints: the model cannot attend to anything that falls outside its context window, so earlier parts of a long conversation or document are lost.
- Fragile reasoning: models may mimic reasoning chains but can be brittle on multi-step logical tasks.
- Compute and resource centralization: concentration of capabilities in a few well-funded organizations.
Interpretability and mechanistic understanding
- Activation and attention analyses: inspecting attention maps provides partial intuition but doesn’t fully explain behavior.
- Probing classifiers: test if linguistic features are encoded in activations.
- Mechanistic interpretability: attempts to map learned features to human-understandable circuits. This is ongoing research; notable work has identified specific circuits such as induction heads, which support copying patterns seen earlier in the context.
- Limitations: overall understanding is incomplete; models are complex high-dimensional systems.
Alignment, safety, and governance
- Alignment: making models act according to human values and intentions.
- Techniques:
  - Dataset filtering and pretraining curation.
  - Fine-tuning with safe instruction datasets.
  - RLHF (Ouyang et al., 2022) to optimize for human preferences.
  - External tools: retrieval, safety filters, and rule-based checks.
- Governance and regulation: transparency, model cards, red-teaming, and policy-level controls are being explored to manage societal impacts.
Current state of the field (as of 2024–2025)
- Models in production: numerous LLMs at scale (100B+ parameters) deployed commercially.
- Rapid pace: improvements in model engineering, multimodal integration (text+image+audio), and retrieval-augmented systems.
- Democratization trends: open weights (e.g., LLaMA, OPT) and efficient fine-tuning methods (LoRA) enable broader access.
- Consolidation: heavy compute/cost still centralizes cutting-edge models in large organizations.
- Safety and policy: active research and regulation efforts; significant public debate.
Future directions and open research questions
- Multimodality: unified models handling text, vision, audio, and structured data.
- Long context modeling: efficient memory and recurrence for very long documents.
- Efficient and green training: sparsity (Mixture-of-Experts), pruning, parameter-efficient fine-tuning, distillation.
- Better reasoning and compositionality: strengthening systematic generalization.
- Mechanistic interpretability at scale: understanding circuits and emergent behaviors.
- Robust alignment: scalable methods to ensure safety, truthfulness, and value alignment.
- Personalization and privacy-preserving models.
- Embodied agents and tool use: integrating LLMs as planners controlling robots or software agents.
Practical appendix: pseudocode and short examples
- Scaled dot-product attention (NumPy sketch)
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # X: [seq_len, d_model] input representations
    # Wq, Wk: [d_model, d_k]; Wv: [d_model, d_v] projection matrices
    # Returns: [seq_len, d_v]
    Q = X @ Wq  # [seq_len, d_k]
    K = X @ Wk  # [seq_len, d_k]
    V = X @ Wv  # [seq_len, d_v]
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)   # [seq_len, seq_len]
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # [seq_len, d_v]

# For multi-head attention, run several such projections in parallel,
# concatenate the outputs, then project back to d_model.
- Autoregressive generation loop (simple)
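import numpy as np

def generate(model, initial_tokens, max_length, eos_id, temperature=1.0):
    # model(context) is assumed to return a 1-D array of logits for the next token.
    context = list(initial_tokens)  # list of token ids
    while len(context) < max_length:
        logits = np.asarray(model(context), dtype=float) / temperature  # temperature > 0
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()  # softmax over the vocabulary
        next_token = int(np.random.choice(len(probs), p=probs))  # or argmax, top-k, top-p
        if next_token == eos_id:
            break
        context.append(next_token)
    return context
- Example: retrieval-augmented generation (RAG) high-level flow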
- User query q
- Retrieve top-k documents D from an external corpus using vector search (embedding-based)
- Construct prompt: [system instruction] + [documents D] + [query q]
- Generate answer conditioned on retrieval (reduces hallucination; allows up-to-date info)
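The flow above can be sketched end to end with toy components. In this pure-Python example a bag-of-words count vector stands in for a learned embedding model, and the final generation call is left out. All function names, the tiny corpus, and the prompt layout are assumptions made for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in embedding: bag-of-words counts. A real system would use a learned
    # embedding model and a vector index instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(system, documents, query):
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    return f"{system}\n\nDocuments:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The Transformer replaced recurrence with self-attention.",
    "Byte-Pair Encoding builds subword units by merging frequent pairs.",
    "Beam search keeps the k best partial sequences.",
]
query = "How does beam search work?"
top_docs = retrieve(query, corpus, k=1)
prompt = build_prompt("Answer using only the documents.", top_docs, query)
# `prompt` would then be passed to the language model for generation.
```

Keeping retrieval separate from generation means the corpus can be updated without retraining the model.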
Practical tips for developers
- Use pretrained models and API endpoints for fast iteration.
- Use instruction tuning and RLHF when high-quality aligned behavior is required.
- Prefer retrieval or grounding for factual tasks to reduce hallucinations.
- Apply prompt engineering carefully; prefer few-shot or chain-of-thought prompts if beneficial.
- Monitor outputs for toxicity, bias, and privacy leaks; use filters and human review for critical applications.
- Evaluate models with both automatic metrics and human judgment; design task-specific metrics.
Recommended readings
- Vaswani et al., “Attention is All You Need” (2017)
- Devlin et al., “BERT” (2018)
- Radford et al., “GPT” series papers and Brown et al., “GPT-3” (2020)
- Kaplan et al., “Scaling Laws for Neural Language Models” (2020)
- Ouyang et al., “Training Language Models to Follow Instructions with Human Feedback” (InstructGPT / RLHF, 2022)
- Hoffmann et al., “Training Compute-Optimal Large Language Models” (2022)
- Clark et al., various interpretability and mechanistic papers
Concluding remarks
Large language models are powerful, general-purpose statistical models of language whose capabilities emerge from combining large amounts of data, effective architectures (Transformers), and vast compute. They have transformed NLP and many downstream fields but come with technical, ethical, and societal challenges. Progress will hinge on improving efficiency, robustness, interpretability, and alignment while ensuring equitable access and governance.