PostgreSQL Internals — A Deep Dive
This article is a comprehensive, in-depth examination of PostgreSQL internals. It covers history, architecture, storage, transaction and concurrency control, indexing, query planning and execution, background processes, durability and recovery, replication, extensions, observability, performance tuning, and future directions. Where useful, SQL and C code snippets illustrate how to inspect and extend the server.
Table of contents
- Introduction and brief history
- High-level architecture
- Storage layout: files, pages, tuples, TOAST
- MVCC, tuple visibility, and transaction IDs
- Locking, isolation levels, and Serializable Snapshot Isolation (SSI)
- Write-Ahead Logging (WAL), checkpoints, recovery, and PITR
- Buffer management and caching
- Vacuum, autovacuum, bloat, and wraparound
- Query planner and optimizer internals
- Executor and runtime components
- Index types and internals (B-tree, GIN, GiST, BRIN, SP-GiST)
- Replication and high-availability (physical and logical)
- Extensions, hooks, and background workers
- Observability, statistics, and diagnostic tools
- Performance tuning: key GUCs and practical knobs
- Tools, debugging and reading WAL
- Current state and future directions
- Practical examples and code snippets
- Conclusion and recommended reading
Introduction and brief history
PostgreSQL is a mature, open-source relational database system descended from the POSTGRES project at UC Berkeley (Stonebraker et al.). It has supported advanced features like extensibility, complex SQL, and MVCC for decades. PostgreSQL’s modern architecture evolved to emphasize correctness, extensibility, and a stable interface for extensions and storage.
Key historical milestones:
- POSTGRES (1986–1994): object-relational research.
- PostgreSQL 6.x–8.x: stabilization and SQL standard compliance.
- PostgreSQL 9.x: streaming replication, hot standby, and logical decoding.
- PostgreSQL 10+ : native logical replication, declarative partitioning improvements.
- Recent: JIT compilation for expressions (libgccjit), parallel query improvements, improved partitioning, incremental improvements to replication and monitoring.
High-level architecture
PostgreSQL uses a multi-process architecture: each client connection is a separate OS process (a server process). There is a "postmaster" (now just the "postgres" parent process) that accepts connections and forks worker processes.
Key process types:
- Backend processes: one per client connection.
- Background worker processes:
- checkpointer (checkpointer)
- background writer (bgwriter)
- wal writer (walwriter)
- wal sender / wal receiver (replication)
- autovacuum launcher and workers
- stats collector
- logical replication workers
- archiver (optional)
- Shared memory (Postgres shared buffers, locks, etc.)
- Semaphores, spinlocks, lightweight locks (LWLocks)
Inter-process communication is via shared memory, semaphores, and signals. The server uses POSIX primitives where available.
Main components (logical):
- Storage manager (heap, files, forks)
- Buffer manager (shared_buffers)
- WAL subsystem (XLOG)
- Lock manager (lightweight locks for data structures; heavyweight locks for DDL)
- Transaction manager (XID handling, subtransactions)
- Query planner and optimizer
- Executor
- Logical replication & logical decoding
Storage layout: files, pages, tuples, TOAST
Physical storage:
- Data stored in the cluster's data directory under base/ (default tablespaces), pgtblspc (tablespaces), global/, pgwal (WAL).
- Each relation (table/index) is represented by one or more files (1GB segment limit). Several forks per relation:
- main (heap/index)
- fsm (free space map) — legacy in older versions, now replaced by FSM management in shared buffers? (Note: older versions used filesystem fsm fork)
- vm (visibility map)
- init (initial content)
- Files are in a binary layout of pages (8 KB by default) called blocks.
Page format:
- Page header
- Item ID (line pointer) array
- Tuple data area (grows from end toward the line pointer array)
- Special space (index-specific)
- Checksum (optional)
Heap tuple header includes:
- txmin (inserting transaction id), txmax (deleting/locking TX)
- t_cid (command id)
- t_infomask bits (visibility flags, hint bits)
- t_ctid (pointer to physical row version)
- optional frozen/xmin/xmax markers
TOAST
- "The Oversized-Attribute Storage Technique"; large attributes (TOAST-able columns like large text or bytea) are stored out-of-line in a TOAST table with compression and chunking. This keeps main heap pages smaller and reduces I/O for common operations.
Visibility map
- A map recording pages that are all-visible (no tuples require visibility checks). Helps vacuum and index-only scans.
MVCC, tuple visibility, and transaction IDs
PostgreSQL implements MVCC by storing multiple versions of tuples (row versions) in-place (updates create new tuple versions in the same relation file). There is no centralized version store or delta; instead each tuple carries xmin/xmax.
Visibility logic
- A backend uses a snapshot to decide which tuples are visible:
- tuples with xmin committed before snapshot and not deleted before snapshot are visible
- tuples with xmax set to a transaction committed before snapshot are invisible
- Snapshot contains information: xmin, xmax, and an array of in-progress XIDs.
Transaction IDs (XIDs)
- 32-bit wraparound: requires freezing of old rows via VACUUM to prevent incorrect visibility after wraparound.
- Frozen transactions mark tuples as effectively immortal (use
FrozenXID). - autovacuum/vacuum freeze aims to prevent XID wraparound.
Hint bits
- To avoid consulting WAL on visibility repeatedly, backends set hint bits in the tuple header once a tuple status (committed/aborted) is known. These hint bits are non-logged (but can be set using WAL if WAL-logged hint bit updates are disabled? Typically hint bit changes are not WAL-logged; but their presence is safe because they encode info that can be recomputed).
CTID
- After updates, the old row remains; the new version gets a new physical location. CTID refers to the physical location allowing e.g. internal navigation or updates by physical tuple reference.
Subtransactions (savepoints)
- Implemented using sub-XIDs. When a subtransaction commits, its XIDs are converted into a top-level XID, with translation arrays.
Snapshot types
- MVCC snapshot (current/older view)
- Historic snapshots for logical decoding or serializable transactions.
Locking, isolation levels, and Serializable Snapshot Isolation (SSI)
Lock manager
- Lightweight locks (LWLocks): protect buffer control structures and data structures inside PostgreSQL.
- Heavyweight locks: relation-level locks (AccessShareLock, AccessExclusiveLock, etc.) used for DDL and concurrency control.
- Tuple-level locking: Use of xmin/xmax and lock tuples via updating xmax with special lock parameters.
Isolation levels
- Read Uncommitted: maps to Read Committed behavior (Postgres does not support dirty reads).
- Read Committed: each statement sees a snapshot as of the start of the statement; new snapshot per statement.
- Repeatable Read: fixed snapshot for the whole transaction (since Postgres 9.1, Repeatable Read used to be serializable? Now Postgres implements Serializable as SSI).
- Serializable: implemented by Serializable Snapshot Isolation (SSI), which detects dangerous structures and aborts transactions that would violate serializability.
Serializable Snapshot Isolation (SSI)
- Allows snapshot isolation semantics but detects write-read or read-write conflicts forming dangerous structures and aborts one of the transactions.
- Postgres tracks read/write dependencies with predicate locking utilities and conflicts to maintain serializability.
Predicate locks and index-only locking
- Postgres uses predicate locks logically for range queries via index-based predicate checks (e.g., during range scans) to detect phantoms under Serializable isolation. Implementation uses SIREAD locks (no physical lock is attached to tuples, it's recorded in a separate structure).
Two-phase commit
- Prepare and commit for distributed transactions with external resource managers.
Lock monitoring
- Views: pglocks, pgstatactivity, pgstat_activity shows waiting/blocked states.
Write-Ahead Logging (WAL), checkpoints, recovery, and PITR
WAL fundamentals
- WAL records every change needed to reconstruct data pages; writes WAL to disk before pages are flushed (WAL-first principle). This enables crash recovery.
- WAL files live in pgwal/ (or pgxlog/ pre-9.5). WAL segments are 16 MB by default.
WAL insert and flush
- XLogInsert builds a WAL record. WAL writer flushes groups of WAL records to disk (synchronous or asynchronous depending on commit settings).
- Commit record ensures durability of transaction.
Checkpoints
- Checkpointer scans buffers and writes dirty buffers to disk to make WAL replay shorter. Checkpoints write checkpoint WAL records and update pg_control.
- Frequent checkpoints reduce recovery time but increase I/O.
Recovery and PITR
- pg_basebackup or file copy for base backup.
- Archive WAL segments (archive_command) for point-in-time recovery.
- Recovery applies WAL records from base backup forward to reconstruct consistent state.
- Timelines track branches from recovery.
Synchronous and asynchronous commit
- synchronous_commit controls whether a transaction waits for WAL and/or commit to be flushed.
- fsync and fullpagewrites settings influence durability and performance.
Logical decoding (logical WAL)
- WAL can be decoded into logical changes for replication or auditing using logical decoding plugins (wal2json, pgoutput). Logical decoding relies on consistent snapshots and retention of WAL via replication slots.
WAL configuration keys
- wal_level (minimal, replica, logical)
- maxwalsenders
- walkeepsize
- maxreplicationslots
- checkpointtimeout, checkpointcompletion_target
- maxwalsize, minwalsize
- synchronous_commit
WAL and parallelism
- WAL must serialize some changes, but many types of queries and index builds can be parallelized (with careful WAL behavior).
WAL archiving and streaming replication
- WAL segments can be archived and replayed by standbys; replication can stream WAL.
Tools: pgwaldump, pgxlogdump (older), pg_receivewal, logical decoding plugins.
Buffer management and caching
Shared buffers
- shared_buffers holds cached disk pages (8 KB each).
- Buffers are protected by buffer headers and lightweight locks (buf_hdr lock) and managed by a replacement strategy.
Replacement strategy
- Historically a clock sweep algorithm with improvements. Buffer pins prevent eviction of pages in use.
Checkpoint and bgwriter
- Background writer proactively writes dirty buffers to reduce checkpoint I/O spikes.
- Checkpointer writes remaining dirty buffers during checkpoint.
OS cache
- Postgres relies on OS page cache; sharedbuffers is not the entire cache. There is interplay between sharedbuffers and OS cache; recommended sizing depends on workload.
Buffer lifecycle considerations
- Read: buffer read into shared_buffers -> pinned -> used -> unpinned.
- Write: modification marks buffer dirty and WAL is emitted for correct page updates.
Buffer management internals
- Buffer descriptor: keeps flags, refcount, lock, tag (block id), usage_count for eviction.
Vacuum, autovacuum, bloat, and wraparound
VACUUM
- Reclaims space from dead tuples and makes tuples visible for reuse.
- Vacuum updates visibility map and may set hint bits and freeze tuples.
- VACUUM FULL rewrites the table, reclaiming physical disk space (exclusive lock).
Autovacuum
- Background autovacuum workers run automatically to clean tables and prevent wraparound.
- Autovacuum settings include autovacuumvacuumthreshold, autovacuumvacuumscalefactor, autovacuumanalyzescalefactor, autovacuummaxworkers, autovacuumnaptime, autovacuumfreezemaxage.
Bloat
- Dead tuple accumulation increases table size and slows scans.
- Regular autovacuum tuning and aggressive VACUUMs in high-update workloads reduce bloat.
- Partitioning and HOT updates reduce bloat.
HOT (Heap-Only Tuple) updates
- Update that creates a new tuple in the same page without updating indexes when index columns are unchanged, avoiding index maintenance and reducing WAL.
- HOT chain: multiple HOT updates can chain, but can complicate vacuum.
Wraparound protection
- Because XIDs are 32-bit, Postgres must freeze tuples older than a threshold; autovacuum+vacuum freeze handle this. If not handled, Postgres will force a shutdown to protect data.
Visibility map and Index-only scans
- The visibility map flags entire pages as all-visible so index-only scans can be used without visiting heap pages.
Query planner and optimizer internals
Overview
- Planner transforms SQL parse trees into query plans with nodes (SeqScan, IndexScan, HashJoin, MergeJoin, NestLoop, etc.). It estimates costs using a cost model.
- Heuristics and dynamic programming create join orders and select join algorithms.
Statistics
- PostgreSQL keeps per-column statistics: ndistinct, mostcommon_vals, histogram bounds, correlation, etc.
- Collect statistics via ANALYZE; parameters like defaultstatisticstarget, ALTER TABLE ... ALTER COLUMN SET STATISTICS.
- Extended statistics: multi-column statistics (e.g., for correlated columns) to improve cardinality estimation.
Cost model
- Cost parameters include:
- cputuplecost
- cpuindextuple_cost
- randompagecost
- seqpagecost
- effectivecachesize (planner assumption about cached data)
- The planner uses estimated rows × per-row cost + startup costs.
Join planning
- Planner considers join orderings using dynamic search (for smaller join numbers) or genetic-like heuristics for larger joins (geqo).
- Join algorithms:
- Nested Loop: good for small inner or indexed inner scans.
- Hash Join: build hash table on join key for inner, then probe.
- Merge Join: requires sorted inputs (good when both sides are ordered).
Selectivity estimation
- Planner estimates selectivity of predicates: equality, range, IN, LIKE, etc.
- Mismatch between real distribution and statistics leads to cardinality estimation errors and bad plans.
Plan nodes and instrumentation
- EXPLAIN ANALYZE shows planner’s chosen plan and runtime timing.
JIT (Just-In-Time) compilation
- For expression evaluation and tuple deforming, Postgres can use LLVM (libgccjit or llvm) to generate machine code at planning time (since PostgreSQL 11-ish). Useful on heavy CPU workloads.
Plan caching
- Prepared statements, PL/pgSQL functions, and generic plan caching; generic plans may be used for prepared statements with parameters — the planner may choose a generic or custom plan based on usage statistics.
Parallel planning/execution
- PostgreSQL supports parallel sequential scans, parallel joins, and parallel aggregate with workers. There are constraints (e.g., not ...