PostgreSQL Internals — Concise Summary This summary captures the core concepts, architecture, storage, concurrency, recovery, performance and tooling described in the full deep dive into PostgreSQL internals. It highlights the mechanisms DBAs, developers and extension authors most commonly need to understand and tune. Introduction & History Open-source RDBMS descended from UC Berkeley POSTGRES; emphasis on correctness, extensibility and a stable internal API. Major milestones: POSTGRES research → SQL standard compliance → streaming replication & logical decoding (9.x–10+) → JIT, parallelism, improved partitioning and logical replication in recent releases. High-level Architecture Multi-process design: one backend per client plus a parent "postgres" and background worker processes (checkpointer, bgwriter, walwriter, autovacuum, stats collector, replication workers, etc.). Shared memory (shared_buffers), LWLocks, semaphores and signals for IPC. Logical components: storage manager, buffer manager, WAL subsystem, lock manager, transaction manager, planner, executor. Storage Layout Relations stored as files/segments in data directory (base/, pg_wal/, pg_tblspc). Pages (8KB default) contain headers, line pointers and tuple data. Tuple headers include t_xmin/t_xmax, command id, hint bits and ctid. Large attributes use TOAST (out‑of‑line compressed chunks). Visibility map marks pages as all-visible to enable index-only scans and speed vacuum decisions. MVCC & Transaction IDs MVCC implemented by storing multiple tuple versions in-place; visibility decided via per-backend snapshots (xmin/xmax and in-progress XIDs). 32-bit XID wraparound requires VACUUM/freeze; autovacuum and hint bits mitigate visibility checks and WAL churn. CTID points to physical row; subtransactions use sub-XIDs which are later resolved to top-level XIDs. Locking, Isolation & SSI Lock types: LWLocks for internal structures, heavyweight locks for relations/DDL, tuple-level locks via xmax semantics. Supported isolation levels: Read Committed (per-statement snapshot), Repeatable Read (transaction snapshot), Serializable implemented via Serializable Snapshot Isolation (SSI) — detects dangerous structures and aborts to ensure serializability. Predicate/SIREAD locks support phantom detection under serializable isolation; views like pg_locks and pg_stat_activity help monitoring. WAL, Checkpoints, Recovery & PITR WAL (pg_wal) records redo information; WAL-first principle—WAL flushed before data pages to guarantee durability. Checkpointer flushes buffers to bound recovery time; trade-off between checkpoint frequency and I/O spikes. Point-in-time recovery via base backup + archived WAL; logical decoding (wal_level=logical) enables logical replication and change capture. Buffer Management & Caching shared_buffers caches pages; buffers managed with descriptors, pins and a clock-like replacement policy. Background writer reduces checkpoint I/O peaks; OS page cache and shared_buffers interact—sizing depends on workload. Vacuum, Autovacuum & Bloat VACUUM reclaims dead tuples, sets hint/frozen XIDs and updates visibility map. VACUUM FULL rewrites table (exclusive lock). Autovacuum runs in background; parameters control thresholds, cost, and freeze behavior to prevent XID wraparound. HOT updates avoid index changes when possible and reduce bloat; insufficient vacuuming leads to bloat and wraparound risks. Planner & Optimizer Planner converts SQL into physical plans (SeqScan, IndexScan, HashJoin, MergeJoin, NestLoop, etc.) using cost estimates and statistics. Statistics (per-column and multicolumn) and cost GUCs (random_page_cost, effective_cache_size, cpu_tuple_cost) drive plan selection. Techniques: GEQO for many-way joins, JIT compilation for expressions, parallel planning/execution and planner hooks for extensions. Executor & Runtime Executor traverses plan tree; tuples flow via TupleTableSlot. Memory contexts manage lifecycle of allocations. Sorting and materialization use Tuplesort/Tuplestore; work_mem governs spill-to-disk behavior. Function manager (fmgr), triggers, constraints and SPI enable extensibility and in-server SQL execution. Index Types & Internals Primary types: B-tree (default), GIN, GiST, BRIN, SP-GiST. Each trades off size, maintenance cost and query shapes. Index-only scans rely on visibility map. Concurrent index builds and VACUUM handling vary by index type. Replication & High-Availability Physical (WAL streaming) and logical replication (WAL decoding → logical changes). Synchronous replication controls commit durability across nodes. Replication slots prevent premature WAL removal; logical replication supports per-table change publication/ subscription. Failover tooling (repmgr, Patroni, pg_auto_failover) and third-party multi-master solutions (BDR, Citus) address HA and distribution needs. Extensions, Hooks & Background Workers Extensions via CREATE EXTENSION and PGXS; C extensions use fmgr and require care (native code in server process). Hook APIs (planner, executor, utility) and bgworker API enable powerful server customization (pg_stat_statements, PostGIS, logical connectors). Observability & Diagnostic Tools System views: pg_stat_activity, pg_stat_database, pg_stat_statements, pg_locks, pg_stat_bgwriter, etc. EXPLAIN (ANALYZE, BUFFERS), auto_explain, pg_waldump, pageinspect, perf/eBPF and log analyzers (pgbadger) for deep diagnostics. Performance Tuning (Key GUCs) Memory: shared_buffers, effective_cache_size, work_mem, maintenance_work_mem. WAL: wal_buffers, max_wal_size, synchronous_commit, checkpoint settings. Parallelism and worker limits: max_worker_processes, max_parallel_workers_per_gather. Tune planner costs only after measurement. Tools & Debugging WAL inspection: pg_waldump, pg_recvlogical, logical decoding plugins (wal2json, pgoutput). Page and WAL inspection: pageinspect, pg_controldata, EXPLAIN ANALYZE, and crash/recovery testing with base backups + archived WAL. Current State & Future Directions Ongoing improvements: parallel query, logical replication, partitioning, JIT and indexing enhancements. Research directions: heap alternatives (ZHeap), better cardinality estimation, distributed SQL integration and richer JIT/LLVM use. Practical Examples & Common Commands Inspect sessions: SELECT from pg_stat_activity; check locks via pg_locks. EXPLAIN (ANALYZE, BUFFERS) to diagnose plans; create replication slots and run pg_basebackup for base backups. Use pageinspect and pg_waldump for low-level debugging and WAL examination. Conclusion & Reading PostgreSQL internals balance correctness and performance through MVCC, WAL, a sophisticated planner/executor and a rich extension model. Understanding these internals aids tuning, debugging and extension development. Recommended next steps include reading the official docs, browsing src/backend, using EXPLAIN ANALYZE and experimenting with small setups.

PostgreSQL Internals — A Deep Dive

This article is a comprehensive, in-depth examination of PostgreSQL internals. It covers history, architecture, storage, transaction and concurrency control, indexing, query planning and execution, background processes, durability and recovery, replication, extensions, observability, performance tuning, and future directions. Where useful, SQL and C code snippets illustrate how to inspect and extend the server.

Table of contents

Introduction and brief history
High-level architecture
Storage layout: files, pages, tuples, TOAST
MVCC, tuple visibility, and transaction IDs
Locking, isolation levels, and Serializable Snapshot Isolation (SSI)
Write-Ahead Logging (WAL), checkpoints, recovery, and PITR
Buffer management and caching
Vacuum, autovacuum, bloat, and wraparound
Query planner and optimizer internals
Executor and runtime components
Index types and internals (B-tree, GIN, GiST, BRIN, SP-GiST)
Replication and high-availability (physical and logical)
Extensions, hooks, and background workers
Observability, statistics, and diagnostic tools
Performance tuning: key GUCs and practical knobs
Tools, debugging and reading WAL
Current state and future directions
Practical examples and code snippets
Conclusion and recommended reading

Introduction and brief history

PostgreSQL is a mature, open-source relational database system descended from the POSTGRES project at UC Berkeley (Stonebraker et al.). It has supported advanced features like extensibility, complex SQL, and MVCC for decades. PostgreSQL’s modern architecture evolved to emphasize correctness, extensibility, and a stable interface for extensions and storage.

Key historical milestones:

POSTGRES (1986–1994): object-relational research.
PostgreSQL 6.x–8.x: stabilization and SQL standard compliance.
PostgreSQL 9.x: streaming replication, hot standby, and logical decoding.
PostgreSQL 10+ : native logical replication, declarative partitioning improvements.
Recent: JIT compilation for expressions (libgccjit), parallel query improvements, improved partitioning, incremental improvements to replication and monitoring.

High-level architecture

PostgreSQL uses a multi-process architecture: each client connection is a separate OS process (a server process). There is a "postmaster" (now just the "postgres" parent process) that accepts connections and forks worker processes.

Key process types:

Backend processes: one per client connection.
Background worker processes:
checkpointer (checkpointer)
background writer (bgwriter)
wal writer (walwriter)
wal sender / wal receiver (replication)
autovacuum launcher and workers
stats collector
logical replication workers
archiver (optional)
Shared memory (Postgres shared buffers, locks, etc.)
Semaphores, spinlocks, lightweight locks (LWLocks)

Inter-process communication is via shared memory, semaphores, and signals. The server uses POSIX primitives where available.

Main components (logical):

Storage manager (heap, files, forks)
Buffer manager (shared_buffers)
WAL subsystem (XLOG)
Lock manager (lightweight locks for data structures; heavyweight locks for DDL)
Transaction manager (XID handling, subtransactions)
Query planner and optimizer
Executor
Logical replication & logical decoding

Storage layout: files, pages, tuples, TOAST

Physical storage:

Data stored in the cluster's data directory under base/ (default tablespaces), pgtblspc (tablespaces), global/, pgwal (WAL).
Each relation (table/index) is represented by one or more files (1GB segment limit). Several forks per relation:
main (heap/index)
fsm (free space map) — legacy in older versions, now replaced by FSM management in shared buffers? (Note: older versions used filesystem fsm fork)
vm (visibility map)
init (initial content)
Files are in a binary layout of pages (8 KB by default) called blocks.

Page format:

Page header
Item ID (line pointer) array
Tuple data area (grows from end toward the line pointer array)
Special space (index-specific)
Checksum (optional)

Heap tuple header includes:

txmin (inserting transaction id), txmax (deleting/locking TX)
t_cid (command id)
t_infomask bits (visibility flags, hint bits)
t_ctid (pointer to physical row version)
optional frozen/xmin/xmax markers

TOAST

"The Oversized-Attribute Storage Technique"; large attributes (TOAST-able columns like large text or bytea) are stored out-of-line in a TOAST table with compression and chunking. This keeps main heap pages smaller and reduces I/O for common operations.

Visibility map

A map recording pages that are all-visible (no tuples require visibility checks). Helps vacuum and index-only scans.

MVCC, tuple visibility, and transaction IDs

PostgreSQL implements MVCC by storing multiple versions of tuples (row versions) in-place (updates create new tuple versions in the same relation file). There is no centralized version store or delta; instead each tuple carries xmin/xmax.

Visibility logic

A backend uses a snapshot to decide which tuples are visible:
tuples with xmin committed before snapshot and not deleted before snapshot are visible
tuples with xmax set to a transaction committed before snapshot are invisible
Snapshot contains information: xmin, xmax, and an array of in-progress XIDs.

Transaction IDs (XIDs)

32-bit wraparound: requires freezing of old rows via VACUUM to prevent incorrect visibility after wraparound.
Frozen transactions mark tuples as effectively immortal (use FrozenXID).
autovacuum/vacuum freeze aims to prevent XID wraparound.

Hint bits

To avoid consulting WAL on visibility repeatedly, backends set hint bits in the tuple header once a tuple status (committed/aborted) is known. These hint bits are non-logged (but can be set using WAL if WAL-logged hint bit updates are disabled? Typically hint bit changes are not WAL-logged; but their presence is safe because they encode info that can be recomputed).

CTID

After updates, the old row remains; the new version gets a new physical location. CTID refers to the physical location allowing e.g. internal navigation or updates by physical tuple reference.

Subtransactions (savepoints)

Implemented using sub-XIDs. When a subtransaction commits, its XIDs are converted into a top-level XID, with translation arrays.

Snapshot types

MVCC snapshot (current/older view)
Historic snapshots for logical decoding or serializable transactions.

Locking, isolation levels, and Serializable Snapshot Isolation (SSI)

Lock manager

Lightweight locks (LWLocks): protect buffer control structures and data structures inside PostgreSQL.
Heavyweight locks: relation-level locks (AccessShareLock, AccessExclusiveLock, etc.) used for DDL and concurrency control.
Tuple-level locking: Use of xmin/xmax and lock tuples via updating xmax with special lock parameters.

Isolation levels

Read Uncommitted: maps to Read Committed behavior (Postgres does not support dirty reads).
Read Committed: each statement sees a snapshot as of the start of the statement; new snapshot per statement.
Repeatable Read: fixed snapshot for the whole transaction (since Postgres 9.1, Repeatable Read used to be serializable? Now Postgres implements Serializable as SSI).
Serializable: implemented by Serializable Snapshot Isolation (SSI), which detects dangerous structures and aborts transactions that would violate serializability.

Serializable Snapshot Isolation (SSI)

Allows snapshot isolation semantics but detects write-read or read-write conflicts forming dangerous structures and aborts one of the transactions.
Postgres tracks read/write dependencies with predicate locking utilities and conflicts to maintain serializability.

Predicate locks and index-only locking

Postgres uses predicate locks logically for range queries via index-based predicate checks (e.g., during range scans) to detect phantoms under Serializable isolation. Implementation uses SIREAD locks (no physical lock is attached to tuples, it's recorded in a separate structure).

Two-phase commit

Prepare and commit for distributed transactions with external resource managers.

Lock monitoring

Views: pglocks, pgstatactivity, pgstat_activity shows waiting/blocked states.

Write-Ahead Logging (WAL), checkpoints, recovery, and PITR

WAL fundamentals

WAL records every change needed to reconstruct data pages; writes WAL to disk before pages are flushed (WAL-first principle). This enables crash recovery.
WAL files live in pgwal/ (or pgxlog/ pre-9.5). WAL segments are 16 MB by default.

WAL insert and flush

XLogInsert builds a WAL record. WAL writer flushes groups of WAL records to disk (synchronous or asynchronous depending on commit settings).
Commit record ensures durability of transaction.

Checkpoints

Checkpointer scans buffers and writes dirty buffers to disk to make WAL replay shorter. Checkpoints write checkpoint WAL records and update pg_control.
Frequent checkpoints reduce recovery time but increase I/O.

Recovery and PITR

pg_basebackup or file copy for base backup.
Archive WAL segments (archive_command) for point-in-time recovery.
Recovery applies WAL records from base backup forward to reconstruct consistent state.
Timelines track branches from recovery.

Synchronous and asynchronous commit

synchronous_commit controls whether a transaction waits for WAL and/or commit to be flushed.
fsync and fullpagewrites settings influence durability and performance.

Logical decoding (logical WAL)

WAL can be decoded into logical changes for replication or auditing using logical decoding plugins (wal2json, pgoutput). Logical decoding relies on consistent snapshots and retention of WAL via replication slots.

WAL configuration keys

wal_level (minimal, replica, logical)
maxwalsenders
walkeepsize
maxreplicationslots
checkpointtimeout, checkpointcompletion_target
maxwalsize, minwalsize
synchronous_commit

WAL and parallelism

WAL must serialize some changes, but many types of queries and index builds can be parallelized (with careful WAL behavior).

WAL archiving and streaming replication

WAL segments can be archived and replayed by standbys; replication can stream WAL.

Tools: pgwaldump, pgxlogdump (older), pg_receivewal, logical decoding plugins.

Buffer management and caching

Shared buffers

shared_buffers holds cached disk pages (8 KB each).
Buffers are protected by buffer headers and lightweight locks (buf_hdr lock) and managed by a replacement strategy.

Replacement strategy

Historically a clock sweep algorithm with improvements. Buffer pins prevent eviction of pages in use.

Checkpoint and bgwriter

Background writer proactively writes dirty buffers to reduce checkpoint I/O spikes.
Checkpointer writes remaining dirty buffers during checkpoint.

OS cache

Postgres relies on OS page cache; sharedbuffers is not the entire cache. There is interplay between sharedbuffers and OS cache; recommended sizing depends on workload.

Buffer lifecycle considerations

Read: buffer read into shared_buffers -> pinned -> used -> unpinned.
Write: modification marks buffer dirty and WAL is emitted for correct page updates.

Buffer management internals

Buffer descriptor: keeps flags, refcount, lock, tag (block id), usage_count for eviction.

Vacuum, autovacuum, bloat, and wraparound

VACUUM

Reclaims space from dead tuples and makes tuples visible for reuse.
Vacuum updates visibility map and may set hint bits and freeze tuples.
VACUUM FULL rewrites the table, reclaiming physical disk space (exclusive lock).

Autovacuum

Background autovacuum workers run automatically to clean tables and prevent wraparound.
Autovacuum settings include autovacuumvacuumthreshold, autovacuumvacuumscalefactor, autovacuumanalyzescalefactor, autovacuummaxworkers, autovacuumnaptime, autovacuumfreezemaxage.

Bloat

Dead tuple accumulation increases table size and slows scans.
Regular autovacuum tuning and aggressive VACUUMs in high-update workloads reduce bloat.
Partitioning and HOT updates reduce bloat.

HOT (Heap-Only Tuple) updates

Update that creates a new tuple in the same page without updating indexes when index columns are unchanged, avoiding index maintenance and reducing WAL.
HOT chain: multiple HOT updates can chain, but can complicate vacuum.

Wraparound protection

Because XIDs are 32-bit, Postgres must freeze tuples older than a threshold; autovacuum+vacuum freeze handle this. If not handled, Postgres will force a shutdown to protect data.

Visibility map and Index-only scans

The visibility map flags entire pages as all-visible so index-only scans can be used without visiting heap pages.

Query planner and optimizer internals

Overview

Planner transforms SQL parse trees into query plans with nodes (SeqScan, IndexScan, HashJoin, MergeJoin, NestLoop, etc.). It estimates costs using a cost model.
Heuristics and dynamic programming create join orders and select join algorithms.

Statistics

PostgreSQL keeps per-column statistics: ndistinct, mostcommon_vals, histogram bounds, correlation, etc.
Collect statistics via ANALYZE; parameters like defaultstatisticstarget, ALTER TABLE ... ALTER COLUMN SET STATISTICS.
Extended statistics: multi-column statistics (e.g., for correlated columns) to improve cardinality estimation.

Cost model

Cost parameters include:
cputuplecost
cpuindextuple_cost
randompagecost
seqpagecost
effectivecachesize (planner assumption about cached data)
The planner uses estimated rows × per-row cost + startup costs.

Join planning

Planner considers join orderings using dynamic search (for smaller join numbers) or genetic-like heuristics for larger joins (geqo).
Join algorithms:
Nested Loop: good for small inner or indexed inner scans.
Hash Join: build hash table on join key for inner, then probe.
Merge Join: requires sorted inputs (good when both sides are ordered).

Selectivity estimation

Planner estimates selectivity of predicates: equality, range, IN, LIKE, etc.
Mismatch between real distribution and statistics leads to cardinality estimation errors and bad plans.

Plan nodes and instrumentation

EXPLAIN ANALYZE shows planner’s chosen plan and runtime timing.

JIT (Just-In-Time) compilation

For expression evaluation and tuple deforming, Postgres can use LLVM (libgccjit or llvm) to generate machine code at planning time (since PostgreSQL 11-ish). Useful on heavy CPU workloads.

Plan caching

Prepared statements, PL/pgSQL functions, and generic plan caching; generic plans may be used for prepared statements with parameters — the planner may choose a generic or custom plan based on usage statistics.

Parallel planning/execution

PostgreSQL supports parallel sequential scans, parallel joins, and parallel aggregate with workers. There are constraints (e.g., not ...

Postagres Internals

PostgreSQL in 100 Seconds

Databases In-Depth – Complete Course

PostgreSQL vs MySQL

99% of Developers Don't Get PostgreSQL

PostgreSQL Internal Architecture Explained

Atlassian migrated 4 million postgres databases with ZERO downtime