PostgreSQL Internals — A Deep Dive

This article is a comprehensive, in-depth examination of PostgreSQL internals. It covers history, architecture, storage, transaction and concurrency control, indexing, query planning and execution, background processes, durability and recovery, replication, extensions, observability, performance tuning, and future directions. Where useful, SQL and C code snippets illustrate how to inspect and extend the server.

Table of contents

  • Introduction and brief history
  • High-level architecture
  • Storage layout: files, pages, tuples, TOAST
  • MVCC, tuple visibility, and transaction IDs
  • Locking, isolation levels, and Serializable Snapshot Isolation (SSI)
  • Write-Ahead Logging (WAL), checkpoints, recovery, and PITR
  • Buffer management and caching
  • Vacuum, autovacuum, bloat, and wraparound
  • Query planner and optimizer internals
  • Executor and runtime components
  • Index types and internals (B-tree, GIN, GiST, BRIN, SP-GiST)
  • Replication and high-availability (physical and logical)
  • Extensions, hooks, and background workers
  • Observability, statistics, and diagnostic tools
  • Performance tuning: key GUCs and practical knobs
  • Tools, debugging and reading WAL
  • Current state and future directions
  • Practical examples and code snippets
  • Conclusion and recommended reading

Introduction and brief history

PostgreSQL is a mature, open-source relational database system descended from the POSTGRES project at UC Berkeley (Stonebraker et al.). It has supported advanced features like extensibility, complex SQL, and MVCC for decades. PostgreSQL’s modern architecture evolved to emphasize correctness, extensibility, and a stable interface for extensions and storage.

Key historical milestones:

  • POSTGRES (1986–1994): object-relational research.
  • PostgreSQL 6.x–8.x: stabilization and SQL standard compliance.
  • PostgreSQL 9.x: streaming replication, hot standby, and logical decoding.
  • PostgreSQL 10+ : native logical replication, declarative partitioning improvements.
  • Recent: JIT compilation for expressions (libgccjit), parallel query improvements, improved partitioning, incremental improvements to replication and monitoring.

High-level architecture

PostgreSQL uses a multi-process architecture: each client connection is a separate OS process (a server process). There is a "postmaster" (now just the "postgres" parent process) that accepts connections and forks worker processes.

Key process types:

  • Backend processes: one per client connection.
  • Background worker processes:
    • checkpointer (checkpointer)
    • background writer (bgwriter)
    • wal writer (walwriter)
    • wal sender / wal receiver (replication)
    • autovacuum launcher and workers
    • stats collector
    • logical replication workers
    • archiver (optional)
  • Shared memory (Postgres shared buffers, locks, etc.)
  • Semaphores, spinlocks, lightweight locks (LWLocks)

Inter-process communication is via shared memory, semaphores, and signals. The server uses POSIX primitives where available.

Main components (logical):

  • Storage manager (heap, files, forks)
  • Buffer manager (shared_buffers)
  • WAL subsystem (XLOG)
  • Lock manager (lightweight locks for data structures; heavyweight locks for DDL)
  • Transaction manager (XID handling, subtransactions)
  • Query planner and optimizer
  • Executor
  • Logical replication & logical decoding

Storage layout: files, pages, tuples, TOAST

Physical storage:

  • Data stored in the cluster's data directory under base/ (default tablespaces), pg_tblspc (tablespaces), global/, pg_wal (WAL).
  • Each relation (table/index) is represented by one or more files (1GB segment limit). Several forks per relation:
    • main (heap/index)
    • fsm (free space map) — legacy in older versions, now replaced by FSM management in shared buffers? (Note: older versions used filesystem fsm fork)
    • vm (visibility map)
    • init (initial content)
  • Files are in a binary layout of pages (8 KB by default) called blocks.

Page format:

  • Page header
  • Item ID (line pointer) array
  • Tuple data area (grows from end toward the line pointer array)
  • Special space (index-specific)
  • Checksum (optional)

Heap tuple header includes:

  • t_xmin (inserting transaction id), t_xmax (deleting/locking TX)
  • t_cid (command id)
  • t_infomask bits (visibility flags, hint bits)
  • t_ctid (pointer to physical row version)
  • optional frozen/xmin/xmax markers

TOAST

  • "The Oversized-Attribute Storage Technique"; large attributes (TOAST-able columns like large text or bytea) are stored out-of-line in a TOAST table with compression and chunking. This keeps main heap pages smaller and reduces I/O for common operations.

Visibility map

  • A map recording pages that are all-visible (no tuples require visibility checks). Helps vacuum and index-only scans.

MVCC, tuple visibility, and transaction IDs

PostgreSQL implements MVCC by storing multiple versions of tuples (row versions) in-place (updates create new tuple versions in the same relation file). There is no centralized version store or delta; instead each tuple carries xmin/xmax.

Visibility logic

  • A backend uses a snapshot to decide which tuples are visible:
    • tuples with xmin committed before snapshot and not deleted before snapshot are visible
    • tuples with xmax set to a transaction committed before snapshot are invisible
  • Snapshot contains information: xmin, xmax, and an array of in-progress XIDs.

Transaction IDs (XIDs)

  • 32-bit wraparound: requires freezing of old rows via VACUUM to prevent incorrect visibility after wraparound.
  • Frozen transactions mark tuples as effectively immortal (use FrozenXID).
  • autovacuum/vacuum freeze aims to prevent XID wraparound.

Hint bits

  • To avoid consulting WAL on visibility repeatedly, backends set hint bits in the tuple header once a tuple status (committed/aborted) is known. These hint bits are non-logged (but can be set using WAL if WAL-logged hint bit updates are disabled? Typically hint bit changes are not WAL-logged; but their presence is safe because they encode info that can be recomputed).

CTID

  • After updates, the old row remains; the new version gets a new physical location. CTID refers to the physical location allowing e.g. internal navigation or updates by physical tuple reference.

Subtransactions (savepoints)

  • Implemented using sub-XIDs. When a subtransaction commits, its XIDs are converted into a top-level XID, with translation arrays.

Snapshot types

  • MVCC snapshot (current/older view)
  • Historic snapshots for logical decoding or serializable transactions.

Locking, isolation levels, and Serializable Snapshot Isolation (SSI)

Lock manager

  • Lightweight locks (LWLocks): protect buffer control structures and data structures inside PostgreSQL.
  • Heavyweight locks: relation-level locks (AccessShareLock, AccessExclusiveLock, etc.) used for DDL and concurrency control.
  • Tuple-level locking: Use of xmin/xmax and lock tuples via updating xmax with special lock parameters.

Isolation levels

  • Read Uncommitted: maps to Read Committed behavior (Postgres does not support dirty reads).
  • Read Committed: each statement sees a snapshot as of the start of the statement; new snapshot per statement.
  • Repeatable Read: fixed snapshot for the whole transaction (since Postgres 9.1, Repeatable Read used to be serializable? Now Postgres implements Serializable as SSI).
  • Serializable: implemented by Serializable Snapshot Isolation (SSI), which detects dangerous structures and aborts transactions that would violate serializability.

Serializable Snapshot Isolation (SSI)

  • Allows snapshot isolation semantics but detects write-read or read-write conflicts forming dangerous structures and aborts one of the transactions.
  • Postgres tracks read/write dependencies with predicate locking utilities and conflicts to maintain serializability.

Predicate locks and index-only locking

  • Postgres uses predicate locks logically for range queries via index-based predicate checks (e.g., during range scans) to detect phantoms under Serializable isolation. Implementation uses SIREAD locks (no physical lock is attached to tuples, it's recorded in a separate structure).

Two-phase commit

  • Prepare and commit for distributed transactions with external resource managers.

Lock monitoring

  • Views: pg_locks, pg_stat_activity, pg_stat_activity shows waiting/blocked states.

Write-Ahead Logging (WAL), checkpoints, recovery, and PITR

WAL fundamentals

  • WAL records every change needed to reconstruct data pages; writes WAL to disk before pages are flushed (WAL-first principle). This enables crash recovery.
  • WAL files live in pg_wal/ (or pg_xlog/ pre-9.5). WAL segments are 16 MB by default.

WAL insert and flush

  • XLogInsert builds a WAL record. WAL writer flushes groups of WAL records to disk (synchronous or asynchronous depending on commit settings).
  • Commit record ensures durability of transaction.

Checkpoints

  • Checkpointer scans buffers and writes dirty buffers to disk to make WAL replay shorter. Checkpoints write checkpoint WAL records and update pg_control.
  • Frequent checkpoints reduce recovery time but increase I/O.

Recovery and PITR

  • pg_basebackup or file copy for base backup.
  • Archive WAL segments (archive_command) for point-in-time recovery.
  • Recovery applies WAL records from base backup forward to reconstruct consistent state.
  • Timelines track branches from recovery.

Synchronous and asynchronous commit

  • synchronous_commit controls whether a transaction waits for WAL and/or commit to be flushed.
  • fsync and full_page_writes settings influence durability and performance.

Logical decoding (logical WAL)

  • WAL can be decoded into logical changes for replication or auditing using logical decoding plugins (wal2json, pgoutput). Logical decoding relies on consistent snapshots and retention of WAL via replication slots.

WAL configuration keys

  • wal_level (minimal, replica, logical)
  • max_wal_senders
  • wal_keep_size
  • max_replication_slots
  • checkpoint_timeout, checkpoint_completion_target
  • max_wal_size, min_wal_size
  • synchronous_commit

WAL and parallelism

  • WAL must serialize some changes, but many types of queries and index builds can be parallelized (with careful WAL behavior).

WAL archiving and streaming replication

  • WAL segments can be archived and replayed by standbys; replication can stream WAL.

Tools: pg_waldump, pg_xlogdump (older), pg_receivewal, logical decoding plugins.


Buffer management and caching

Shared buffers

  • shared_buffers holds cached disk pages (8 KB each).
  • Buffers are protected by buffer headers and lightweight locks (buf_hdr lock) and managed by a replacement strategy.

Replacement strategy

  • Historically a clock sweep algorithm with improvements. Buffer pins prevent eviction of pages in use.

Checkpoint and bgwriter

  • Background writer proactively writes dirty buffers to reduce checkpoint I/O spikes.
  • Checkpointer writes remaining dirty buffers during checkpoint.

OS cache

  • Postgres relies on OS page cache; shared_buffers is not the entire cache. There is interplay between shared_buffers and OS cache; recommended sizing depends on workload.

Buffer lifecycle considerations

  • Read: buffer read into shared_buffers -> pinned -> used -> unpinned.
  • Write: modification marks buffer dirty and WAL is emitted for correct page updates.

Buffer management internals

  • Buffer descriptor: keeps flags, refcount, lock, tag (block id), usage_count for eviction.

Vacuum, autovacuum, bloat, and wraparound

VACUUM

  • Reclaims space from dead tuples and makes tuples visible for reuse.
  • Vacuum updates visibility map and may set hint bits and freeze tuples.
  • VACUUM FULL rewrites the table, reclaiming physical disk space (exclusive lock).

Autovacuum

  • Background autovacuum workers run automatically to clean tables and prevent wraparound.
  • Autovacuum settings include autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, autovacuum_analyze_scale_factor, autovacuum_max_workers, autovacuum_naptime, autovacuum_freeze_max_age.

Bloat

  • Dead tuple accumulation increases table size and slows scans.
  • Regular autovacuum tuning and aggressive VACUUMs in high-update workloads reduce bloat.
  • Partitioning and HOT updates reduce bloat.

HOT (Heap-Only Tuple) updates

  • Update that creates a new tuple in the same page without updating indexes when index columns are unchanged, avoiding index maintenance and reducing WAL.
  • HOT chain: multiple HOT updates can chain, but can complicate vacuum.

Wraparound protection

  • Because XIDs are 32-bit, Postgres must freeze tuples older than a threshold; autovacuum+vacuum freeze handle this. If not handled, Postgres will force a shutdown to protect data.

Visibility map and Index-only scans

  • The visibility map flags entire pages as all-visible so index-only scans can be used without visiting heap pages.

Query planner and optimizer internals

Overview

  • Planner transforms SQL parse trees into query plans with nodes (SeqScan, IndexScan, HashJoin, MergeJoin, NestLoop, etc.). It estimates costs using a cost model.
  • Heuristics and dynamic programming create join orders and select join algorithms.

Statistics

  • PostgreSQL keeps per-column statistics: n_distinct, most_common_vals, histogram bounds, correlation, etc.
  • Collect statistics via ANALYZE; parameters like default_statistics_target, ALTER TABLE ... ALTER COLUMN SET STATISTICS.
  • Extended statistics: multi-column statistics (e.g., for correlated columns) to improve cardinality estimation.

Cost model

  • Cost parameters include:
    • cpu_tuple_cost
    • cpu_index_tuple_cost
    • random_page_cost
    • seq_page_cost
    • effective_cache_size (planner assumption about cached data)
  • The planner uses estimated rows × per-row cost + startup costs.

Join planning

  • Planner considers join orderings using dynamic search (for smaller join numbers) or genetic-like heuristics for larger joins (geqo).
  • Join algorithms:
    • Nested Loop: good for small inner or indexed inner scans.
    • Hash Join: build hash table on join key for inner, then probe.
    • Merge Join: requires sorted inputs (good when both sides are ordered).

Selectivity estimation

  • Planner estimates selectivity of predicates: equality, range, IN, LIKE, etc.
  • Mismatch between real distribution and statistics leads to cardinality estimation errors and bad plans.

Plan nodes and instrumentation

  • EXPLAIN ANALYZE shows planner’s chosen plan and runtime timing.

JIT (Just-In-Time) compilation

  • For expression evaluation and tuple deforming, Postgres can use LLVM (libgccjit or llvm) to generate machine code at planning time (since PostgreSQL 11-ish). Useful on heavy CPU workloads.

Plan caching

  • Prepared statements, PL/pgSQL functions, and generic plan caching; generic plans may be used for prepared statements with parameters — the planner may choose a generic or custom plan based on usage statistics.

Parallel planning/execution

  • PostgreSQL supports parallel sequential scans, parallel joins, and parallel aggregate with workers. There are constraints (e.g., not all functions are parallel-safe).

Planner hooks

  • Extension points exist to influence planning (planner hooks) for custom behavior.

Executor and runtime components

Executor responsibilities

  • Traverse plan tree to execute operations and produce tuples.
  • Executor nodes implement the logic for each node type and maintain state.

Tuple slot mechanism

  • Tuples are passed between nodes via TupleTableSlot, allowing abstraction over physical tuple types.

Memory contexts

  • PostgreSQL uses a hierarchical memory context allocator to manage allocations with automatic deallocation at context reset. Typical contexts: per-query, per-transaction, per-executor node.

Tuplesort and tuplestore

  • For sorts and materialization, PostgreSQL uses Tuplesort (external sort) and Tuplestore (write to disk if needed).
  • Work memory (work_mem) affects the threshold to spill to disk.

Expression evaluation

  • ExecEval* functions evaluate expressions; expression trees are compiled into executable structures at plan time.
  • Function calls use fmgr interface and can be marked parallel-safe / leak-proof etc.

Triggers and constraints

  • Triggers and constraints are enforced by executor at the right points (BEFORE, AFTER).
  • Foreign key actions may require extra locking and checking.

SPI (Server Programming Interface)

  • Internal API for C-language functions to run SQL statements in server context.

Executor instrumentation

  • Execution statistics can be exposed via EXPLAIN ANALYZE and pg_stat_statements.

Index types and internals

PostgreSQL supports multiple index types, each with different tradeoffs.

B-tree (default)

  • Balanced tree of pages; supports equality and range scans.
  • Split and deletion handling; write-ahead-logged operations.
  • Page-level structure with leaf pages, internal pages, and root.
  • Concurrency via WAL and locks; VACUUM cleans dead index entries (POSTGRES has index vacuum to remove entries pointing to dead tuples).
  • MVCC complicates index maintenance: index entries point to heaps that may be dead.

GIN (Generalized Inverted Index)

  • Good for indexing multi-valued items (e.g., array elements, full text).
  • Has posting lists and uses a posting tree for large lists.
  • Supports fast retrieval of matching items with OR/AND semantics.
  • More complex WAL behavior and VACUUM.

GiST (Generalized Search Tree)

  • A framework for tree-based indexes like R-tree, KNN on geometric data, etc.
  • Supports custom consistent functions and penalties for split heuristics.
  • Not as strongly balanced as B-tree; relies on recheck.

BRIN (Block Range Index)

  • Block-range index storing summaries per range (min/max).
  • Extremely small index size but useful for naturally ordered data (e.g., time).
  • Cheap to maintain, low I/O benefit when data correlated.

SP-GiST (Space-partitioned GiST)

  • Space-partitioned tree, efficient for some data distributions.

Index-only scans

  • Use visibility map to avoid heap fetch if page marked all-visible.

Concurrent index builds

  • Create index CONCURRENTLY to avoid exclusive locks; slower but allows reads/writes during creation.

Index maintenance

  • Insert/update: index entries appended; updates might require deletion of old index entries and insertion of new ones.
  • Index-only scans and vacuuming interact with visibility and bloat.

Replication and high-availability (physical and logical)

Physical replication (streaming)

  • WAL-based streaming replication: standby continuously receives and applies WAL records.
  • Synchronous replication: commit on primary waits for standby confirmation (zero data loss when configured).
  • Hot standby: standby can accept read-only queries while applying WAL.

Physical base backup

  • pg_basebackup or file-based copy while WAL archiving ensures consistent base backups.

Replication features & config

  • wal_level (replica, logical)
  • max_wal_senders
  • primary_conninfo (on standby)
  • recovery.signal/recovery.conf (older versions)

Logical replication

  • Publishes changes at logical level via WAL decoding and applies them to subscribers via logical replication protocol.
  • row-level replication by default; can replicate selected tables.
  • Replication slots: keep WAL required for logical decoding from being removed.

Logical decoding

  • Plugins (pgoutput, wal2json) decode WAL into logical changes.
  • Requires wal_level=logical and appropriate snapshot handling.

Cascading replication and multi-primary

  • Cascading standby can stream from downstream standby.
  • BDR and other tools provide multi-master synchronization (third-party or external extensions).

Conflict resolution

  • Logical replication is sensitive to conflicts; subscribers may need conflict detection/resolution logic.

Failover and switchover

  • Tools: repmgr, Patroni, Pacemaker, pg_auto_failover for orchestrating failover/switchover.

Extensions, hooks, and background workers

Extension architecture

  • Extensions can be packaged with control files and SQL script files and installed via CREATE EXTENSION.
  • PGXS lets C extensions be compiled and distributed.

Loadable modules

  • .so libraries loaded by shared_preload_libraries or via CREATE FUNCTION ... AS 'library'.

Server programming interfaces

  • API to build functions, index access methods, background workers, custom scan nodes, planner hooks, executor hooks, utility hooks.

Background workers

  • Extensions can register background workers using bgworker API to run persistent tasks (e.g., logical replication connectors, monitoring agents).

Key extension examples

  • pg_stat_statements: tracks statement execution statistics.
  • PostGIS: adds geospatial types, operators, and index support (GiST, GIN).
  • citus: sharding and distributed query processing (needs both DB and extension).
  • pgaudit: auditing.

Security considerations

  • Extensions loaded into server process must be trusted or controlled since they execute native code.

Example skeleton for C extension (simplified):

Plain Text
1#include "postgres.h" 2#include "fmgr.h" 3 4PG_MODULE_MAGIC; 5 6PG_FUNCTION_INFO_V1(hello_world); 7 8Datum 9hello_world(PG_FUNCTION_ARGS) 10{ 11 text *t = cstring_to_text("Hello, PostgreSQL extension world!"); 12 PG_RETURN_TEXT_P(t); 13}

Build with PGXS and install via CREATE FUNCTION linking to the shared library.


Observability, statistics, and diagnostic tools

System views and functions:

  • pg_stat_activity — current sessions and queries
  • pg_stat_database — database-level statistics (xact_commit, xact_rollback, blks_read, blks_hit)
  • pg_stat_statements (extension) — per-statement cumulative stats
  • pg_locks — current locks
  • pg_statio_* — I/O statistics for tables and indices
  • pg_stat_user_tables, pg_stat_user_indexes — per-object statistics
  • pg_stat_bgwriter — bgwriter/checkpoint statistics
  • pg_stat_replication — replication sender info
  • pg_visibility, pageinspect extension — inspect page contents

Logging

  • postgresql.conf: logging_collector, log_min_duration_statement, log_statement, log_line_prefix
  • log_parser and other external tools parse logs for analysis.

Profilers and performance tools

  • EXPLAIN ANALYZE — runtime execution details and per-node timing and rows.
  • auto_explain extension — log slow plans automatically.
  • Perf, eBPF, DTrace (where supported) for OS-level profiling.
  • pgbadger — parses logs to create reports.
  • pgFouine, pganalyze, pgmetrics, pg_stat_monitor (alternative to pg_stat_statements).

Inspecting internal state

  • pg_waldump — inspect WAL segments
  • pg_xlogdump (older)
  • pg_basebackup for base backups
  • pg_controldata for cluster control information

Performance tuning: key GUCs and practical knobs

Memory

  • shared_buffers: memory for shared buffer pool (commonly 25% of RAM, but depends on workload)
  • effective_cache_size: planner estimate of available OS cache
  • work_mem: per-sort/operation memory; controls whether sorts/hash build spill to disk
  • maintenance_work_mem: used for VACUUM, CREATE INDEX, etc.

WAL and durability

  • checkpoint_timeout, checkpoint_completion_target
  • wal_buffers: size for WAL insertion
  • max_wal_size and min_wal_size: control checkpoint frequency (modern knobs to reduce checkpoint storms)
  • synchronous_commit: controls how strict durability is (on/off/local)
  • commit_delay/commit_siblings: low-level tuning for commit grouping

Autovacuum

  • autovacuum_vacuum_threshold/scale, autovacuum_max_workers, autovacuum_vacuum_cost_limit, autovacuum_vacuum_cost_delay.

I/O and disk

  • random_page_cost/seq_page_cost: affects planner cost preferences
  • use of SSDs reduces random_page_cost significantly
  • effective_io_concurrency (on some platforms) for parallel I/O during index builds and vacuum

Connections

  • max_connections and connection pooling: consider PgBouncer/ PgPool to mitigate process-per-connection overhead.

CPU and parallelism

  • max_worker_processes
  • max_parallel_workers_per_gather
  • max_parallel_workers
  • parallel_setup_cost, parallel_tuple_cost

Index usage and maintenance

  • consider partial indexes, psql ANALYZE frequency, targeted indexes (covering + index-only scans), expressional indexes.

Practical tuning approach

  • Start with baseline (reasonable shared_buffers, autovacuum on)
  • Address queries with EXPLAIN ANALYZE
  • Tune planner parameters (effective_cache_size and random_page_cost) only after measuring
  • Use connection pooling for high-concurrency workloads
  • Monitor autovacuum, bloat, and tune vacuum cost settings when autovacuum is too slow.

Tools, debugging and reading WAL

Reading WAL

  • pg_waldump: read WAL segments in human-friendly form.
  • pg_recvlogical / logical decoding plugins for logical replication output.
  • pg_xlogdump (older) provides similar inspection.

Inspecting pages

  • pageinspect extension to view raw page contents, tuples, and tuples status.

Disk debugging

  • psql queries:
SQL
1-- Check current WAL write status 2SELECT * FROM pg_stat_bgwriter; 3 4-- Check locks and blocking 5SELECT blocked_locks.pid AS blocked_pid, 6 blocked_activity.usename AS blocked_user, 7 blocking_locks.pid AS blocking_pid, 8 blocking_activity.usename AS blocking_user 9FROM pg_locks blocked_locks 10JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid 11JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database 12JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid 13WHERE NOT blocked_locks.granted 14 AND blocking_locks.granted;

Analyzing queries

  • EXPLAIN (ANALYZE, BUFFERS, VERBOSE) to capture planner vs executor behavior.

WAL archiving test

  • Configure archive_command to copy WAL files to a safe location and test recovery with pg_restore and recovery.conf or recovery.signal.

Crash simulation

  • Use abrupt kill of server process and check WAL-based crash recovery upon restart.

Profiling

  • use perf, eBPF traces, or lttng to get OS-level timelines.

Current state and future directions

State as of 2024 (broad strokes)

  • Continual improvements in parallel query, logical replication, partitioning, and JIT provide better throughput.
  • Indexing and storage optimization (BRIN improvements, bloom indexes) support large analytical datasets.
  • Increasing focus on observability and cloud-native replication tooling (pglogical, built-in logical replication).
  • Performance improvements and features like incremental sorting, deduplication, and optimized memory management are ongoing.

Research/future directions and experiments

  • Alternatives to heap such as ZHeap (research/experimental) to reduce bloat and improve visibility management.
  • Improved cardinality estimation using better statistics, extended histograms, or machine learning approaches.
  • Transparent sharding and distributed SQL (e.g., Citus) pushing PG internals towards distributed cluster-awareness.
  • More robust multi-master replication alternatives (BDR, decentralized CRDT-based approaches).
  • Further JIT and LLVM optimization adoption and pervasively faster expression evaluation.
  • Row-level security and more fine-grained enforcement.

Practical examples and code snippets

Query patterns to inspect internals:

  1. Check active queries and blocking:
SQL
SELECT pid, usename, state, query_start, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state <> 'idle';
  1. Check lock statuses:
SQL
1SELECT pid, mode, granted, relation::regclass 2FROM pg_locks l LEFT JOIN pg_class c ON l.relation = c.oid 3WHERE database = (SELECT oid FROM pg_database WHERE datname = current_database()) 4ORDER BY pid;
  1. Observe WAL and checkpoint information:
SQL
SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time FROM pg_stat_bgwriter;
  1. Use EXPLAIN ANALYZE with buffers:
SQL
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM orders WHERE order_date BETWEEN '2023-01-01' AND '2023-01-31';
  1. Create and use a replication slot:
SQL
1-- on primary 2SELECT * FROM pg_create_physical_replication_slot('my_slot'); 3 4-- or for logical 5SELECT * FROM pg_create_logical_replication_slot('my_log_slot', 'pgoutput');
  1. Base backup with pg_basebackup:
Bash
pg_basebackup -h primary -D /var/lib/postgresql/14/main -P -X stream
  1. C extension basic example (see earlier minimal code). Example control file for extension:
Plain Text
comment = 'Hello world extension' default_version = '1.0' module_pathname = '$libdir/hello_world'
  1. Inspect pages with pageinspect:
SQL
CREATE EXTENSION pageinspect; SELECT * FROM heap_page_items(get_raw_page('my_table', 0));
  1. Dump WAL:
Bash
pg_waldump /var/lib/postgresql/14/main/pg_wal/00000001000000000000000A
  1. Logical decoding with wal2json (sample use):
Bash
pg_recvlogical -d mydb -S myslot --start -f - # will print JSON events as they occur

Conclusion

PostgreSQL internals are rich and carefully engineered. They balance correctness and performance through mechanisms like MVCC, WAL, a robust planner and executor, a range of index types, and an extensible architecture. Understanding these internals helps DBAs and developers tune systems, diagnose problems, and build powerful extensions.

For further reading, practical experimentation is encouraged: use EXPLAIN ANALYZE, pg_stat views, pg_waldump, and source code exploration. The PostgreSQL community and documentation provide detailed references, design notes, and mailing lists for deeper exploration.


  • The PostgreSQL Documentation (Storage, WAL, MVCC, Planner/Executor chapters)
  • PostgreSQL source code (src/backend)
  • PostgreSQL Internals SIG and developer docs
  • Books:
    • "The Internals of PostgreSQL" (H. K. Vandersypen / various authors online)
    • "PostgreSQL: Up and Running" and "Mastering PostgreSQL"
  • Research papers from the POSTGRES group (Stonebraker) and MVCC serializability papers (SSI)
  • Mailing lists and commit history for new features

If you want, I can:

  • Walk through a specific area in even more detail (e.g., WAL format and record types, XLogInsert internals).
  • Provide a guided walkthrough of optimizing a problematic query with EXPLAIN ANALYZE and tuning suggestions.
  • Show a complete C extension example, including Makefile and control file.