Postagres Internals

May 12, 2026··

21 min read

PostgreSQL Internals — A Deep Dive

This article is a comprehensive, in-depth examination of PostgreSQL internals. It covers history, architecture, storage, transaction and concurrency control, indexing, query planning and execution, background processes, durability and recovery, replication, extensions, observability, performance tuning, and future directions. Where useful, SQL and C code snippets illustrate how to inspect and extend the server.

Table of contents

Introduction and brief history
High-level architecture
Storage layout: files, pages, tuples, TOAST
MVCC, tuple visibility, and transaction IDs
Locking, isolation levels, and Serializable Snapshot Isolation (SSI)
Write-Ahead Logging (WAL), checkpoints, recovery, and PITR
Buffer management and caching
Vacuum, autovacuum, bloat, and wraparound
Query planner and optimizer internals
Executor and runtime components
Index types and internals (B-tree, GIN, GiST, BRIN, SP-GiST)
Replication and high-availability (physical and logical)
Extensions, hooks, and background workers
Observability, statistics, and diagnostic tools
Performance tuning: key GUCs and practical knobs
Tools, debugging and reading WAL
Current state and future directions
Practical examples and code snippets
Conclusion and recommended reading

Introduction and brief history

PostgreSQL is a mature, open-source relational database system descended from the POSTGRES project at UC Berkeley (Stonebraker et al.). It has supported advanced features like extensibility, complex SQL, and MVCC for decades. PostgreSQL’s modern architecture evolved to emphasize correctness, extensibility, and a stable interface for extensions and storage.

Key historical milestones:

POSTGRES (1986–1994): object-relational research.
PostgreSQL 6.x–8.x: stabilization and SQL standard compliance.
PostgreSQL 9.x: streaming replication, hot standby, and logical decoding.
PostgreSQL 10+ : native logical replication, declarative partitioning improvements.
Recent: JIT compilation for expressions (libgccjit), parallel query improvements, improved partitioning, incremental improvements to replication and monitoring.

High-level architecture

PostgreSQL uses a multi-process architecture: each client connection is a separate OS process (a server process). There is a "postmaster" (now just the "postgres" parent process) that accepts connections and forks worker processes.

Key process types:

Backend processes: one per client connection.
Background worker processes:
- checkpointer (checkpointer)
- background writer (bgwriter)
- wal writer (walwriter)
- wal sender / wal receiver (replication)
- autovacuum launcher and workers
- stats collector
- logical replication workers
- archiver (optional)
Shared memory (Postgres shared buffers, locks, etc.)
Semaphores, spinlocks, lightweight locks (LWLocks)

Inter-process communication is via shared memory, semaphores, and signals. The server uses POSIX primitives where available.

Main components (logical):

Storage manager (heap, files, forks)
Buffer manager (shared_buffers)
WAL subsystem (XLOG)
Lock manager (lightweight locks for data structures; heavyweight locks for DDL)
Transaction manager (XID handling, subtransactions)
Query planner and optimizer
Executor
Logical replication & logical decoding

Storage layout: files, pages, tuples, TOAST

Physical storage:

Data stored in the cluster's data directory under base/ (default tablespaces), pg_tblspc (tablespaces), global/, pg_wal (WAL).
Each relation (table/index) is represented by one or more files (1GB segment limit). Several forks per relation:
- main (heap/index)
- fsm (free space map) — legacy in older versions, now replaced by FSM management in shared buffers? (Note: older versions used filesystem fsm fork)
- vm (visibility map)
- init (initial content)
Files are in a binary layout of pages (8 KB by default) called blocks.

Page format:

Page header
Item ID (line pointer) array
Tuple data area (grows from end toward the line pointer array)
Special space (index-specific)
Checksum (optional)

Heap tuple header includes:

t_xmin (inserting transaction id), t_xmax (deleting/locking TX)
t_cid (command id)
t_infomask bits (visibility flags, hint bits)
t_ctid (pointer to physical row version)
optional frozen/xmin/xmax markers

TOAST

"The Oversized-Attribute Storage Technique"; large attributes (TOAST-able columns like large text or bytea) are stored out-of-line in a TOAST table with compression and chunking. This keeps main heap pages smaller and reduces I/O for common operations.

Visibility map

A map recording pages that are all-visible (no tuples require visibility checks). Helps vacuum and index-only scans.

MVCC, tuple visibility, and transaction IDs

PostgreSQL implements MVCC by storing multiple versions of tuples (row versions) in-place (updates create new tuple versions in the same relation file). There is no centralized version store or delta; instead each tuple carries xmin/xmax.

Visibility logic

A backend uses a snapshot to decide which tuples are visible:
- tuples with xmin committed before snapshot and not deleted before snapshot are visible
- tuples with xmax set to a transaction committed before snapshot are invisible
Snapshot contains information: xmin, xmax, and an array of in-progress XIDs.

Transaction IDs (XIDs)

32-bit wraparound: requires freezing of old rows via VACUUM to prevent incorrect visibility after wraparound.
Frozen transactions mark tuples as effectively immortal (use FrozenXID).
autovacuum/vacuum freeze aims to prevent XID wraparound.

Hint bits

To avoid consulting WAL on visibility repeatedly, backends set hint bits in the tuple header once a tuple status (committed/aborted) is known. These hint bits are non-logged (but can be set using WAL if WAL-logged hint bit updates are disabled? Typically hint bit changes are not WAL-logged; but their presence is safe because they encode info that can be recomputed).

CTID

After updates, the old row remains; the new version gets a new physical location. CTID refers to the physical location allowing e.g. internal navigation or updates by physical tuple reference.

Subtransactions (savepoints)

Implemented using sub-XIDs. When a subtransaction commits, its XIDs are converted into a top-level XID, with translation arrays.

Snapshot types

MVCC snapshot (current/older view)
Historic snapshots for logical decoding or serializable transactions.

Locking, isolation levels, and Serializable Snapshot Isolation (SSI)

Lock manager

Lightweight locks (LWLocks): protect buffer control structures and data structures inside PostgreSQL.
Heavyweight locks: relation-level locks (AccessShareLock, AccessExclusiveLock, etc.) used for DDL and concurrency control.
Tuple-level locking: Use of xmin/xmax and lock tuples via updating xmax with special lock parameters.

Isolation levels

Read Uncommitted: maps to Read Committed behavior (Postgres does not support dirty reads).
Read Committed: each statement sees a snapshot as of the start of the statement; new snapshot per statement.
Repeatable Read: fixed snapshot for the whole transaction (since Postgres 9.1, Repeatable Read used to be serializable? Now Postgres implements Serializable as SSI).
Serializable: implemented by Serializable Snapshot Isolation (SSI), which detects dangerous structures and aborts transactions that would violate serializability.

Serializable Snapshot Isolation (SSI)

Allows snapshot isolation semantics but detects write-read or read-write conflicts forming dangerous structures and aborts one of the transactions.
Postgres tracks read/write dependencies with predicate locking utilities and conflicts to maintain serializability.

Predicate locks and index-only locking

Postgres uses predicate locks logically for range queries via index-based predicate checks (e.g., during range scans) to detect phantoms under Serializable isolation. Implementation uses SIREAD locks (no physical lock is attached to tuples, it's recorded in a separate structure).

Two-phase commit

Prepare and commit for distributed transactions with external resource managers.

Lock monitoring

Views: pg_locks, pg_stat_activity, pg_stat_activity shows waiting/blocked states.

Write-Ahead Logging (WAL), checkpoints, recovery, and PITR

WAL fundamentals

WAL records every change needed to reconstruct data pages; writes WAL to disk before pages are flushed (WAL-first principle). This enables crash recovery.
WAL files live in pg_wal/ (or pg_xlog/ pre-9.5). WAL segments are 16 MB by default.

WAL insert and flush

XLogInsert builds a WAL record. WAL writer flushes groups of WAL records to disk (synchronous or asynchronous depending on commit settings).
Commit record ensures durability of transaction.

Checkpoints

Checkpointer scans buffers and writes dirty buffers to disk to make WAL replay shorter. Checkpoints write checkpoint WAL records and update pg_control.
Frequent checkpoints reduce recovery time but increase I/O.

Recovery and PITR

pg_basebackup or file copy for base backup.
Archive WAL segments (archive_command) for point-in-time recovery.
Recovery applies WAL records from base backup forward to reconstruct consistent state.
Timelines track branches from recovery.

Synchronous and asynchronous commit

synchronous_commit controls whether a transaction waits for WAL and/or commit to be flushed.
fsync and full_page_writes settings influence durability and performance.

Logical decoding (logical WAL)

WAL can be decoded into logical changes for replication or auditing using logical decoding plugins (wal2json, pgoutput). Logical decoding relies on consistent snapshots and retention of WAL via replication slots.

WAL configuration keys

wal_level (minimal, replica, logical)
max_wal_senders
wal_keep_size
max_replication_slots
checkpoint_timeout, checkpoint_completion_target
max_wal_size, min_wal_size
synchronous_commit

WAL and parallelism

WAL must serialize some changes, but many types of queries and index builds can be parallelized (with careful WAL behavior).

WAL archiving and streaming replication

WAL segments can be archived and replayed by standbys; replication can stream WAL.

Tools: pg_waldump, pg_xlogdump (older), pg_receivewal, logical decoding plugins.

Buffer management and caching

Shared buffers

shared_buffers holds cached disk pages (8 KB each).
Buffers are protected by buffer headers and lightweight locks (buf_hdr lock) and managed by a replacement strategy.

Replacement strategy

Historically a clock sweep algorithm with improvements. Buffer pins prevent eviction of pages in use.

Checkpoint and bgwriter

Background writer proactively writes dirty buffers to reduce checkpoint I/O spikes.
Checkpointer writes remaining dirty buffers during checkpoint.

OS cache

Postgres relies on OS page cache; shared_buffers is not the entire cache. There is interplay between shared_buffers and OS cache; recommended sizing depends on workload.

Buffer lifecycle considerations

Read: buffer read into shared_buffers -> pinned -> used -> unpinned.
Write: modification marks buffer dirty and WAL is emitted for correct page updates.

Buffer management internals

Buffer descriptor: keeps flags, refcount, lock, tag (block id), usage_count for eviction.

Vacuum, autovacuum, bloat, and wraparound

VACUUM

Reclaims space from dead tuples and makes tuples visible for reuse.
Vacuum updates visibility map and may set hint bits and freeze tuples.
VACUUM FULL rewrites the table, reclaiming physical disk space (exclusive lock).

Autovacuum

Background autovacuum workers run automatically to clean tables and prevent wraparound.
Autovacuum settings include autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, autovacuum_analyze_scale_factor, autovacuum_max_workers, autovacuum_naptime, autovacuum_freeze_max_age.

Bloat

Dead tuple accumulation increases table size and slows scans.
Regular autovacuum tuning and aggressive VACUUMs in high-update workloads reduce bloat.
Partitioning and HOT updates reduce bloat.

HOT (Heap-Only Tuple) updates

Update that creates a new tuple in the same page without updating indexes when index columns are unchanged, avoiding index maintenance and reducing WAL.
HOT chain: multiple HOT updates can chain, but can complicate vacuum.

Wraparound protection

Because XIDs are 32-bit, Postgres must freeze tuples older than a threshold; autovacuum+vacuum freeze handle this. If not handled, Postgres will force a shutdown to protect data.

Visibility map and Index-only scans

The visibility map flags entire pages as all-visible so index-only scans can be used without visiting heap pages.

Query planner and optimizer internals

Overview

Planner transforms SQL parse trees into query plans with nodes (SeqScan, IndexScan, HashJoin, MergeJoin, NestLoop, etc.). It estimates costs using a cost model.
Heuristics and dynamic programming create join orders and select join algorithms.

Statistics

PostgreSQL keeps per-column statistics: n_distinct, most_common_vals, histogram bounds, correlation, etc.
Collect statistics via ANALYZE; parameters like default_statistics_target, ALTER TABLE ... ALTER COLUMN SET STATISTICS.
Extended statistics: multi-column statistics (e.g., for correlated columns) to improve cardinality estimation.

Cost model

Cost parameters include:
- cpu_tuple_cost
- cpu_index_tuple_cost
- random_page_cost
- seq_page_cost
- effective_cache_size (planner assumption about cached data)
The planner uses estimated rows × per-row cost + startup costs.

Join planning

Planner considers join orderings using dynamic search (for smaller join numbers) or genetic-like heuristics for larger joins (geqo).
Join algorithms:
- Nested Loop: good for small inner or indexed inner scans.
- Hash Join: build hash table on join key for inner, then probe.
- Merge Join: requires sorted inputs (good when both sides are ordered).

Selectivity estimation

Planner estimates selectivity of predicates: equality, range, IN, LIKE, etc.
Mismatch between real distribution and statistics leads to cardinality estimation errors and bad plans.

Plan nodes and instrumentation

EXPLAIN ANALYZE shows planner’s chosen plan and runtime timing.

JIT (Just-In-Time) compilation

For expression evaluation and tuple deforming, Postgres can use LLVM (libgccjit or llvm) to generate machine code at planning time (since PostgreSQL 11-ish). Useful on heavy CPU workloads.

Plan caching

Prepared statements, PL/pgSQL functions, and generic plan caching; generic plans may be used for prepared statements with parameters — the planner may choose a generic or custom plan based on usage statistics.

Parallel planning/execution

PostgreSQL supports parallel sequential scans, parallel joins, and parallel aggregate with workers. There are constraints (e.g., not all functions are parallel-safe).

Planner hooks

Extension points exist to influence planning (planner hooks) for custom behavior.

Executor and runtime components

Executor responsibilities

Traverse plan tree to execute operations and produce tuples.
Executor nodes implement the logic for each node type and maintain state.

Tuple slot mechanism

Tuples are passed between nodes via TupleTableSlot, allowing abstraction over physical tuple types.

Memory contexts

PostgreSQL uses a hierarchical memory context allocator to manage allocations with automatic deallocation at context reset. Typical contexts: per-query, per-transaction, per-executor node.

Tuplesort and tuplestore

For sorts and materialization, PostgreSQL uses Tuplesort (external sort) and Tuplestore (write to disk if needed).
Work memory (work_mem) affects the threshold to spill to disk.

Expression evaluation

ExecEval* functions evaluate expressions; expression trees are compiled into executable structures at plan time.
Function calls use fmgr interface and can be marked parallel-safe / leak-proof etc.

Triggers and constraints

Triggers and constraints are enforced by executor at the right points (BEFORE, AFTER).
Foreign key actions may require extra locking and checking.

SPI (Server Programming Interface)

Internal API for C-language functions to run SQL statements in server context.

Executor instrumentation

Execution statistics can be exposed via EXPLAIN ANALYZE and pg_stat_statements.

Index types and internals

PostgreSQL supports multiple index types, each with different tradeoffs.

B-tree (default)

Balanced tree of pages; supports equality and range scans.
Split and deletion handling; write-ahead-logged operations.
Page-level structure with leaf pages, internal pages, and root.
Concurrency via WAL and locks; VACUUM cleans dead index entries (POSTGRES has index vacuum to remove entries pointing to dead tuples).
MVCC complicates index maintenance: index entries point to heaps that may be dead.

GIN (Generalized Inverted Index)

Good for indexing multi-valued items (e.g., array elements, full text).
Has posting lists and uses a posting tree for large lists.
Supports fast retrieval of matching items with OR/AND semantics.
More complex WAL behavior and VACUUM.

GiST (Generalized Search Tree)

A framework for tree-based indexes like R-tree, KNN on geometric data, etc.
Supports custom consistent functions and penalties for split heuristics.
Not as strongly balanced as B-tree; relies on recheck.

BRIN (Block Range Index)

Block-range index storing summaries per range (min/max).
Extremely small index size but useful for naturally ordered data (e.g., time).
Cheap to maintain, low I/O benefit when data correlated.

SP-GiST (Space-partitioned GiST)

Space-partitioned tree, efficient for some data distributions.

Index-only scans

Use visibility map to avoid heap fetch if page marked all-visible.

Concurrent index builds

Create index CONCURRENTLY to avoid exclusive locks; slower but allows reads/writes during creation.

Index maintenance

Insert/update: index entries appended; updates might require deletion of old index entries and insertion of new ones.
Index-only scans and vacuuming interact with visibility and bloat.

Replication and high-availability (physical and logical)

Physical replication (streaming)

WAL-based streaming replication: standby continuously receives and applies WAL records.
Synchronous replication: commit on primary waits for standby confirmation (zero data loss when configured).
Hot standby: standby can accept read-only queries while applying WAL.

Physical base backup

pg_basebackup or file-based copy while WAL archiving ensures consistent base backups.

Replication features & config

wal_level (replica, logical)
max_wal_senders
primary_conninfo (on standby)
recovery.signal/recovery.conf (older versions)

Logical replication

Publishes changes at logical level via WAL decoding and applies them to subscribers via logical replication protocol.
row-level replication by default; can replicate selected tables.
Replication slots: keep WAL required for logical decoding from being removed.

Logical decoding

Plugins (pgoutput, wal2json) decode WAL into logical changes.
Requires wal_level=logical and appropriate snapshot handling.

Cascading replication and multi-primary

Cascading standby can stream from downstream standby.
BDR and other tools provide multi-master synchronization (third-party or external extensions).

Conflict resolution

Logical replication is sensitive to conflicts; subscribers may need conflict detection/resolution logic.

Failover and switchover

Tools: repmgr, Patroni, Pacemaker, pg_auto_failover for orchestrating failover/switchover.

Extensions, hooks, and background workers

Extension architecture

Extensions can be packaged with control files and SQL script files and installed via CREATE EXTENSION.
PGXS lets C extensions be compiled and distributed.

Loadable modules

.so libraries loaded by shared_preload_libraries or via CREATE FUNCTION ... AS 'library'.

Server programming interfaces

API to build functions, index access methods, background workers, custom scan nodes, planner hooks, executor hooks, utility hooks.

Background workers

Extensions can register background workers using bgworker API to run persistent tasks (e.g., logical replication connectors, monitoring agents).

Key extension examples

pg_stat_statements: tracks statement execution statistics.
PostGIS: adds geospatial types, operators, and index support (GiST, GIN).
citus: sharding and distributed query processing (needs both DB and extension).
pgaudit: auditing.

Security considerations

Extensions loaded into server process must be trusted or controlled since they execute native code.

Example skeleton for C extension (simplified):

Plain Text

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(hello_world);

Datum
hello_world(PG_FUNCTION_ARGS)
{
    text *t = cstring_to_text("Hello, PostgreSQL extension world!");
    PG_RETURN_TEXT_P(t);
}

Build with PGXS and install via CREATE FUNCTION linking to the shared library.

Observability, statistics, and diagnostic tools

System views and functions:

pg_stat_activity — current sessions and queries
pg_stat_database — database-level statistics (xact_commit, xact_rollback, blks_read, blks_hit)
pg_stat_statements (extension) — per-statement cumulative stats
pg_locks — current locks
pg_statio_* — I/O statistics for tables and indices
pg_stat_user_tables, pg_stat_user_indexes — per-object statistics
pg_stat_bgwriter — bgwriter/checkpoint statistics
pg_stat_replication — replication sender info
pg_visibility, pageinspect extension — inspect page contents

Logging

postgresql.conf: logging_collector, log_min_duration_statement, log_statement, log_line_prefix
log_parser and other external tools parse logs for analysis.

Profilers and performance tools

EXPLAIN ANALYZE — runtime execution details and per-node timing and rows.
auto_explain extension — log slow plans automatically.
Perf, eBPF, DTrace (where supported) for OS-level profiling.
pgbadger — parses logs to create reports.
pgFouine, pganalyze, pgmetrics, pg_stat_monitor (alternative to pg_stat_statements).

Inspecting internal state

pg_waldump — inspect WAL segments
pg_xlogdump (older)
pg_basebackup for base backups
pg_controldata for cluster control information

Performance tuning: key GUCs and practical knobs

Memory

shared_buffers: memory for shared buffer pool (commonly 25% of RAM, but depends on workload)
effective_cache_size: planner estimate of available OS cache
work_mem: per-sort/operation memory; controls whether sorts/hash build spill to disk
maintenance_work_mem: used for VACUUM, CREATE INDEX, etc.

WAL and durability

checkpoint_timeout, checkpoint_completion_target
wal_buffers: size for WAL insertion
max_wal_size and min_wal_size: control checkpoint frequency (modern knobs to reduce checkpoint storms)
synchronous_commit: controls how strict durability is (on/off/local)
commit_delay/commit_siblings: low-level tuning for commit grouping

Autovacuum

autovacuum_vacuum_threshold/scale, autovacuum_max_workers, autovacuum_vacuum_cost_limit, autovacuum_vacuum_cost_delay.

I/O and disk

random_page_cost/seq_page_cost: affects planner cost preferences
use of SSDs reduces random_page_cost significantly
effective_io_concurrency (on some platforms) for parallel I/O during index builds and vacuum

Connections

max_connections and connection pooling: consider PgBouncer/ PgPool to mitigate process-per-connection overhead.

CPU and parallelism

max_worker_processes
max_parallel_workers_per_gather
max_parallel_workers
parallel_setup_cost, parallel_tuple_cost

Index usage and maintenance

consider partial indexes, psql ANALYZE frequency, targeted indexes (covering + index-only scans), expressional indexes.

Practical tuning approach

Start with baseline (reasonable shared_buffers, autovacuum on)
Address queries with EXPLAIN ANALYZE
Tune planner parameters (effective_cache_size and random_page_cost) only after measuring
Use connection pooling for high-concurrency workloads
Monitor autovacuum, bloat, and tune vacuum cost settings when autovacuum is too slow.

Tools, debugging and reading WAL

Reading WAL

pg_waldump: read WAL segments in human-friendly form.
pg_recvlogical / logical decoding plugins for logical replication output.
pg_xlogdump (older) provides similar inspection.

Inspecting pages

pageinspect extension to view raw page contents, tuples, and tuples status.

Disk debugging

psql queries:

SQL

-- Check current WAL write status
SELECT * FROM pg_stat_bgwriter;

-- Check locks and blocking
SELECT blocked_locks.pid     AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid    AS blocking_pid,
       blocking_activity.usename AS blocking_user
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;

Analyzing queries

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) to capture planner vs executor behavior.

WAL archiving test

Configure archive_command to copy WAL files to a safe location and test recovery with pg_restore and recovery.conf or recovery.signal.

Crash simulation

Use abrupt kill of server process and check WAL-based crash recovery upon restart.

Profiling

use perf, eBPF traces, or lttng to get OS-level timelines.

Current state and future directions

State as of 2024 (broad strokes)

Continual improvements in parallel query, logical replication, partitioning, and JIT provide better throughput.
Indexing and storage optimization (BRIN improvements, bloom indexes) support large analytical datasets.
Increasing focus on observability and cloud-native replication tooling (pglogical, built-in logical replication).
Performance improvements and features like incremental sorting, deduplication, and optimized memory management are ongoing.

Research/future directions and experiments

Alternatives to heap such as ZHeap (research/experimental) to reduce bloat and improve visibility management.
Improved cardinality estimation using better statistics, extended histograms, or machine learning approaches.
Transparent sharding and distributed SQL (e.g., Citus) pushing PG internals towards distributed cluster-awareness.
More robust multi-master replication alternatives (BDR, decentralized CRDT-based approaches).
Further JIT and LLVM optimization adoption and pervasively faster expression evaluation.
Row-level security and more fine-grained enforcement.

Practical examples and code snippets

Query patterns to inspect internals:

Check active queries and blocking:

SQL

SELECT pid, usename, state, query_start, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle';

Check lock statuses:

SQL

SELECT pid, mode, granted, relation::regclass
FROM pg_locks l LEFT JOIN pg_class c ON l.relation = c.oid
WHERE database = (SELECT oid FROM pg_database WHERE datname = current_database())
ORDER BY pid;

Observe WAL and checkpoint information:

SQL

SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;

Use EXPLAIN ANALYZE with buffers:

SQL

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT * FROM orders WHERE order_date BETWEEN '2023-01-01' AND '2023-01-31';

Create and use a replication slot:

SQL

-- on primary
SELECT * FROM pg_create_physical_replication_slot('my_slot');

-- or for logical
SELECT * FROM pg_create_logical_replication_slot('my_log_slot', 'pgoutput');

Base backup with pg_basebackup:

Bash

pg_basebackup -h primary -D /var/lib/postgresql/14/main -P -X stream

C extension basic example (see earlier minimal code). Example control file for extension:

Plain Text

comment = 'Hello world extension'
default_version = '1.0'
module_pathname = '$libdir/hello_world'

Inspect pages with pageinspect:

SQL

CREATE EXTENSION pageinspect;
SELECT * FROM heap_page_items(get_raw_page('my_table', 0));

Dump WAL:

Bash

pg_waldump /var/lib/postgresql/14/main/pg_wal/00000001000000000000000A

Logical decoding with wal2json (sample use):

Bash

pg_recvlogical -d mydb -S myslot --start -f -
# will print JSON events as they occur

Conclusion

PostgreSQL internals are rich and carefully engineered. They balance correctness and performance through mechanisms like MVCC, WAL, a robust planner and executor, a range of index types, and an extensible architecture. Understanding these internals helps DBAs and developers tune systems, diagnose problems, and build powerful extensions.

For further reading, practical experimentation is encouraged: use EXPLAIN ANALYZE, pg_stat views, pg_waldump, and source code exploration. The PostgreSQL community and documentation provide detailed references, design notes, and mailing lists for deeper exploration.

PostgreSQL Internals — A Deep Dive

Introduction and brief history

High-level architecture

Storage layout: files, pages, tuples, TOAST

MVCC, tuple visibility, and transaction IDs

Locking, isolation levels, and Serializable Snapshot Isolation (SSI)

Write-Ahead Logging (WAL), checkpoints, recovery, and PITR

Buffer management and caching

Vacuum, autovacuum, bloat, and wraparound

Query planner and optimizer internals

Executor and runtime components

Index types and internals

Replication and high-availability (physical and logical)

Extensions, hooks, and background workers

Observability, statistics, and diagnostic tools

Performance tuning: key GUCs and practical knobs

Tools, debugging and reading WAL

Current state and future directions

Practical examples and code snippets

Conclusion

Recommended reading and references