A learning path ready to make your own.

B-tree indexes

B-tree Indexes — A Deep Dive (Summary) This summary outlines the essentials of B-tree indexes: their purpose, structure, core algorithms, practical considerations for disk- and memory-oriented database systems, concurrency and recovery concerns, common implementations, alternatives, optimizations, tuning guidance, and future directions. What B-trees provide Ordered, balanced, block-oriented indexing designed to minimize I/O by packing many keys per page (high fanout). Support for exact lookups, range scans, ordered traversal (ORDER BY), and efficient updates (local splits/merges, logarithmic height). Typical tree height is very small (4–6 levels) for large datasets because fanout is in the hundreds for typical page/key sizes. History & variants Introduced by Bayer & McCreight (1972). Major practical variants: B+ tree — common DBMS form: internal nodes hold keys only; records live in linked leaves. B* tree — aims for higher minimum fill (≈2/3) via sibling redistribution. B-link tree — adds right-sibling pointers/high-keys to improve concurrency/split safety. Structure, invariants & node layout Order m: node ≤ m children, internal nodes (except root) ≥ ⌈m/2⌉ children; all leaves same depth; keys sorted inside nodes. B+ specifics: all records in leaves, leaves linked for fast scans. Typical page layout: header (type, counts, sibling/parent pointers), key array (often prefix-compressed), pointer array (children or record pointers), slot directory for variable-length entries, optional checksums/LSN. Core algorithms (high level) Search: traverse from root, binary-search keys in each node to pick child, end at leaf to find key or range start. I/O ≈ tree height (often cached). Insert: insert into leaf; if overflow, split leaf and promote separator; propagate splits up, possibly creating new root. Splits are local and typically one new page write + parent update. Delete: remove from leaf; if underfull, try redistribute from sibling or merge and propagate parent adjustments; root may shrink. Within-node search is O(log m) (binary search) or linear for small nodes; overall lookup cost O(h · log m) but dominated by page I/O. Complexity & I/O cost Fanout m ≈ floor(B / (K+P)), where B = page size, K = avg key size, P = pointer size. Large m keeps height h ≈ log_m n small. Point lookup: O(h) page reads (often 1–5 worst-case). Range scan of k entries: O(h + k/(entries per leaf page)). Space: minimum occupancy ≈ 50% (66% for B*), bounding storage overhead. Disk/memory layout & optimizations Align nodes with disk pages (4K/8K/16K). Larger pages increase fanout but enlarge write units. Prefix compression in internal nodes to increase fanout; leaf compression for space savings when keys share prefixes. Variable-length keys use slot directories; very large values are stored out-of-line (overflow/TOAST). Buffer pool and caching of upper levels dramatically reduce average I/O. Practical database aspects Clustered (primary) index: table rows stored in leaf pages ordered by key (only one per table). Excellent for range scans on the clustering key. Secondary indexes: store key + pointer to row (or to clustered key); non-covering queries may require an extra lookup. Compound indexes are lexicographic; usefulness depends on leading columns. Covering indexes avoid table fetches. Bulk-loading: building leaves from sorted input (bottom-up) avoids many splits and is faster than many individual inserts. Fill factor controls initial leaf fullness to reduce splits; periodic reindexing/vacuuming removes fragmentation and dead entries. Concurrency & recovery Concurrency strategies: lock coupling (hand-over-hand), latches for short-term safety, optimistic (validate-and-retry) traversals, and latch-free variants. B-link trees and sibling pointers allow more localized locking and safer split behavior under concurrency. MVCC interactions (e.g., PostgreSQL HOT updates) reduce index churn by avoiding unnecessary index updates for certain updates; vacuuming reclaims dead index entries. Durability via WAL: structural changes (splits/merges) are logged and ordered so recovery can reestablish consistency. Representative implementations PostgreSQL: optimized B+-tree, MVCC-aware, HOT updates, vacuuming, prefix compression options. InnoDB (MySQL): clustered B+ tree (default primary), 16KB pages, adaptive hashing. SQLite: file-format B-tree storing tables and indexes. Oracle and modern filesystems (Btrfs, XFS extent trees) use B-tree variants for metadata/extent management. B-tree vs alternatives (trade-offs) Hash indexes: O(1) equality lookups but no ordering/ranges. LSM-trees: superior write throughput (append/compaction model) but higher read amplification, less efficient range scans, compaction overhead. In-memory trees (ART, radix): cache-friendly and fast in RAM but not optimized for block storage. Learned indexes: promising smaller structures by modeling key distribution but still experimental for dynamic/update-heavy workloads. Choose B-tree when range queries, low-latency reads, and mixed read/write workloads dominate; LSM when write throughput and append-friendly workloads dominate. Advanced optimizations Prefix/truncation compression, sibling pointers (B+), B-link enhancements, variable-length key management. Hot-spot mitigation for monotonically increasing keys: partitioning, randomized keys, or multi-root/hash sharding. Cache-aware layout, prefetching, and SIMD-accelerated searches for within-node speedups. Future trends Persistent memory (NVM/PMEM) enables byte-addressable durable structures and new concurrency/crash-consistency designs. Hybrid/learned index ideas may reduce space and probes but require robust update/repair strategies. Hardware shifts (NVMe, SSDs) and cloud/distributed databases influence page-size, compaction, and replication strategies. Tuning checklist (practical recommendations) Prefer B-trees for range-heavy and mixed workloads; consider LSM for write-heavy, append-dominant workloads. Choose page size to suit storage and workload; keep upper levels cached. Set fillfactor to balance split frequency vs space usage when heavy inserts follow index creation. Avoid unpartitioned monotonically increasing keys under high concurrent inserts; use partitioning or other mitigations. Use covering indexes where possible to avoid extra lookups; bulk-load sorted data for large imports. Monitor bloat and rebuild/reindex as needed; benchmark compression trade-offs for CPU vs I/O savings. Conclusion B-tree (especially B+ tree) indexes are a mature, versatile choice for ordered indexing on block storage, offering predictable, low-height search costs, efficient range scans, and local updates. They remain the default in many RDBMS and filesystems due to their balance of simplicity, robustness, and support for mixed workloads. For deeper dives I can provide detailed delete pseudocode, an insertion simulation with I/O counts, workload-specific numerical comparisons with LSM, or an annotated implementation sketch in C/C++/Rust—tell me which you prefer.

Open full tree

Follow the trail that experts already trust.

Resources

39:41

10.2 B Trees and B+ Trees. How they are useful in Databases

Abdul Bari1.6M views

12:39

Understanding B-Trees: The Data Structure Behind Modern Databases

Spanning Tree1.0M views

14:16

DB Indexing in System Design Interviews - B-tree, Geospatial, Inverted Index, and more!

Hello Interview - SWE Interview Preparation200.4K views

3:57

Read deeper, connect wider, own the subject.

Deep Article

B-tree Indexes — A Deep Dive

B-tree indexes are among the foundational data structures of modern database systems, filesystems, and storage engines. They provide efficient, balanced, disk-oriented ordered indexing that supports exact-match lookups, ordered range scans, and efficient updates. This article gives a comprehensive treatment: history, structure and invariants, algorithmic details, practical implications for disk and memory, concurrency and recovery, real-world implementations, trade-offs versus alternatives (notably LSM-trees), tuning and deployment guidance, and future directions.

Table of contents

Introduction and motivation
Brief history and lineage
B-tree family: B-tree, B+ tree, B* tree, B-link tree
Structure, invariants, and node layout
Algorithms: search, insert (split), delete (merge/redistribute)
Complexity, fanout, and I/O cost analysis
Disk/memory-oriented considerations and storage layout
Practical aspects in databases
Primary vs secondary (clustered) indexes
Compound keys, covering indexes, partial and expression indexes
Bulk loading, fill factor and maintenance
Prefix/truncation compression, overflow handling
Concurrency, locking, and recovery
Lock coupling, optimistic concurrency, latch-free variants
MVCC interactions and HOT updates
Write-ahead logging (WAL) and crash recovery
Implementation examples in popular systems
PostgreSQL, InnoDB (MySQL), SQLite, Oracle, filesystems (Btrfs)
B-tree vs alternatives (hash, LSM, radix/ART, learned indexes)
Advanced topics and optimizations
Cache-awareness, prefix compression, key ordering
B+ tree sibling pointers, B-link trees
Variable-length keys, overflow pages
Future trends: NVM/PMEM, learned/hybrid indexes, hardware changes
Practical recommendations / tuning checklist
Example pseudocode and SQL examples
Further reading and seminal references

Introduction and motivation

When datasets exceed memory and must be stored on block-addressable storage (disks, SSDs), index structures must minimize I/O. B-tree indexes are balanced trees designed so each node matches a block (page) on disk. By maximizing fanout (number of children per internal node), they keep tree height small, reducing the number of I/Os per lookup. They support:

Exact lookup (equality)
Range queries (scans)
Ordered traversal (ORDER BY)
Efficient insert/delete/updates (logarithmic height; local splits/merges)

Because the structure stores keys in sorted order and groups many keys per node, B-tree indexes are particularly good for range queries and for workloads requiring low read latency.

Brief history and lineage

1970s: B-trees were introduced by Rudolf Bayer and Edward McCreight in 1972 as a balanced tree optimized for secondary storage (Bayer & McCreight, "Organization and Maintenance of Large Ordered Indexes").
Variants emerged:
B+ tree: stores all data/values in leaves linked sequentially; internal nodes store only keys. Widely used in DBMS and filesystems.
B* tree: variant designed to keep nodes fuller by redistributing keys among siblings before splitting.
B-link tree (Lehman and Yao, 1981): adds right-sibling pointers and high keys to enable lock-free/less-blocking concurrency during splits.

Since their inception, B-tree families have become the default indexing structure in most relational databases, key-value stores, and many filesystems.

B-tree family: B-tree, B+ tree, B* tree, B-link tree

B-tree (general): internal nodes store keys and child pointers; keys separate child subranges. Leaves may also store records/pointers.
B+ tree: all actual records (or pointers to records) are stored in leaves. Internal nodes contain only keys and child pointers. Leaves are often doubly or singly linked to support fast ordered scans. This is the most common form for database indexes.
B* tree: an optimization that tries to keep nodes at least 2/3 full instead of 1/2; on insert, siblings are used to redistribute before splitting.
B-link tree: a B+ tree with right sibling pointers at every level and additional high-key info to make concurrent splits safe without global coordination; used in concurrent implementations.

Key practical note: when people say "B-tree index" in modern DBMS, they most often mean a B+ tree implementation.

Structure, invariants, and node layout

Basic B-tree invariants (B-tree of order m):

Each node can have at most m children (i.e., up to m-1 keys).
Each internal node (except root) must have at least ⌈m/2⌉ children.
All leaves are at the same depth (tree is height-balanced).
Keys within a node are sorted.

For B+ tree specifics:

Internal nodes contain only separators and child pointers.
Leaf nodes contain key-pointer pairs (pointers to rows or the row itself in clustered indexes).
Leaf nodes are linked sequentially for efficient range scans.

Node layout commonly includes:

Header (node type, item count, pointers to parent/left/right sibling)
Array of keys (or key prefixes)
Array of pointers (child pointers in internals, record pointers in leaves)
Optional free space and slot directory (for variable-length keys)
Checksums/LSNs for recovery

For disk-oriented layout, nodes usually align with page sizes (e.g., 4 KB, 8 KB) to allow efficient I/O.

Algorithms

B-tree operations are straightforward conceptually but have practical subtleties for disk storage, concurrency, and variable-length keys. Below are the core algorithms in brief, followed by annotated pseudocode.

Search (lookup)

Start at root.
At each internal node, binary-search (or linear scan for small node) the key array to find the child pointer whose subrange contains the search key.
Move to that child and repeat until a leaf node is reached.
In leaf: search keys for equality (or locate starting position for range scan).

I/O cost ~= number of levels visited (one page read per visited node if not cached).

Insert (high level)

Find the leaf where the key should go.
Insert the key into the leaf in sorted order.
If the leaf overflows (exceeds max entries), split it into two nodes:
For B+ tree, promote the smallest key of the new right node to the parent as the separator.
If parent overflows due to new separator, recursively split parent.
If root splits, create a new root and increase tree height by 1.

Splits are local operations typically involving one page read/write and one new page write.

Delete (high level)

Find leaf containing key and remove it.
If the leaf becomes underfull (fewer entries than min), try to borrow (redistribute) an entry from a sibling.
If redistribution is not possible (sibling at minimum), merge with sibling and remove separator from parent.
If parent underflows, recursively borrow/merge upward. If root ends up with a single child, shrink root (height--).

Pseudocode (simplified)

Search pseudocode: ``text function BTreeSearch(node, key): if node.isLeaf: return node.findKey(key) // return pointer or NOT_FOUND else: i = node.findChildIndexForKey(key) child = node.childPointers[i] return BTreeSearch(child, key) ``

Insert pseudocode (simplified, non-concurrent, B+ semantics): ```text function BTreeInsert(tree, key, value): leaf = findLeaf(tree.root, key) insertIntoLeaf(leaf, key, value) if leaf.isOverflow(): newLeaf, separator = splitLeaf(leaf) insertIntoParent(leaf.parent, separator, leaf, newLeaf)

function insertIntoParent(parent, key, leftChild, rightChild): if parent == null: // create new root newRoot = Node(keys=[key], children=[leftChild, rightChild]) tree.root = newRoot return parent.insertKeyAndChild(key, rightChild) if parent.isOverflow(): newParent, separator = splitInternal(parent) insertIntoParent(parent.parent, separator, parent, newParent) ```

Delete pseudocode sketch omitted here for brevity because it requires many cases (borrow left/right, merge, propagate). Implementations must carefully manage sibling pointers and parent separators.

Complexity, fanout, and I/O cost analysis

Let:

B = page size (bytes)
P = pointer size (bytes) — often 8 bytes on 64-bit
K = average key size (bytes)
E = space per entry ≈ K + P (for internal nodes might be separator keys + child pointers)
m ≈ floor(B / E) ≈ approximate maximum children per internal node (fanout)

Height bound:

Each internal node has at least t = ceil(m/2) children (except root).
For n keys (or n records), approximate height h satisfies:

n ≤ m^h => h ≥ logm n (very loose) More precise lower bound for height: h ≤ floor(log{t} ((n + 1) / 2)) (when counting nodes vs keys; constants omitted)

Practical consequence: because m is large (hundreds), h remains small: Example:

B = 4096 bytes, P = 8 bytes, K = 16 bytes → E ≈ 24 → m ≈ 170.
For n = 1 billion (~1e9), h ≈ log_{170}(1e9) ≈ ln(1e9)/ln(170) ≈ 20.7/5.14 ≈ 4.0.

So only 4-5 levels are typical—meaning many lookups need only 4-5 random page reads in the worst case. With caching (root and upper internal nodes cached), average reads are often 1-2.

I/O cost:

Point lookup: O(h) page reads (often 1-5).
Range scan of k contiguous entries: O(h + k / (B / E_leaf)) page reads because sequential leaf pages are read.
Insert/delete: O(h) reads/writes plus occasional splits/merges (local writes).

CPU complexity:

Within-node search: binary search on keys (O(log m)), or linear scan for small m. Many implementations use binary search or SIMD-accelerated search.

Space efficiency:

Minimum fill factor typically 50% (or 66% for B* tree), so storage overhead due to sparsity is bounded.

Disk/memory-oriented considerations and storage layout

B-tree designs optimize for blocks/IOs:

Node = disk page (4KB, 8KB, etc.)
Pack many keys into a node to maximize fanout and minimize height.
Keep internal nodes small and few; typically root and upper levels stay cached in memory.

Leaf layout options:

Fixed-length entries: straightforward slot array; faster but wasteful for variable-length keys.
Variable-length keys with slot array: compact storage using variable-length items at the end of the page and a slot table of offsets.

Important techniques:

Prefix compression in internal nodes: store only separator prefixes sufficient to distinguish children to reduce internal node size and increase fanout.
Sibling pointers in leaves: allow fast in-order traversal and simplified concurrency.
Overflow/indirection: for very large values, store a pointer to a blob stored elsewhere (e.g., TOAST in PostgreSQL).

Buffer management:

Strong interaction with the DB buffer pool or OS page cache impacts effective I/O cost.
LRU or clock replacement policies target keeping upper levels and hot leaf pages in memory.

Page size choice:

Larger page size increases fanout and reduces height but increases cost for random modifications because more unrelated records share a page and writes are larger. On SSDs with low latency, smaller pages can be advantageous for write amplification considerations.

Practical aspects in databases

Primary (clustered) vs secondary indexes

Clustered index: when the table rows are stored physically in leaf pages of the B+ tree ordered by the primary key (InnoDB). The leaf contains the entire row. Only one clustered index per table.
Secondary index: leaf entries store indexed key plus pointer to the row (either physical row location, rowid, or the clustered ...

Ready to see the full tree?

Clone the preview to open the complete learning structure, practice tools, and generated study materials.