B-tree Indexes — A Deep Dive
B-tree indexes are among the foundational data structures of modern database systems, filesystems, and storage engines. They provide efficient, balanced, disk-oriented ordered indexing that supports exact-match lookups, ordered range scans, and efficient updates. This article gives a comprehensive treatment: history, structure and invariants, algorithmic details, practical implications for disk and memory, concurrency and recovery, real-world implementations, trade-offs versus alternatives (notably LSM-trees), tuning and deployment guidance, and future directions.
Table of contents
- Introduction and motivation
- Brief history and lineage
- B-tree family: B-tree, B+ tree, B* tree, B-link tree
- Structure, invariants, and node layout
- Algorithms: search, insert (split), delete (merge/redistribute)
- Complexity, fanout, and I/O cost analysis
- Disk/memory-oriented considerations and storage layout
- Practical aspects in databases
- Primary vs secondary (clustered) indexes
- Compound keys, covering indexes, partial and expression indexes
- Bulk loading, fill factor and maintenance
- Prefix/truncation compression, overflow handling
- Concurrency, locking, and recovery
- Lock coupling, optimistic concurrency, latch-free variants
- MVCC interactions and HOT updates
- Write-ahead logging (WAL) and crash recovery
- Implementation examples in popular systems
- PostgreSQL, InnoDB (MySQL), SQLite, Oracle, filesystems (Btrfs)
- B-tree vs alternatives (hash, LSM, radix/ART, learned indexes)
- Advanced topics and optimizations
- Cache-awareness, prefix compression, key ordering
- B+ tree sibling pointers, B-link trees
- Variable-length keys, overflow pages
- Future trends: NVM/PMEM, learned/hybrid indexes, hardware changes
- Practical recommendations / tuning checklist
- Example pseudocode and SQL examples
- Further reading and seminal references
Introduction and motivation
When datasets exceed memory and must be stored on block-addressable storage (disks, SSDs), index structures must minimize I/O. B-tree indexes are balanced trees designed so each node matches a block (page) on disk. By maximizing fanout (number of children per internal node), they keep tree height small, reducing the number of I/Os per lookup. They support:
- Exact lookup (equality)
- Range queries (scans)
- Ordered traversal (ORDER BY)
- Efficient insert/delete/updates (logarithmic height; local splits/merges)
Because the structure stores keys in sorted order and groups many keys per node, B-tree indexes are particularly good for range queries and for workloads requiring low read latency.
Brief history and lineage
- 1970s: B-trees were introduced by Rudolf Bayer and Edward McCreight in 1972 as a balanced tree optimized for secondary storage (Bayer & McCreight, "Organization and Maintenance of Large Ordered Indexes").
- Variants emerged:
- B+ tree: stores all data/values in leaves linked sequentially; internal nodes store only keys. Widely used in DBMS and filesystems.
- B* tree: variant designed to keep nodes fuller by redistributing keys among siblings before splitting.
- B-link tree (Lehman and Yao, 1981): adds right-sibling pointers and high keys to enable lock-free/less-blocking concurrency during splits.
Since their inception, B-tree families have become the default indexing structure in most relational databases, key-value stores, and many filesystems.
B-tree family: B-tree, B+ tree, B* tree, B-link tree
- B-tree (general): internal nodes store keys and child pointers; keys separate child subranges. Leaves may also store records/pointers.
- B+ tree: all actual records (or pointers to records) are stored in leaves. Internal nodes contain only keys and child pointers. Leaves are often doubly or singly linked to support fast ordered scans. This is the most common form for database indexes.
- B* tree: an optimization that tries to keep nodes at least 2/3 full instead of 1/2; on insert, siblings are used to redistribute before splitting.
- B-link tree: a B+ tree with right sibling pointers at every level and additional high-key info to make concurrent splits safe without global coordination; used in concurrent implementations.
Key practical note: when people say "B-tree index" in modern DBMS, they most often mean a B+ tree implementation.
Structure, invariants, and node layout
Basic B-tree invariants (B-tree of order m):
- Each node can have at most m children (i.e., up to m-1 keys).
- Each internal node (except root) must have at least ⌈m/2⌉ children.
- All leaves are at the same depth (tree is height-balanced).
- Keys within a node are sorted.
For B+ tree specifics:
- Internal nodes contain only separators and child pointers.
- Leaf nodes contain key-pointer pairs (pointers to rows or the row itself in clustered indexes).
- Leaf nodes are linked sequentially for efficient range scans.
Node layout commonly includes:
- Header (node type, item count, pointers to parent/left/right sibling)
- Array of keys (or key prefixes)
- Array of pointers (child pointers in internals, record pointers in leaves)
- Optional free space and slot directory (for variable-length keys)
- Checksums/LSNs for recovery
ASCII representation (simplified B+ leaf node): [ header | n | key1 -> recptr1 | key2 -> recptr2 | ... | free space ]
For disk-oriented layout, nodes usually align with page sizes (e.g., 4 KB, 8 KB) to allow efficient I/O.
Algorithms
B-tree operations are straightforward conceptually but have practical subtleties for disk storage, concurrency, and variable-length keys. Below are the core algorithms in brief, followed by annotated pseudocode.
Search (lookup)
- Start at root.
- At each internal node, binary-search (or linear scan for small node) the key array to find the child pointer whose subrange contains the search key.
- Move to that child and repeat until a leaf node is reached.
- In leaf: search keys for equality (or locate starting position for range scan).
I/O cost ~= number of levels visited (one page read per visited node if not cached).
Insert (high level)
- Find the leaf where the key should go.
- Insert the key into the leaf in sorted order.
- If the leaf overflows (exceeds max entries), split it into two nodes:
- For B+ tree, promote the smallest key of the new right node to the parent as the separator.
- If parent overflows due to new separator, recursively split parent.
- If root splits, create a new root and increase tree height by 1.
Splits are local operations typically involving one page read/write and one new page write.
Delete (high level)
- Find leaf containing key and remove it.
- If the leaf becomes underfull (fewer entries than min), try to borrow (redistribute) an entry from a sibling.
- If redistribution is not possible (sibling at minimum), merge with sibling and remove separator from parent.
- If parent underflows, recursively borrow/merge upward. If root ends up with a single child, shrink root (height--).
Pseudocode (simplified)
Search pseudocode: ``text function BTreeSearch(node, key): if node.isLeaf: return node.findKey(key) // return pointer or NOT_FOUND else: i = node.findChildIndexForKey(key) child = node.childPointers[i] return BTreeSearch(child, key) ``
Insert pseudocode (simplified, non-concurrent, B+ semantics): ```text function BTreeInsert(tree, key, value): leaf = findLeaf(tree.root, key) insertIntoLeaf(leaf, key, value) if leaf.isOverflow(): newLeaf, separator = splitLeaf(leaf) insertIntoParent(leaf.parent, separator, leaf, newLeaf)
function insertIntoParent(parent, key, leftChild, rightChild): if parent == null: // create new root newRoot = Node(keys=[key], children=[leftChild, rightChild]) tree.root = newRoot return parent.insertKeyAndChild(key, rightChild) if parent.isOverflow(): newParent, separator = splitInternal(parent) insertIntoParent(parent.parent, separator, parent, newParent) ```
Delete pseudocode sketch omitted here for brevity because it requires many cases (borrow left/right, merge, propagate). Implementations must carefully manage sibling pointers and parent separators.
Complexity, fanout, and I/O cost analysis
Let:
- B = page size (bytes)
- P = pointer size (bytes) — often 8 bytes on 64-bit
- K = average key size (bytes)
- E = space per entry ≈ K + P (for internal nodes might be separator keys + child pointers)
- m ≈ floor(B / E) ≈ approximate maximum children per internal node (fanout)
Height bound:
- Each internal node has at least t = ceil(m/2) children (except root).
- For n keys (or n records), approximate height h satisfies:
n ≤ m^h => h ≥ logm n (very loose) More precise lower bound for height: h ≤ floor(log{t} ((n + 1) / 2)) (when counting nodes vs keys; constants omitted)
Practical consequence: because m is large (hundreds), h remains small: Example:
- B = 4096 bytes, P = 8 bytes, K = 16 bytes → E ≈ 24 → m ≈ 170.
- For n = 1 billion (~1e9), h ≈ log_{170}(1e9) ≈ ln(1e9)/ln(170) ≈ 20.7/5.14 ≈ 4.0.
So only 4-5 levels are typical—meaning many lookups need only 4-5 random page reads in the worst case. With caching (root and upper internal nodes cached), average reads are often 1-2.
I/O cost:
- Point lookup: O(h) page reads (often 1-5).
- Range scan of k contiguous entries: O(h + k / (B / E_leaf)) page reads because sequential leaf pages are read.
- Insert/delete: O(h) reads/writes plus occasional splits/merges (local writes).
CPU complexity:
- Within-node search: binary search on keys (O(log m)), or linear scan for small m. Many implementations use binary search or SIMD-accelerated search.
Space efficiency:
- Minimum fill factor typically 50% (or 66% for B* tree), so storage overhead due to sparsity is bounded.
Disk/memory-oriented considerations and storage layout
B-tree designs optimize for blocks/IOs:
- Node = disk page (4KB, 8KB, etc.)
- Pack many keys into a node to maximize fanout and minimize height.
- Keep internal nodes small and few; typically root and upper levels stay cached in memory.
Leaf layout options:
- Fixed-length entries: straightforward slot array; faster but wasteful for variable-length keys.
- Variable-length keys with slot array: compact storage using variable-length items at the end of the page and a slot table of offsets.
Important techniques:
- Prefix compression in internal nodes: store only separator prefixes sufficient to distinguish children to reduce internal node size and increase fanout.
- Sibling pointers in leaves: allow fast in-order traversal and simplified concurrency.
- Overflow/indirection: for very large values, store a pointer to a blob stored elsewhere (e.g., TOAST in PostgreSQL).
Buffer management:
- Strong interaction with the DB buffer pool or OS page cache impacts effective I/O cost.
- LRU or clock replacement policies target keeping upper levels and hot leaf pages in memory.
Page size choice:
- Larger page size increases fanout and reduces height but increases cost for random modifications because more unrelated records share a page and writes are larger. On SSDs with low latency, smaller pages can be advantageous for write amplification considerations.
Practical aspects in databases
Primary (clustered) vs secondary indexes
- Clustered index: when the table rows are stored physically in leaf pages of the B+ tree ordered by the primary key (InnoDB). The leaf contains the entire row. Only one clustered index per table.
- Secondary index: leaf entries store indexed key plus pointer to the row (either physical row location, rowid, or the clustered ...