A learning path ready to make your own.

Database Indexing Explained for Backend Engineers

Database Indexing — Concise Summary for Backend Engineers Indexes are data structures that accelerate row access by column(s). Proper indexing drastically reduces read latency; misused indexes increase storage, slow writes, mislead planners, and add operational overhead. Index design is an iterative mix of theory, measurement, and workload-aware trade-offs. Why indexes matter Performance: Key to query latency, throughput and resource use. Writes & storage: Every index adds write overhead, storage, replication and backup cost. Concurrency/locking: Index maintenance interacts with MVCC, locks and latches. Use cases: equality lookups, ranges, ORDER BY/LIMIT, JOINs, GROUP BY, full-text, spatial and vector searches. Core concepts B-tree / B+tree: Disk-friendly tree with large fanout; common general-purpose index. Selectivity & cardinality: High selectivity/high cardinality are good index candidates. Covering vs non-covering: Covering (index-only) queries avoid table lookups. Clustered vs non-clustered: Clustered indexes order table storage (e.g., InnoDB PK). Planner & stats: Query planner uses statistics; stale stats lead to bad plans. Write amplification & fragmentation: Page splits, fillfactor, MVCC bloat and compaction matter. Filters (Bloom filters): Used in LSM systems to reduce unnecessary seeks. Common index types B-tree / B+tree — equality and range, ORDER BY/GROUP BY support. Hash — fast equality, poor for ranges (limited use in RDBMS). Bitmap — compact for low-cardinality OLAP use; poor for high-concurrency OLTP updates. GiST / GIN — extensible: GiST for spatial/range; GIN for inverted/composite values (arrays, tsvector, JSONB). R-tree / spatial — geometric queries (PostGIS). Inverted / full-text — token → postings for search (Elastic/Lucene, GIN). LSM-based — write-optimized (Cassandra, RocksDB) with memtables and SSTables. Vector / ANN — HNSW, IVF for approximate nearest neighbours on embeddings. Functional / partial / composite — expression indexes, filtered/partial indexes, multi-column indexes (order matters). How query engines use indexes Lookup modes: point seek, range scan, index-only scan, full-index scan. Planner decisions depend on estimated cardinality, I/O costs, ORDER BY/LIMIT benefits and available stats. Some engines intersect multiple indexes (bitmap AND/OR) or use multi-column indexes when beneficial. Index design patterns & best practices Always have a primary key; index primary lookup columns. Index frequent WHERE, JOIN, ORDER BY and GROUP BY columns. Choose multi-column index order to match query filters and ORDER BY. Use covering (INCLUDE) indexes for hot queries to avoid heap lookups. Use partial/filtered indexes when queries always filter by a predicate. Use expression indexes for normalized/searchable forms (lower(), to_tsvector()). Avoid indexing low-cardinality or highly volatile columns unless part of composite/partial strategies. Limit number of indexes to balance read benefit vs write cost. Account for partitioning/sharding: global indexes add coordination and write amplification. Maintenance, monitoring & costs Costs: space duplication, write overhead, fragmentation/MVCC bloat, planner misestimation. Tasks: ANALYZE/update stats, VACUUM/REINDEX/CLUSTER (Postgres), CREATE INDEX CONCURRENTLY, tune fillfactor. Monitor: index usage (pg_stat_user_indexes, sys views), hit rates, bloat %, scans vs tuples, page splits, write amplification. Prune unused indexes—use metrics to identify candidates to drop. Indexes in distributed & NoSQL systems LSM vs B-tree: LSMs optimize writes (memtable + compaction) with read amplification; B-trees favor in-place updates and predictable reads. Cassandra: Partition key controls placement; design queries into primary key; secondary indexes are limited. MongoDB: supports compound, multi-key, hashed and text indexes; consider sharding effects. Elasticsearch / Lucene: inverted index with analyzers and scoring for full-text. Distributed SQL: global vs local secondary indexes trade off consistency and write cost. Vector/ANN: specialized indexes integrated increasingly into DBs for embeddings. Practical examples (high level) Create unique indexes for lookups (email), partial indexes for active-only queries, and INCLUDE columns to build covering indexes. Use EXPLAIN ANALYZE to compare seq scan vs index scan and measure real query costs (latency P95/P99). In LSM stores (Cassandra) model access via partition key + clustering columns for efficient range queries. Use GIN for tsvector/JSONB full-text and expression indexes for lowercasing/normalization. Tuning heuristics & trade-offs Create indexes for high-selectivity reads, JOIN/ORDER/GROUP predicates, and uniqueness requirements. Avoid indexes on low-selectivity or frequently-updated columns. Rule of thumb: if a query returns > ~5–10% of rows, a sequential scan may be cheaper (system-dependent). Tune fillfactor (70–90%) for write-heavy tables to reduce page splits; higher for read-heavy tables. Test with realistic datasets and concurrent workloads; prefer EXPLAIN ANALYZE and latency tail metrics. Common pitfalls Indexing every column “just in case.” Assuming speed-ups without checking execution plans or updating stats after bulk loads. Indexing highly-updated columns or relying on non-selective indexes. Unaware defaults for fillfactor, vacuum or compaction settings in heavy OLTP. Global secondary indexes in distributed systems causing excessive coordination. Future directions Learned and adaptive/self-tuning indexes. First-class vector/ANN indexing inside databases for ML-driven search. Hardware-aware index structures for NVMe and persistent memory. HTAP and privacy-preserving / encrypted indexes. Quick checklist Have a primary key. Index frequent WHERE/JOIN/ORDER BY columns; prefer compound indexes when filtering on multiple columns. Use covering (INCLUDE) and partial indexes for hot queries and filtered subsets. Use expression indexes for normalized comparisons. Avoid indexes on low-cardinality/volatile columns unless justified. Monitor usage, keep stats updated, vacuum/reindex as needed, and drop unused indexes. Test with realistic, concurrent workloads and measure tail latency. Closing: Indexing blends theory and empirical tuning: instrument, hypothesize, measure, and iterate as workloads evolve. If you want, I can analyze an EXPLAIN ANALYZE output, propose concrete indexes for a schema and queries, or provide scripts to find unused indexes and estimate write cost for Postgres/MySQL.

Let the lesson walk with you.

Podcast

Database Indexing Explained for Backend Engineers podcast

0:00-4:05

Follow the trail that experts already trust.

Resources

Turn quick sparks into lasting recall.

Flashcards

Database Indexing Explained for Backend Engineers flashcards

16 cards

Question

Click to flip
Answer

Prove the idea before it slips away.

Quizzes

Database Indexing Explained for Backend Engineers quiz

12 questions

What is the primary purpose of a database index as described in the material?

Read deeper, connect wider, own the subject.

Deep Article

Database Indexing Explained for Backend Engineers

Indexes are one of the most powerful and also most misunderstood tools in a backend engineer’s toolbox. Done right, they reduce latency for reads by orders of magnitude. Done wrong, they bloat storage, slow down writes, confuse the query planner, and still not deliver the expected improvements.

This article provides a deep, practical and theory-grounded dive into database indexing targeted at backend engineers who design, tune, and operate data-driven systems.

Table of contents

  • Introduction and motivation
  • A short history and evolution of indexing
  • Core concepts and theoretical foundations
  • Types of indexes (relational & specialized)
  • How indexes are used by query engines
  • Index design patterns and strategies
  • Index maintenance, monitoring and costs
  • Indexes in distributed and NoSQL systems
  • Practical examples (Postgres / MySQL / Cassandra / Elastic)
  • Tuning tips, trade-offs and heuristics
  • Future directions
  • Quick checklist and best practices
  • Further reading

Introduction and motivation

What is an index?

  • An index is a data structure that accelerates access to rows in a table by one or more columns. Instead of scanning the whole table, the database can locate matching rows quickly using the index.

Why indexes matter for backend engineers:

  • They are central to application performance: query latency, throughput, and resource utilization depend heavily on indexes.
  • They affect write performance: every index requires maintenance during inserts/updates/deletes.
  • They impact storage, replication size, and backup/restore times.
  • They influence concurrency and locking behavior.

Use cases where indexes help:

  • Lookups by equality (find user by id, email).
  • Range scans (find events between timestamps).
  • ORDER BY/LIMIT optimization (top-K queries).
  • JOINs on indexed join keys.
  • GROUP BY and aggregation pre-filtering.
  • Full-text search (via inverted indexes).
  • Spatial and similarity searches (via R-tree/GiST/GiN, vector indexes).

A short history and evolution of indexing

  • Early systems (1960s–1970s): simple B-tree and ISAM indexing.
  • RDBMS era (1980s–1990s): B-tree as de-facto standard for relational row stores. Secondary indexes, clustered/unclustered, unique constraints implemented via indexes.
  • Disk and memory hierarchy considerations: index design historically targeted minimizing disk I/O.
  • Emergence of column stores, LSM-based systems, and inverted indexes (2000s): For write-heavy workloads and columnar analytics, alternatives like LSM trees and bitmap indexes gained traction.
  • Modern landscape (2010s–present): distributed databases, SSDs/NVMe, persistent memory, full-text and vector search engines. New indexing structures and learned indexes research.

Core concepts and theoretical foundations

  • Search tree basics:
  • B-tree / B+tree organizes keys in pages/nodes with fan-out > 2; disk-friendly due to large branching.
  • Height ≈ log_fanout(N); large fanouts mean small heights (often 2–4 levels).
  • Selectivity and cardinality:
  • Selectivity = fraction of rows matching a predicate. Highly selective predicates benefit most from indexes.
  • Cardinality = number of distinct values for a column. High cardinality => good candidate for equality index; low cardinality (boolean, flags) often not.
  • Covering vs non-covering indexes:
  • Covering (index-only) queries can be satisfied fully from the index (no table row lookup).
  • Clustered vs non-clustered (clustered = data physically ordered by index):
  • InnoDB: primary key is clustered index (table is the B-tree).
  • Postgres: no true clustered index but you can CLUSTER a table (physically reorder once).
  • Cost model and planner:
  • The query planner estimates costs using statistics; misestimation leads to suboptimal plans.
  • Consistency & concurrency:
  • Index maintenance must be concurrency-safe (MVCC, locking, latches).
  • Storage trade-offs:
  • Indexes duplicate data — choose columns carefully.
  • Write amplification and page splits:
  • Inserts/updates can cause splits and additional IO; fillfactor impacts this.
  • Bloom filters and filters in distributed stores:
  • LSM-based systems use bloom filters to avoid unnecessary SSTable lookups.

Types of indexes

Relational row-store index types you will encounter:

  1. B-tree (B+tree)
  • General-purpose, supports equality and range queries.
  • Efficient for ORDER BY, GROUP BY and index scans.
  • Used by Postgres, MySQL/InnoDB, Oracle, SQL Server.
  1. Hash index
  • Fast equality lookup (hash table); usually not good for range queries.
  • Historically limited in functionality, often not WAL-friendly.
  • Examples: Postgres has hash indexes (but B-tree preferred), many KVS use hash tables.
  1. Bitmap index
  • Efficient for low-cardinality columns common in analytics (OLAP).
  • Compact and supports bitwise operations; not suited for high-concurrency OLTP updates.
  • Used by columnar/analytic systems and some RDBMS features.
  1. GiST (Generalized Search Tree)
  • Extensible, supports spatial, range, and many custom operators.
  • Used by Postgres for PostGIS, array ranges, etc.
  1. GIN (Generalized Inverted Index)
  • Designed for indexing composite values (arrays, full-text tokens, JSONB).
  • Fast for many-to-many mapping (token -> rows).
  • In Postgres used for tsvector, array, JSONB keys.
  1. R-tree and spatial indexes
  • Optimized for geometric and spatial queries; used by GIS extensions like PostGIS.
  1. Inverted index (full-text)
  • Token -> posting list of documents/rows. Supports scoring and boolean/full-text queries.
  • Used by ElasticSearch, Lucene, Postgres tsvector + GIN.
  1. LSM-tree (Log-Structured Merge-tree) backed indexes
  • Write-optimized: batched writes to memtable and periodic compaction into SSTables.
  • Used by Cassandra, RocksDB, LevelDB, HBase. Often used for highly-ingested workloads.
  1. Vector and ANN indexes (HNSW, IVF)
  • For approximate nearest neighbor (ANN) searches on embeddings. Used in modern ML search flows.
  1. Functional/Expression indexes
  • Indexes built on expressions (lower(email)) or functions (to_tsvector(body)).
  1. Partial indexes
  • Index subsets of rows (WHERE clause) for better space & performance when queries always filter by a condition.
  1. Composite / multi-column indexes
  • Indexes containing multiple columns; order matters.

How query engines use indexes (planner & executor)

  • Index lookup types:
  • Index seek (point lookup): go down tree to find exact key(s).
  • Index range scan: lower/upper bound walk to scan contiguous key range.
  • Index-only scan: all required columns are in index; no heap lookup.
  • Full index scan: scanning index structure fully — still may be cheaper than a table scan if narrower.
  • Plan choice depends on:
  • Estimated result cardinality (selectivity)
  • Cost per I/O for index access vs table scan
  • Whether ORDER BY or LIMIT can benefit from index ordering
  • Available statistics and histograms
  • Index intersection / bitmap AND/OR:
  • Some databases combine multiple single-column indexes rather than using multi-column index; performance depends on engine.
  • Example: ORDER BY + LIMIT
  • A sorted index on the ORDER BY column can provide the top N without sorting the rows.

Index design patterns and strategies

  1. Index your primary lookup columns
  • Every table should have a primary key (clustered/unclustered) — fastest access path and uniqueness guarantee.
  1. Index common WHERE predicates
  • Columns frequently used in equality or range filters.
  1. Multi-column indexes: choose column order carefully
  • Index (a, b) supports WHERE a = ? AND b = ? and WHERE a = ?.
  • Does not efficiently support WHERE b = ? alone (unless the engine supports index skip-scan).
  • For ORDER BY (a, b) the index order must match the ORDER BY order (consider direction for DESC).
  1. Covering indexes (include/INCLUDE)
  • Add extra non-key columns to the index leaf to create an index-only plan (Postgres: INCLUDE since 11; SQL Server: INCLUDE; InnoDB: clustering provides included PK).
  1. Use partial and filtered indexes
  • If queries always filter out rows (e.g., active = true), index only active rows.
  1. Expression/Functional indexes
  • Index normalized forms like lower(email) or materialized computed columns.
  1. Be careful with low-cardinality columns
  • Bitmaps might help analytics, but regular indexing on booleans often hurts due to low selectivity.
  1. Avoid indexing volatile columns
  • Frequently updated columns increase index maintenance cost; consider if read benefit outweighs write cost.
  1. Foreign keys and join keys
  • Index FK columns for join performance and referential integrity maintenance.
  1. Limit number of indexes
  • Each additional index increases write cost and storage. Balance reads vs writes.
  1. Consider multi-tenant/sharding impacts
  • Partitioning and shard keys may require different indexing strategies (local vs global index concerns).

Index maintenance, monitoring and costs

Costs:

  • Space: indexes duplicate data and metadata (pointers, tuples).
  • Write overhead: inserts/updates/deletes update each relevant index.
  • Fragmentation and bloat: updates may leave dead entries (MVCC) requiring vacuum/rebuild.
  • Planner misestimates: if statistics outdated, poor plans may be chosen.

Maintenance tasks:

  • ANALYZE/UPDATE STATISTICS: keeps histograms accurate.
  • VACUUM (Postgres): reclaims dead tuples and reduces bloat (VACUUM FULL rebuilds).
  • REINDEX: rebuilds an index when corrupted or highly bloated.
  • CLUSTER (Postgres): physically reorder table by index—improves locality but is heavy-weight.
  • CREATE INDEX CONCURRENTLY: builds index without exclusive lock (Postgres).
  • Fillfactor: reserve free space in pages to reduce page ...

Ready to see the full tree?

Clone the preview to open the complete learning structure, practice tools, and generated study materials.