Database Indexing Explained for Backend Engineers
Indexes are one of the most powerful and also most misunderstood tools in a backend engineer’s toolbox. Done right, they reduce latency for reads by orders of magnitude. Done wrong, they bloat storage, slow down writes, confuse the query planner, and still not deliver the expected improvements.
This article provides a deep, practical and theory-grounded dive into database indexing targeted at backend engineers who design, tune, and operate data-driven systems.
Table of contents
- Introduction and motivation
- A short history and evolution of indexing
- Core concepts and theoretical foundations
- Types of indexes (relational & specialized)
- How indexes are used by query engines
- Index design patterns and strategies
- Index maintenance, monitoring and costs
- Indexes in distributed and NoSQL systems
- Practical examples (Postgres / MySQL / Cassandra / Elastic)
- Tuning tips, trade-offs and heuristics
- Future directions
- Quick checklist and best practices
- Further reading
Introduction and motivation
What is an index?
- An index is a data structure that accelerates access to rows in a table by one or more columns. Instead of scanning the whole table, the database can locate matching rows quickly using the index.
Why indexes matter for backend engineers:
- They are central to application performance: query latency, throughput, and resource utilization depend heavily on indexes.
- They affect write performance: every index requires maintenance during inserts/updates/deletes.
- They impact storage, replication size, and backup/restore times.
- They influence concurrency and locking behavior.
Use cases where indexes help:
- Lookups by equality (find user by id, email).
- Range scans (find events between timestamps).
- ORDER BY/LIMIT optimization (top-K queries).
- JOINs on indexed join keys.
- GROUP BY and aggregation pre-filtering.
- Full-text search (via inverted indexes).
- Spatial and similarity searches (via R-tree/GiST/GiN, vector indexes).
A short history and evolution of indexing
- Early systems (1960s–1970s): simple B-tree and ISAM indexing.
- RDBMS era (1980s–1990s): B-tree as de-facto standard for relational row stores. Secondary indexes, clustered/unclustered, unique constraints implemented via indexes.
- Disk and memory hierarchy considerations: index design historically targeted minimizing disk I/O.
- Emergence of column stores, LSM-based systems, and inverted indexes (2000s): For write-heavy workloads and columnar analytics, alternatives like LSM trees and bitmap indexes gained traction.
- Modern landscape (2010s–present): distributed databases, SSDs/NVMe, persistent memory, full-text and vector search engines. New indexing structures and learned indexes research.
Core concepts and theoretical foundations
- Search tree basics:
- B-tree / B+tree organizes keys in pages/nodes with fan-out > 2; disk-friendly due to large branching.
- Height ≈ log_fanout(N); large fanouts mean small heights (often 2–4 levels).
- Selectivity and cardinality:
- Selectivity = fraction of rows matching a predicate. Highly selective predicates benefit most from indexes.
- Cardinality = number of distinct values for a column. High cardinality => good candidate for equality index; low cardinality (boolean, flags) often not.
- Covering vs non-covering indexes:
- Covering (index-only) queries can be satisfied fully from the index (no table row lookup).
- Clustered vs non-clustered (clustered = data physically ordered by index):
- InnoDB: primary key is clustered index (table is the B-tree).
- Postgres: no true clustered index but you can CLUSTER a table (physically reorder once).
- Cost model and planner:
- The query planner estimates costs using statistics; misestimation leads to suboptimal plans.
- Consistency & concurrency:
- Index maintenance must be concurrency-safe (MVCC, locking, latches).
- Storage trade-offs:
- Indexes duplicate data — choose columns carefully.
- Write amplification and page splits:
- Inserts/updates can cause splits and additional IO; fillfactor impacts this.
- Bloom filters and filters in distributed stores:
- LSM-based systems use bloom filters to avoid unnecessary SSTable lookups.
Types of indexes
Relational row-store index types you will encounter:
- B-tree (B+tree)
- General-purpose, supports equality and range queries.
- Efficient for ORDER BY, GROUP BY and index scans.
- Used by Postgres, MySQL/InnoDB, Oracle, SQL Server.
- Hash index
- Fast equality lookup (hash table); usually not good for range queries.
- Historically limited in functionality, often not WAL-friendly.
- Examples: Postgres has hash indexes (but B-tree preferred), many KVS use hash tables.
- Bitmap index
- Efficient for low-cardinality columns common in analytics (OLAP).
- Compact and supports bitwise operations; not suited for high-concurrency OLTP updates.
- Used by columnar/analytic systems and some RDBMS features.
- GiST (Generalized Search Tree)
- Extensible, supports spatial, range, and many custom operators.
- Used by Postgres for PostGIS, array ranges, etc.
- GIN (Generalized Inverted Index)
- Designed for indexing composite values (arrays, full-text tokens, JSONB).
- Fast for many-to-many mapping (token -> rows).
- In Postgres used for tsvector, array, JSONB keys.
- R-tree and spatial indexes
- Optimized for geometric and spatial queries; used by GIS extensions like PostGIS.
- Inverted index (full-text)
- Token -> posting list of documents/rows. Supports scoring and boolean/full-text queries.
- Used by ElasticSearch, Lucene, Postgres tsvector + GIN.
- LSM-tree (Log-Structured Merge-tree) backed indexes
- Write-optimized: batched writes to memtable and periodic compaction into SSTables.
- Used by Cassandra, RocksDB, LevelDB, HBase. Often used for highly-ingested workloads.
- Vector and ANN indexes (HNSW, IVF)
- For approximate nearest neighbor (ANN) searches on embeddings. Used in modern ML search flows.
- Functional/Expression indexes
- Indexes built on expressions (lower(email)) or functions (to_tsvector(body)).
- Partial indexes
- Index subsets of rows (WHERE clause) for better space & performance when queries always filter by a condition.
- Composite / multi-column indexes
- Indexes containing multiple columns; order matters.
How query engines use indexes (planner & executor)
- Index lookup types:
- Index seek (point lookup): go down tree to find exact key(s).
- Index range scan: lower/upper bound walk to scan contiguous key range.
- Index-only scan: all required columns are in index; no heap lookup.
- Full index scan: scanning index structure fully — still may be cheaper than a table scan if narrower.
- Plan choice depends on:
- Estimated result cardinality (selectivity)
- Cost per I/O for index access vs table scan
- Whether ORDER BY or LIMIT can benefit from index ordering
- Available statistics and histograms
- Index intersection / bitmap AND/OR:
- Some databases combine multiple single-column indexes rather than using multi-column index; performance depends on engine.
- Example: ORDER BY + LIMIT
- A sorted index on the ORDER BY column can provide the top N without sorting the rows.
Index design patterns and strategies
- Index your primary lookup columns
- Every table should have a primary key (clustered/unclustered) — fastest access path and uniqueness guarantee.
- Index common WHERE predicates
- Columns frequently used in equality or range filters.
- Multi-column indexes: choose column order carefully
- Index (a, b) supports WHERE a = ? AND b = ? and WHERE a = ?.
- Does not efficiently support WHERE b = ? alone (unless the engine supports index skip-scan).
- For ORDER BY (a, b) the index order must match the ORDER BY order (consider direction for DESC).
- Covering indexes (include/INCLUDE)
- Add extra non-key columns to the index leaf to create an index-only plan (Postgres: INCLUDE since 11; SQL Server: INCLUDE; InnoDB: clustering provides included PK).
- Use partial and filtered indexes
- If queries always filter out rows (e.g., active = true), index only active rows.
- Expression/Functional indexes
- Index normalized forms like lower(email) or materialized computed columns.
- Be careful with low-cardinality columns
- Bitmaps might help analytics, but regular indexing on booleans often hurts due to low selectivity.
- Avoid indexing volatile columns
- Frequently updated columns increase index maintenance cost; consider if read benefit outweighs write cost.
- Foreign keys and join keys
- Index FK columns for join performance and referential integrity maintenance.
- Limit number of indexes
- Each additional index increases write cost and storage. Balance reads vs writes.
- Consider multi-tenant/sharding impacts
- Partitioning and shard keys may require different indexing strategies (local vs global index concerns).
Index maintenance, monitoring and costs
Costs:
- Space: indexes duplicate data and metadata (pointers, tuples).
- Write overhead: inserts/updates/deletes update each relevant index.
- Fragmentation and bloat: updates may leave dead entries (MVCC) requiring vacuum/rebuild.
- Planner misestimates: if statistics outdated, poor plans may be chosen.
Maintenance tasks:
- ANALYZE/UPDATE STATISTICS: keeps histograms accurate.
- VACUUM (Postgres): reclaims dead tuples and reduces bloat (VACUUM FULL rebuilds).
- REINDEX: rebuilds an index when corrupted or highly bloated.
- CLUSTER (Postgres): physically reorder table by index—improves locality but is heavy-weight.
- CREATE INDEX CONCURRENTLY: builds index without exclusive lock (Postgres).
- Fillfactor: reserve free space in pages to reduce page ...