Database Indexing Explained for Backend Engineers

Indexes are one of the most powerful and also most misunderstood tools in a backend engineer’s toolbox. Done right, they reduce latency for reads by orders of magnitude. Done wrong, they bloat storage, slow down writes, confuse the query planner, and still not deliver the expected improvements.

This article provides a deep, practical and theory-grounded dive into database indexing targeted at backend engineers who design, tune, and operate data-driven systems.

Table of contents

  • Introduction and motivation
  • A short history and evolution of indexing
  • Core concepts and theoretical foundations
  • Types of indexes (relational & specialized)
  • How indexes are used by query engines
  • Index design patterns and strategies
  • Index maintenance, monitoring and costs
  • Indexes in distributed and NoSQL systems
  • Practical examples (Postgres / MySQL / Cassandra / Elastic)
  • Tuning tips, trade-offs and heuristics
  • Future directions
  • Quick checklist and best practices
  • Further reading

Introduction and motivation

What is an index?

  • An index is a data structure that accelerates access to rows in a table by one or more columns. Instead of scanning the whole table, the database can locate matching rows quickly using the index.

Why indexes matter for backend engineers:

  • They are central to application performance: query latency, throughput, and resource utilization depend heavily on indexes.
  • They affect write performance: every index requires maintenance during inserts/updates/deletes.
  • They impact storage, replication size, and backup/restore times.
  • They influence concurrency and locking behavior.

Use cases where indexes help:

  • Lookups by equality (find user by id, email).
  • Range scans (find events between timestamps).
  • ORDER BY/LIMIT optimization (top-K queries).
  • JOINs on indexed join keys.
  • GROUP BY and aggregation pre-filtering.
  • Full-text search (via inverted indexes).
  • Spatial and similarity searches (via R-tree/GiST/GiN, vector indexes).

A short history and evolution of indexing

  • Early systems (1960s–1970s): simple B-tree and ISAM indexing.
  • RDBMS era (1980s–1990s): B-tree as de-facto standard for relational row stores. Secondary indexes, clustered/unclustered, unique constraints implemented via indexes.
  • Disk and memory hierarchy considerations: index design historically targeted minimizing disk I/O.
  • Emergence of column stores, LSM-based systems, and inverted indexes (2000s): For write-heavy workloads and columnar analytics, alternatives like LSM trees and bitmap indexes gained traction.
  • Modern landscape (2010s–present): distributed databases, SSDs/NVMe, persistent memory, full-text and vector search engines. New indexing structures and learned indexes research.

Core concepts and theoretical foundations

  • Search tree basics:
    • B-tree / B+tree organizes keys in pages/nodes with fan-out > 2; disk-friendly due to large branching.
    • Height ≈ log_fanout(N); large fanouts mean small heights (often 2–4 levels).
  • Selectivity and cardinality:
    • Selectivity = fraction of rows matching a predicate. Highly selective predicates benefit most from indexes.
    • Cardinality = number of distinct values for a column. High cardinality => good candidate for equality index; low cardinality (boolean, flags) often not.
  • Covering vs non-covering indexes:
    • Covering (index-only) queries can be satisfied fully from the index (no table row lookup).
  • Clustered vs non-clustered (clustered = data physically ordered by index):
    • InnoDB: primary key is clustered index (table is the B-tree).
    • Postgres: no true clustered index but you can CLUSTER a table (physically reorder once).
  • Cost model and planner:
    • The query planner estimates costs using statistics; misestimation leads to suboptimal plans.
  • Consistency & concurrency:
    • Index maintenance must be concurrency-safe (MVCC, locking, latches).
  • Storage trade-offs:
    • Indexes duplicate data — choose columns carefully.
  • Write amplification and page splits:
    • Inserts/updates can cause splits and additional IO; fillfactor impacts this.
  • Bloom filters and filters in distributed stores:
    • LSM-based systems use bloom filters to avoid unnecessary SSTable lookups.

Types of indexes

Relational row-store index types you will encounter:

  1. B-tree (B+tree)

    • General-purpose, supports equality and range queries.
    • Efficient for ORDER BY, GROUP BY and index scans.
    • Used by Postgres, MySQL/InnoDB, Oracle, SQL Server.
  2. Hash index

    • Fast equality lookup (hash table); usually not good for range queries.
    • Historically limited in functionality, often not WAL-friendly.
    • Examples: Postgres has hash indexes (but B-tree preferred), many KVS use hash tables.
  3. Bitmap index

    • Efficient for low-cardinality columns common in analytics (OLAP).
    • Compact and supports bitwise operations; not suited for high-concurrency OLTP updates.
    • Used by columnar/analytic systems and some RDBMS features.
  4. GiST (Generalized Search Tree)

    • Extensible, supports spatial, range, and many custom operators.
    • Used by Postgres for PostGIS, array ranges, etc.
  5. GIN (Generalized Inverted Index)

    • Designed for indexing composite values (arrays, full-text tokens, JSONB).
    • Fast for many-to-many mapping (token -> rows).
    • In Postgres used for tsvector, array, JSONB keys.
  6. R-tree and spatial indexes

    • Optimized for geometric and spatial queries; used by GIS extensions like PostGIS.
  7. Inverted index (full-text)

    • Token -> posting list of documents/rows. Supports scoring and boolean/full-text queries.
    • Used by ElasticSearch, Lucene, Postgres tsvector + GIN.
  8. LSM-tree (Log-Structured Merge-tree) backed indexes

    • Write-optimized: batched writes to memtable and periodic compaction into SSTables.
    • Used by Cassandra, RocksDB, LevelDB, HBase. Often used for highly-ingested workloads.
  9. Vector and ANN indexes (HNSW, IVF)

    • For approximate nearest neighbor (ANN) searches on embeddings. Used in modern ML search flows.
  10. Functional/Expression indexes

    • Indexes built on expressions (lower(email)) or functions (to_tsvector(body)).
  11. Partial indexes

    • Index subsets of rows (WHERE clause) for better space & performance when queries always filter by a condition.
  12. Composite / multi-column indexes

    • Indexes containing multiple columns; order matters.

How query engines use indexes (planner & executor)

  • Index lookup types:
    • Index seek (point lookup): go down tree to find exact key(s).
    • Index range scan: lower/upper bound walk to scan contiguous key range.
    • Index-only scan: all required columns are in index; no heap lookup.
    • Full index scan: scanning index structure fully — still may be cheaper than a table scan if narrower.
  • Plan choice depends on:
    • Estimated result cardinality (selectivity)
    • Cost per I/O for index access vs table scan
    • Whether ORDER BY or LIMIT can benefit from index ordering
    • Available statistics and histograms
  • Index intersection / bitmap AND/OR:
    • Some databases combine multiple single-column indexes rather than using multi-column index; performance depends on engine.
  • Example: ORDER BY + LIMIT
    • A sorted index on the ORDER BY column can provide the top N without sorting the rows.

Index design patterns and strategies

  1. Index your primary lookup columns

    • Every table should have a primary key (clustered/unclustered) — fastest access path and uniqueness guarantee.
  2. Index common WHERE predicates

    • Columns frequently used in equality or range filters.
  3. Multi-column indexes: choose column order carefully

    • Index (a, b) supports WHERE a = ? AND b = ? and WHERE a = ?.
    • Does not efficiently support WHERE b = ? alone (unless the engine supports index skip-scan).
    • For ORDER BY (a, b) the index order must match the ORDER BY order (consider direction for DESC).
  4. Covering indexes (include/INCLUDE)

    • Add extra non-key columns to the index leaf to create an index-only plan (Postgres: INCLUDE since 11; SQL Server: INCLUDE; InnoDB: clustering provides included PK).
  5. Use partial and filtered indexes

    • If queries always filter out rows (e.g., active = true), index only active rows.
  6. Expression/Functional indexes

    • Index normalized forms like lower(email) or materialized computed columns.
  7. Be careful with low-cardinality columns

    • Bitmaps might help analytics, but regular indexing on booleans often hurts due to low selectivity.
  8. Avoid indexing volatile columns

    • Frequently updated columns increase index maintenance cost; consider if read benefit outweighs write cost.
  9. Foreign keys and join keys

    • Index FK columns for join performance and referential integrity maintenance.
  10. Limit number of indexes

  • Each additional index increases write cost and storage. Balance reads vs writes.
  1. Consider multi-tenant/sharding impacts
  • Partitioning and shard keys may require different indexing strategies (local vs global index concerns).

Index maintenance, monitoring and costs

Costs:

  • Space: indexes duplicate data and metadata (pointers, tuples).
  • Write overhead: inserts/updates/deletes update each relevant index.
  • Fragmentation and bloat: updates may leave dead entries (MVCC) requiring vacuum/rebuild.
  • Planner misestimates: if statistics outdated, poor plans may be chosen.

Maintenance tasks:

  • ANALYZE/UPDATE STATISTICS: keeps histograms accurate.
  • VACUUM (Postgres): reclaims dead tuples and reduces bloat (VACUUM FULL rebuilds).
  • REINDEX: rebuilds an index when corrupted or highly bloated.
  • CLUSTER (Postgres): physically reorder table by index—improves locality but is heavy-weight.
  • CREATE INDEX CONCURRENTLY: builds index without exclusive lock (Postgres).
  • Fillfactor: reserve free space in pages to reduce page splits.

Monitoring index usage:

  • Postgres: pg_stat_user_indexes, pg_statio_user_indexes, pg_stat_all_indexes.
  • MySQL: Information_schema.INNODB_INDEX_STATS and sys schema views for usage.
  • SQL Server: sys.dm_db_index_usage_stats and sys.dm_db_index_physical_stats.
  • Look for indexes with 0 usage and high maintenance cost — candidates to drop.

Practical metrics to track:

  • Index hit ratio (cache vs disk)
  • Number of scans vs number of tuples fetched
  • Bloat percentage (size vs expected)
  • Write amplification/cost per write
  • Average page split frequency

Example (Postgres — find unused indexes):

SQL
1SELECT schemaname, relname, indexrelname, idx_scan 2FROM pg_stat_user_indexes 3JOIN pg_index ON pg_index.indexrelid = pg_stat_user_indexes.indexrelid 4WHERE idx_scan = 0 5 AND NOT indisprimary;

Indexes in distributed and NoSQL systems

  1. LSM-tree vs B-tree

    • LSM (Cassandra, RocksDB): write-optimized; small writes in memtable, periodic compaction into sorted files (SSTables).
    • B-tree (RDBMS): update-in-place, good for read-heavy and range scans.
    • Trade-offs: LSM provides high write throughput with potential read amplification and compaction overhead, whereas B-trees provide predictable read latency and lower read amplification.
  2. Cassandra

    • Primary key = partition key + clustering columns; partition key determines node placement.
    • Secondary indexes have pitfalls (local to node, can be inefficient); generally prefer modeling queries into primary key or using materialized views (with caution).
    • Use local secondary index for narrow cardinality or indexing frequently filtered columns.
  3. MongoDB

    • Supports compound, multi-key (array) indexes, hashed indexes for sharding, text indexes for full-text search.
    • Indexes are global per collection (not per shard unless sharded collection).
  4. ElasticSearch / Lucene

    • Inverted index for full-text: token -> postings list with positions for phrase queries and scoring.
    • Supports analyzers, tokenizers, and various scoring models.
  5. Distributed SQL

    • Global secondary indexes vs local indexes: global indexes require coordination and cross-shard consistency, complicating writes.
    • Some systems avoid global indexes to maintain single-key writes.
  6. Vector search / ANN indexes

    • Engines like FAISS/HNSW approximate KNN using specialized indexes; increasingly integrated into DBs, requiring new indexing notions.

Practical examples

Examples use PostgreSQL and MySQL/InnoDB, demonstrating common patterns.

  1. Create a simple table and index (Postgres)
SQL
1CREATE TABLE users ( 2 id bigserial PRIMARY KEY, 3 email text NOT NULL, 4 created_at timestamptz NOT NULL DEFAULT now(), 5 active boolean NOT NULL DEFAULT true 6); 7 8-- Index email for lookup: 9CREATE UNIQUE INDEX users_email_idx ON users (email); 10 11-- Partial index for active users: 12CREATE INDEX users_active_created_idx ON users (created_at) 13 WHERE active = true;
  1. Use EXPLAIN to see effect Before index:
SQL
EXPLAIN ANALYZE SELECT * FROM users WHERE email = '[email protected]'; -- Planner might show: Seq Scan on users (time X)

After index:

SQL
EXPLAIN ANALYZE SELECT * FROM users WHERE email = '[email protected]'; -- Planner shows Index Scan using users_email_idx (time Y << X)
  1. Covering index (INCLUDE) to avoid heap fetch (Postgres 11+)
SQL
CREATE INDEX users_email_created_idx ON users (email) INCLUDE (created_at, active); -- Now SELECT email, created_at, active WHERE email = ? can be index-only.
  1. Multicolumn index ordering
SQL
1-- For queries like WHERE city = ? AND state = ?: 2CREATE INDEX idx_city_state ON addresses (city, state); 3 4-- This index also supports WHERE city = ? alone, but not WHERE state = ? alone efficiently.
  1. Full-text with GIN
SQL
ALTER TABLE articles ADD COLUMN tsv tsvector; UPDATE articles SET tsv = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,'')); CREATE INDEX articles_tsv_gin_idx ON articles USING gin(tsv);

Query:

SQL
1SELECT id, title FROM articles 2WHERE tsv @@ plainto_tsquery('english', 'postgres indexing') 3ORDER BY ts_rank(tsv, plainto_tsquery('postgres indexing')) DESC 4LIMIT 10;
  1. Expression index example
SQL
CREATE INDEX idx_lower_email ON users (lower(email)); -- Use with WHERE lower(email) = '[email protected]'
  1. LSM-system example (Cassandra primary key)
Plain Text
1CREATE TABLE events ( 2 user_id uuid, 3 event_time timestamp, 4 event_type text, 5 payload text, 6 PRIMARY KEY (user_id, event_time) -- partition key user_id, clustering by event_time 7); 8-- Query: SELECT * FROM events WHERE user_id = ? AND event_time > ?

Tuning tips, trade-offs and heuristics

When to create an index:

  • Frequent reads filtering on the column(s) with high selectivity.
  • Columns used in JOIN, ORDER BY, and GROUP BY where sorting can be avoided.
  • Columns used for uniqueness constraints.

When not to create an index:

  • Low selectivity columns (booleans, tiny enums) unless combined with other columns or used in partial indexes.
  • Columns that are updated frequently with no read benefit.
  • Temporary or small tables where full scans are cheap.

Heuristics:

  • If a query returns > ~5–10% of the table, a sequential scan may be cheaper than an indexed plan (depends on table width and system).
  • Prefer composite indexes for multi-column equality filters rather than multiple single-column indexes (planner dependent).
  • Monitor and drop unused indexes.
  • Favor covering indexes for hot queries that do many lookups.
  • Use partial indexes to reduce index size and maintenance when queries filter by a predicate.
  • Consider fillfactor of 70–90% for write-heavy tables to reduce page splits; for read-heavy, higher fillfactor is OK.

Testing:

  • Use realistic datasets and EXPLAIN ANALYZE to measure real-world effects.
  • Measure latency P95/P99, not only averages.
  • Test under concurrent loads to observe contention and lock behavior.

Common pitfalls and anti-patterns

  • Indexing every column "just in case" — leads to high write cost and storage waste.
  • Assuming index will speed queries without looking at execution plan.
  • Forgetting to update statistics after bulk loads — planner misestimates.
  • Indexing highly-updated columns without considering write amplification.
  • Using non-selective indexes expecting speed-ups (e.g., boolean flags).
  • Relying on database defaults without tuning fillfactor or vacuum settings for heavy OLTP.
  • In distributed systems, creating global secondary indexes causes large write amplification and coordination overhead.

  • Learned indexes: using machine learning models to predict key positions (research, some experimental DBs).
  • Adaptive and self-tuning indexes: systems that create/drop indexes automatically using workload telemetry.
  • Integration of vector/ANN indexes into databases as first-class citizens for search on embeddings.
  • Hardware-aware index design for NVMe and persistent memory—smaller latencies and different trade-offs change optimal indexing.
  • HTAP: hybrid transactional-analytical systems blend row and columnar storage, requiring flexible indexing strategies.
  • Indexes for privacy: encrypted or privacy-preserving indexes to enable queries while protecting data.

Quick checklist and best practices

  • Always have a primary key.
  • Index columns used frequently in WHERE, JOIN, and ORDER BY.
  • Favor compound indexes when queries filter on multiple columns together. Order matters.
  • Favor covering (INCLUDE) indexes for hot queries to eliminate table lookups.
  • Use partial indexes when predicates are stable and filter subsets.
  • Use expression indexes for normalization (lower(), functions).
  • Avoid indexing low-cardinality, volatile columns unless part of composite / partial index.
  • Monitor index usage and drop unused indexes.
  • Keep stats updated (ANALYZE); vacuum/reindex as needed.
  • Consider the underlying storage engine (B-tree vs LSM) and platform (SSD vs HDD).
  • Test with representative workloads and concurrency.

Further reading and references

  • PostgreSQL official docs: Index Types, CREATE INDEX, EXPLAIN
  • MySQL docs: InnoDB indexes, EXPLAIN
  • Cassandra docs: Primary key, clustering, secondary indexes
  • RocksDB & LevelDB docs: LSM architecture
  • Papers:
    • "The Design and Implementation of Modern Column-Oriented DBMSs"
    • "The Case for Learned Index Structures" — Kraska et al.
    • Research on adaptive indexing and fractal trees.

Closing notes

Indexing is more art than science: it combines theory (B-trees, LSM), empirical observation (run EXPLAIN ANALYZE), and pragmatic trade-offs (storage vs write cost). For backend engineers building production systems, indexes should be designed iteratively: instrument, hypothesize, measure, and refine. Keep an eye on changing workload characteristics — the indexes that were perfect last month may be the drag on performance next month.

If you want, I can:

  • Analyze an EXPLAIN ANALYZE output from your production query and recommend indexes.
  • Suggest concrete index designs for a table schema and a set of queries.
  • Provide scripts to find unused indexes and estimate write cost per index for PostgreSQL or MySQL.