Database Indexing

Indexing is the single most impactful concept in database performance. An index is a precomputed shortcut that lets the database find matching rows without scanning every row in the table.

The library analogy

Imagine a library with a million books stacked in no order.

  • No index → to find “all books by Dostoevsky”, you read the title page of every book. One million lookups. This is a table scan.
  • With an index → there’s a separate card catalog sorted by author. You flip to “D”, find Dostoevsky’s cards, each tells you exactly which shelf to go to. Maybe a dozen lookups total.

That’s what a database index is: a separate, sorted data structure that points back to the actual rows.

What an index actually is

Most databases use B-tree (balanced tree) indexes by default. A B-tree keeps keys sorted in a tree structure where each node has many children (hundreds is typical).

                     [M]
                  /        \
             [F]            [T]
           /     \        /     \
        [B,D]  [H,J,L] [O,R]  [U,W,Y]

Properties:

  • O(log n) search, insert, delete — a billion-row table takes about 30 key comparisons (log2 of 10^9 ≈ 30), and with a fanout in the hundreds only three or four node visits
  • Leaves are linked, so range queries (WHERE age BETWEEN 20 AND 30) walk the leaves sequentially
  • Naturally sorted, so ORDER BY can sometimes skip sorting entirely
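The range-query property is easy to observe. A minimal sketch using Python's built-in sqlite3 (SQLite's ordinary indexes are B-trees; the table and index names here are invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, age INTEGER)")
con.executemany("INSERT INTO people (age) VALUES (?)",
                [(i % 90,) for i in range(10_000)])
con.execute("CREATE INDEX idx_age ON people (age)")

# The range predicate is answered by walking one contiguous slice of the
# sorted index rather than scanning the whole table.
plan = [row[3] for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT age FROM people WHERE age BETWEEN 20 AND 30")]
print(plan)  # e.g. ['SEARCH people USING COVERING INDEX idx_age (age>? AND age<?)']
```

The exact plan wording varies between SQLite versions, but a SEARCH on idx_age (rather than a SCAN of the table) is the point.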

Other index types exist for special workloads:

  • Hash — exact match only (no range queries); faster than B-tree for = lookups
  • GIN (PostgreSQL) — inverted index for arrays, JSON, full-text search
  • GiST — geospatial, range types
  • BRIN — block-range, great for huge time-series tables with natural ordering

What gets indexed (and what doesn’t)

By default:

  • Primary key → always indexed (automatically)
  • Unique constraints → automatically indexed
  • Foreign keys → often not automatically indexed (depends on the DB) — a common source of slow queries

You add indexes manually for:

  • Columns you filter by in WHERE clauses
  • Columns you join on
  • Columns you sort by in ORDER BY
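A sketch of the payoff for the last two cases, again with Python's sqlite3 on a hypothetical orders table: adding the index turns the WHERE filter into an index search and lets ORDER BY skip its explicit sort step.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

def plan(sql):
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

before_where = plan("SELECT * FROM orders WHERE customer_id = 42")
before_sort  = plan("SELECT * FROM orders ORDER BY customer_id")

con.execute("CREATE INDEX idx_customer ON orders (customer_id)")

after_where = plan("SELECT * FROM orders WHERE customer_id = 42")
after_sort  = plan("SELECT * FROM orders ORDER BY customer_id")

print(before_where)  # full table scan
print(before_sort)   # full scan plus a 'USE TEMP B-TREE FOR ORDER BY' step
print(after_where)   # index search
print(after_sort)    # ordered index walk, no sort step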

The four levels of query efficiency

Learning to read query plans (EXPLAIN in most databases) is the key skill. From worst to best:

  • Sequential scan — read every row. Fine for small tables, or when most rows match.
  • Index scan — use the index to find matching rows, then fetch each row from the table. Good for selective queries.
  • Index skip/range scan — walk only a small portion of the index. Typical for range queries on indexed columns.
  • Index-only scan — every column the query needs is in the index, so the table is never touched. Requires carefully designed covering indexes.

A sequential scan over a billion-row table is the classic “my query is slow” root cause.
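Three of these levels are easy to reproduce with Python's sqlite3 (table and index names invented): SQLite prints SCAN for a sequential scan, SEARCH ... USING INDEX for an index scan, and SEARCH ... USING COVERING INDEX for an index-only scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b TEXT)")

def plan(sql):
    return "; ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

seq_scan = plan("SELECT * FROM t WHERE a = 1")    # no index yet: read every row
con.execute("CREATE INDEX idx_a ON t (a)")
index_scan = plan("SELECT * FROM t WHERE a = 1")  # find rowids in index, fetch rows
index_only = plan("SELECT a FROM t WHERE a = 1")  # answered from the index alone

print(seq_scan)    # e.g. SCAN t
print(index_scan)  # e.g. SEARCH t USING INDEX idx_a (a=?)
print(index_only)  # e.g. SEARCH t USING COVERING INDEX idx_a (a=?)
```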

The cost of indexes

Indexes aren’t free:

  1. Disk space — a B-tree index on a 1 TB table can easily be 100 GB. Multiply by N indexes.
  2. Write slowdown — every INSERT / UPDATE / DELETE has to update every index on the table.
  3. Cache pressure — indexes compete with data for RAM.

Rule of thumb: every index helps some queries and hurts all writes. Only add an index if you can point to the query it accelerates.
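The disk-space cost is directly measurable. A sketch with Python's sqlite3, using the page_count pragma to watch the database grow when an index is added (table name and row sizes are arbitrary):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, payload TEXT)")
con.executemany("INSERT INTO events (kind, payload) VALUES (?, ?)",
                [("click", "x" * 50) for _ in range(5_000)])

pages_before = con.execute("PRAGMA page_count").fetchone()[0]
con.execute("CREATE INDEX idx_kind ON events (kind)")
pages_after = con.execute("PRAGMA page_count").fetchone()[0]

# The index is a real on-disk structure: the database got bigger.
print(pages_before, pages_after)
```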

Composite indexes

An index on (country, city, age) — the order matters.

This index accelerates:

  • WHERE country = 'Georgia'
  • WHERE country = 'Georgia' AND city = 'Tbilisi'
  • WHERE country = 'Georgia' AND city = 'Tbilisi' AND age = 30

It does not accelerate:

  • WHERE city = 'Tbilisi' ✗ (can’t use the index; wrong leading column)
  • WHERE age = 30 ✗ (same reason: the leading column, country, is missing)

Leftmost-prefix rule: a composite index on (a, b, c) helps queries that filter on a, or a+b, or a+b+c — not on b alone or c alone.
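The leftmost-prefix rule can be checked directly (a sqlite3 sketch with invented names; the extra name column keeps the index from accidentally covering the whole query):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT, country TEXT, city TEXT, age INTEGER)")
con.execute("CREATE INDEX idx_loc ON users (country, city, age)")

def plan(sql):
    return "; ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Leading column present: the composite index is used.
with_prefix = plan("SELECT * FROM users WHERE country = 'Georgia' AND city = 'Tbilisi'")
# Leading column missing: full scan, despite city being in the index.
without_prefix = plan("SELECT * FROM users WHERE city = 'Tbilisi'")

print(with_prefix)     # e.g. SEARCH users USING INDEX idx_loc (country=? AND city=?)
print(without_prefix)  # e.g. SCAN users
```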

Covering indexes / “include” columns

An index that contains all the columns a query needs is a covering index → the database answers the query entirely from the index, never touching the table. Dramatic speedup.

CREATE INDEX ON orders (customer_id) INCLUDE (total, order_date);  -- INCLUDE is PostgreSQL 11+ syntax
-- Query: SELECT total, order_date FROM orders WHERE customer_id = 42;
-- Reads just the index. The table is untouched.
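SQLite has no INCLUDE clause, but the same index-only effect can be sketched by putting the extra columns into the index key itself (Python sqlite3; the plan says COVERING INDEX when the table is never touched):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, customer_id INTEGER,
    total REAL, order_date TEXT, note TEXT)""")
# No INCLUDE in SQLite: appending total and order_date to the key
# gives the same covering behaviour, at the cost of a wider key.
con.execute("CREATE INDEX idx_cover ON orders (customer_id, total, order_date)")

plan = "; ".join(row[3] for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT total, order_date FROM orders WHERE customer_id = 42"))
print(plan)  # e.g. SEARCH orders USING COVERING INDEX idx_cover (customer_id=?)
```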

When indexes don’t help

  • Low-selectivity columns — an index on is_active (boolean) is usually worse than a table scan, because roughly every row matches anyway
  • Functions on indexed columns — WHERE LOWER(email) = '...' can’t use an index on email unless you create a functional index on LOWER(email)
  • Leading wildcards — WHERE name LIKE '%foo%' can’t use a standard index; needs full-text search
  • Implicit type conversion — WHERE id = '42' where id is numeric can silently skip the index
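The function-call pitfall and its fix are both visible in a plan (a sqlite3 sketch; SQLite calls a functional index an expression index, and the extra name column keeps the plain index from covering the query):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
con.execute("CREATE INDEX idx_email ON accounts (email)")

def plan(sql):
    return "; ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# The function call hides the column from the plain index: full scan.
masked = plan("SELECT * FROM accounts WHERE lower(email) = 'a@example.com'")

# An expression (functional) index on lower(email) makes it searchable again.
con.execute("CREATE INDEX idx_email_lower ON accounts (lower(email))")
fixed = plan("SELECT * FROM accounts WHERE lower(email) = 'a@example.com'")

print(masked)  # e.g. SCAN accounts
print(fixed)   # e.g. SEARCH accounts USING INDEX idx_email_lower (<expr>=?)
```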

How to find missing indexes

Three practical techniques:

  1. Slow query log — enable it, watch for queries that take >100 ms (or whatever your threshold is). Look at the top 10 offenders.
  2. EXPLAIN / EXPLAIN ANALYZE — run on suspect queries. Find the sequential scans.
  3. Index advisors — pg_stat_statements + extensions in Postgres, Query Store in SQL Server, Performance Schema in MySQL — all can surface the queries that would benefit most.

How to find useless indexes

  • pg_stat_user_indexes in Postgres shows “how many times this index was actually used.” Zero uses over weeks → probably a candidate for removal.
  • Duplicate indexes (same columns, same order) exist surprisingly often in legacy systems. Drop them.
  • Indexes on columns that are never queried. Drop them.

See also