B-Trees & B+ Trees: The Data Structure Behind Database Indexes

Series: System Design · Data & Storage — Pillar 4 of 8
Systems Design
| # | Post | What it covers |
|---|---|---|
| 00 | Data & Storage: Where Everything Lives | Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars) |
| 01 | SQL vs NoSQL: Choosing the Right Database | SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call. |
| 02 | Database Indexing: The Highest-Leverage Performance Tool | Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one. |
| 03 | B-Trees & B+ Trees: The Data Structure Behind Database Indexes ← you are here | Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries. |
| 04 | LSM Trees: Why Some Databases Are Built for Writes | LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it. |
| 05 | Denormalisation: Trading Storage for Speed | Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely. |
| 06 | Database Sharding: Scaling Beyond a Single Node | Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces. |
| 07 | Data Partitioning: Choosing How to Divide Your Data | Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale. |
| 08 | Consistent Hashing: Minimising Resharding Pain | Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems. |
| 09 | Replication & Read Replicas: Scaling Reads and Surviving Failures | Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them. |
| 10 | Object Storage: Unlimited Scale for Large Binary Data | Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it. |
| 11 | Block vs File vs Object Storage: Three Models, Three Use Cases | Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them. |
| 12 | Distributed File Systems: File Storage Across Many Machines | Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them. |
| 13 | Time Series Databases: Built for Metrics and Events | Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB. |
| 14 | Vector Databases: Semantic Search and AI Memory | Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one. |
| 15 | Full-Text Search Engines: Beyond SQL LIKE | Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful. |
| 16 | Materialized Views: Pre-Computing Expensive Queries | Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches. |
| 17 | Query Optimisation: From Slow to Fast | Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast. |
| 18 | Connection Pooling: Managing the Hidden Bottleneck | Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly. |
| 19 | Data & Storage: Wrap-Up | A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system. |
B-Trees & B+ Trees: The Data Structure Behind Database Indexes
The problem
You added an index to the links table and queries went from seconds to milliseconds. But someone on your team asks: why? What is the database actually doing differently?
"There's a data structure" is not an answer that satisfies a curious engineer. And it matters more than you think: the specific shape of a B+ tree is why the leftmost column rule exists, why range queries on indexed columns are fast, why composite indexes help some queries and not others, and why index depth affects query performance on very large tables.
Understanding the data structure behind the index isn't academic trivia. It's the model that lets you reason about which indexes to create and why.
The core idea
A B+ tree is a self-balancing tree data structure where every leaf node is at the same depth, internal nodes store keys used for navigation, and all actual data pointers live in the leaf nodes, which are linked together in a sorted chain. This structure guarantees O(log n) lookup, O(log n) insertion, and efficient range scans — all on data far too large to fit in memory.
The analogy: a library catalogue with a sorted shelf
Imagine a library where books are shelved alphabetically by author. To find all books by authors from "King" to "McEwan":
- Start at the catalogue index — the top-level guide showing which floor and aisle covers "K"
- Walk to the aisle, then to the shelf labelled "K"
- Scan along the shelf from "King" until you reach "McEwan", collecting every book in between
The catalogue is the internal nodes of the B+ tree — purely navigational, not the actual data. The shelves with the books are the leaf nodes — the actual data, sorted and physically adjacent. The ability to scan along the shelf once you've found the start is the linked leaf list.
Without this structure, finding all books from K to M means checking every book in the library. With it, you navigate once and scan linearly. This is why B+ trees make range queries fast.
How it works
The B-tree
A B-tree (balanced tree) has three properties that make it suited to database storage:
Balanced: all leaf nodes are at the same depth. There's no "lucky path" where some lookups are faster than others — every key lookup traverses exactly the same number of levels.
Wide, not deep: each node can hold many keys (often hundreds), not just two like a binary tree. A B-tree with branching factor 200 and 50 million rows has depth log₂₀₀(50,000,000) ≈ 4 levels. Four disk reads to find any row in a 50 million row table.
On-disk friendly: each node is sized to match a disk page (typically 8KB or 16KB). Reading a node requires exactly one disk I/O. Binary trees need one disk read per level and may have millions of levels on large datasets. B-trees minimise disk reads by keeping branching factor high.
The B+ tree (what databases actually use)
PostgreSQL, MySQL InnoDB, and most relational databases use B+ trees rather than plain B-trees. The key differences:
Internal nodes store keys only, not data. In a plain B-tree, internal nodes can store full row data. In a B+ tree, internal nodes store only keys used for navigation. This means more keys fit in each internal node, increasing the branching factor and reducing tree depth.
All data lives in leaf nodes. Leaf nodes contain the indexed key value and a pointer to the actual row on disk (the "heap file pointer" in PostgreSQL). Retrieving the full row requires following this pointer — unless the index is a covering index, in which case the leaf node contains everything the query needs.
Leaf nodes are linked. Adjacent leaf nodes hold pointers to the next leaf node, forming a sorted linked list of all indexed keys. This is the feature that makes range queries efficient.
A lookup: finding short_code = 'x7Kp2'
Starting from the root (internal node):
Root: ['aaa', 'mno', 'zzz']
↓ ↓ ↓
[A–M] [M–Z] [>Z]
- Compare
'x7Kp2'against root keys → falls in[M–Z]branch - Navigate to the M–Z internal node, compare again → narrow to sub-range
- Continue navigating internal nodes (3–4 levels for 50M rows)
- Arrive at the leaf node containing
'x7Kp2'→ read the row pointer - Follow the pointer to the actual row in the table heap
Total: 4–5 node reads (disk I/Os), not 50 million row comparisons.
A range scan: finding all links created between two dates
SELECT * FROM links WHERE created_at BETWEEN '2025-01-01' AND '2025-01-31';
With a B+ tree index on created_at:
- Navigate the tree to the leaf node containing
'2025-01-01' - Follow the linked-list pointer to the next leaf, then the next, collecting all rows with
created_at ≤ '2025-01-31' - Stop when the key value exceeds the upper bound
The linked leaf list means range scans are sequential reads along the leaf level — fast, cache-friendly, and proportional only to the number of matching rows, not the total table size.
This is why B+ trees handle range queries well. It's also why a B+ tree index on a column used in ORDER BY eliminates the need for a sort step — the data is already in order at the leaf level.
Why the leftmost column rule exists
A composite index on (user_id, created_at) is a B+ tree whose keys are tuples (user_id, created_at), sorted first by user_id, then by created_at within each user_id.
Leaf nodes (simplified):
... | (101, 2025-01-01) | (101, 2025-03-15) | (102, 2024-12-01) | (103, 2025-02-28) | ...
A query on WHERE user_id = 101 can navigate the tree because the keys are sorted by user_id first — the tree knows exactly where user_id = 101 entries start.
A query on WHERE created_at = '2025-01-01' cannot use this index to jump to the right place — the leaf nodes aren't sorted by created_at globally. created_at = '2025-01-01' could appear under any user_id. The index provides no narrowing.
A query on WHERE user_id = 101 AND created_at = '2025-01-01' can navigate to the exact leaf node because it provides both components of the sort key in order.
This is the leftmost column rule: a composite index is usable only when the query provides a prefix of the indexed columns from left to right.
Index depth and performance
For a B+ tree with branching factor 200:
- 1 level: 200 leaf nodes → up to ~1,600 keys (assuming 8 keys per leaf)
- 2 levels: 200² = 40,000 leaf nodes → up to ~320,000 keys
- 3 levels: 200³ = 8M leaf nodes → up to ~64M keys
- 4 levels: 200⁴ = 1.6B leaf nodes → up to ~12.8B keys
Most production databases operate at depth 3–4, meaning 3–4 disk reads to find any row regardless of table size. This is why B+ trees scale so well — adding more rows makes the tree one level deeper only at enormous scales.
In practice, the upper levels of the tree (root and the first few internal levels) are almost always in the database's buffer pool (in-memory cache), so most index lookups require only 1–2 actual disk reads.
The tradeoffs
Read performance vs write performance. Every insert requires finding the right position in the B+ tree and inserting the key. When a leaf node is full, it splits — a node split propagates up the tree. Splitting is fast, but it's extra work that a sequential write (no index) doesn't pay. High write rates on heavily indexed tables feel this overhead.
Page fill factor. Databases let you configure the initial fill factor for index pages — the fraction of each page to fill when initially building the index. A fill factor of 70% leaves 30% of each page empty, reducing split frequency on tables with frequent inserts. The cost is a larger index on disk and potentially more levels.
Index bloat. PostgreSQL's MVCC (multi-version concurrency control) means deleted rows leave "dead" index entries that must be reclaimed by VACUUM. Tables with high update/delete rates accumulate bloat, requiring regular maintenance.
When this matters
Understanding B+ tree structure helps you:
- Design composite indexes correctly — put the most selective column first for equality queries; think about range vs equality when ordering columns
- Understand covering indexes — an index that includes all queried columns means no heap lookup, because everything needed is in the leaf node
- Reason about index size — the index depth determines how many I/Os a lookup costs; on very large tables with narrow pages, a deep tree can mean more disk reads
- Explain range query performance — "why does
ORDER BY created_atbecome fast after adding this index?" because the leaf level is already sorted
The one thing to remember
A B+ tree index makes lookup fast by navigating from root to leaf in O(log n) steps, and makes range scans fast by linking leaf nodes in sorted order. The tree is sorted by the indexed key(s) from left to right — which is exactly why providing the leftmost column(s) of a composite index allows the database to navigate the tree, and providing only rightmost columns doesn't.
← Previous: Database Indexing: The Highest-Leverage Performance Tool — Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide wh...
→ Next: LSM Trees: Why Some Databases Are Built for Writes — LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade...




