# Database Sharding: Scaling Beyond a Single Node

> **Series:** System Design · Data & Storage — Pillar 4 of 8

## Systems Design

| # | Post | What it covers |
|---|------|----------------|
| 00 | [Data & Storage: Where Everything Lives](/data-storage-where-everything-lives) | Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars) |
| 01 | [SQL vs NoSQL: Choosing the Right Database](/sql-vs-nosql-choosing-the-right-database) | SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call. |
| 02 | [Database Indexing: The Highest-Leverage Performance Tool](/database-indexing-the-highest-leverage-performance-tool) | Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one. |
| 03 | [B-Trees & B+ Trees: The Data Structure Behind Database Indexes](/b-trees-b-trees-the-data-structure-behind-database-indexes) | Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries. |
| 04 | [LSM Trees: Why Some Databases Are Built for Writes](/lsm-trees-why-some-databases-are-built-for-writes) | LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it. |
| 05 | [Denormalisation: Trading Storage for Speed](/denormalisation-trading-storage-for-speed) | Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely. |
| 06 | **Database Sharding: Scaling Beyond a Single Node** ← you are here | Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces. |
| 07 | [Data Partitioning: Choosing How to Divide Your Data](/data-partitioning-choosing-how-to-divide-your-data) | Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale. |
| 08 | [Consistent Hashing: Minimising Resharding Pain](/consistent-hashing-minimising-resharding-pain) | Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems. |
| 09 | [Replication & Read Replicas: Scaling Reads and Surviving Failures](/replication-read-replicas-scaling-reads-and-surviving-failures) | Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them. |
| 10 | [Object Storage: Unlimited Scale for Large Binary Data](/object-storage-unlimited-scale-for-large-binary-data) | Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it. |
| 11 | [Block vs File vs Object Storage: Three Models, Three Use Cases](/block-vs-file-vs-object-storage-three-models-three-use-cases) | Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them. |
| 12 | [Distributed File Systems: File Storage Across Many Machines](/distributed-file-systems-file-storage-across-many-machines) | Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them. |
| 13 | [Time Series Databases: Built for Metrics and Events](/time-series-databases-built-for-metrics-and-events) | Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB. |
| 14 | [Vector Databases: Semantic Search and AI Memory](/vector-databases-semantic-search-and-ai-memory) | Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one. |
| 15 | [Full-Text Search Engines: Beyond SQL LIKE](/full-text-search-engines-beyond-sql-like) | Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful. |
| 16 | [Materialized Views: Pre-Computing Expensive Queries](/materialized-views-pre-computing-expensive-queries) | Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches. |
| 17 | [Query Optimisation: From Slow to Fast](/query-optimisation-from-slow-to-fast) | Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast. |
| 18 | [Connection Pooling: Managing the Hidden Bottleneck](/connection-pooling-managing-the-hidden-bottleneck) | Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly. |
| 19 | [Data & Storage: Wrap-Up](/data-storage-wrap-up) | A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system. |

---

# Database Sharding: Scaling Beyond a Single Node

## The problem

Your URL shortener's PostgreSQL instance is running on the largest single machine available: 96 CPU cores, 768GB RAM, 20TB NVMe storage, and 100Gbps network. You've tuned every parameter, added every index, and the database is still the bottleneck.

The `clicks` table has 500 billion rows. The dataset is 80TB — it won't fit on any single machine. Write throughput peaks at 2 million clicks per second during major events. Even the largest PostgreSQL instance in the world can't absorb that write rate.

You've hit the vertical scaling ceiling. The only way forward is horizontal — splitting the data across multiple database nodes, each responsible for a subset of the data. This is sharding.

---

## The core idea

Sharding (also called horizontal partitioning) divides a database's data across multiple physical nodes called shards. Each shard holds a subset of the rows and is responsible for reads and writes on its subset. The collection of shards together holds the complete dataset.

---

## The analogy: a library network expanding to multiple branches

A single library serves a city well — until the city grows, the collection exceeds available shelf space, and the number of visitors overwhelms the building. The solution isn't a taller building. It's opening branches.

Each branch holds a portion of the collection: the central branch holds A–F, the east branch holds G–P, the west branch holds Q–Z. Finding a book requires knowing which branch to visit. A request for a book that spans multiple branches requires coordination between branches.

Sharding is the same strategy applied to databases. Each shard is a branch. The routing layer (which branch holds this book?) is the shard key logic. Cross-shard queries (give me books from both the A–F and G–P collections) require coordination between shards — the expensive operation that makes sharding complex.

---

## How it works

### The shard key

The shard key is the value used to determine which shard holds a given row. Every row in a sharded table has a shard key value; that value determines its shard.

For the URL shortener's clicks table, `link_id` is a natural shard key: all clicks for a given link live on the same shard, making per-link analytics queries shard-local (they don't require cross-shard coordination).

```
Shard 0: link_ids 0–24,999,999
Shard 1: link_ids 25,000,000–49,999,999
Shard 2: link_ids 50,000,000–74,999,999
Shard 3: link_ids 75,000,000–99,999,999
```

### Shard key selection criteria

**Cardinality:** the shard key must have many distinct values. Sharding on a boolean (two values) means two shards maximum. Sharding on `user_id` or `link_id` (millions of values) provides fine-grained distribution.

**Even distribution:** data must distribute approximately evenly across shards. A shard key that causes 70% of data to land on one shard creates a "hot shard" — one node overwhelmed while others are underutilised.

**Query alignment:** the most frequent queries should be shard-local. If most queries filter by `user_id`, shard by `user_id`. If your most frequent query requires filtering by a column that isn't the shard key, that query is a cross-shard scatter-gather — much more expensive.

**Immutability:** the shard key should not change after row creation. Moving a row from one shard to another (because its shard key changed) requires deleting from the old shard and inserting into the new one — expensive and error-prone.

---

### Sharding strategies

**Range-based sharding**

Rows with shard key values in a given range go to a specific shard. Simple to implement; easy to understand which shard holds which data.

```
User IDs 1–1,000,000 → Shard A
User IDs 1,000,001–2,000,000 → Shard B
User IDs 2,000,001–3,000,000 → Shard C
```

Problem: **range hotspots**. New users are always assigned incrementing IDs, so new writes always hit the highest-numbered shard. That shard bears all the write load; older shards are mostly read-only. This is the "hot tail" problem.

**Hash-based sharding**

Apply a hash function to the shard key and assign the row to a shard based on the hash value.

```
shard = hash(link_id) % num_shards
```

Hash-based sharding distributes writes evenly across shards — consecutive values hash to different buckets. The problem: **rebalancing is expensive**. When you add a fourth shard to a three-shard cluster, the modulus changes. `hash(key) % 3` and `hash(key) % 4` produce different results for most keys. Nearly every row needs to move to a different shard — a massive, expensive data migration. Consistent hashing (covered in the next post) solves this problem.

**Directory-based sharding**

A lookup table (the "directory") maps each shard key value (or range) to a shard. The routing layer consults the directory for every query.

```
link_id x7Kp2 → Shard 2
link_id pQ9mX → Shard 5
```

Pros: flexible, no mathematical constraint on distribution. Cons: the directory is a single point of failure and a bottleneck. The directory must be highly available and fast.

---

### Cross-shard queries

The most significant operational challenge of sharding is that many queries become cross-shard.

**Scatter-gather:** a query that doesn't include the shard key must be sent to all shards in parallel, and the results assembled. "Show me all links created today" — without a `link_id` filter, this query goes to every shard.

```
Query → Shard 0 ┐
Query → Shard 1 ├→ Aggregate results → Return to client
Query → Shard 2 │
Query → Shard 3 ┘
```

Latency is dominated by the slowest shard. Aggregate functions (COUNT, SUM, AVG) must be computed per-shard and merged.

**Cross-shard joins:** joining two sharded tables on a non-shard key requires either loading data from multiple shards and joining in the application layer, or denormalising to avoid the join.

**Cross-shard transactions:** ACID transactions that touch rows on multiple shards require distributed transaction protocols (two-phase commit, etc.) — complex, slow, and a potential availability problem if one shard is unavailable.

---

### Application-level vs middleware sharding

**Application-level sharding:** the application code decides which shard to query based on the shard key. Simple to implement; requires each service to understand the sharding logic.

**Middleware sharding proxy (e.g. Vitess, Citus):** a proxy layer between the application and the database shards handles routing transparently. The application thinks it's talking to one database; the proxy routes queries to the right shard(s). More complex to operate, but hides sharding from application developers.

---

## The tradeoffs

**Operational complexity.** A sharded database is multiple databases. Each shard needs backups, monitoring, failover, and capacity management. Schema migrations must be applied to every shard. Incidents affecting one shard can be hard to diagnose.

**Loss of ACID across shards.** Transactions spanning multiple shards lose the simplicity of single-node ACID. Distributed transactions are possible but expensive.

**Cross-shard query overhead.** Any query that doesn't include the shard key is a scatter-gather. If your access patterns aren't well-aligned with your shard key, many queries become expensive.

**Resharding.** As data grows, you'll eventually need to add shards. With hash-based sharding, this requires migrating large amounts of data. Plan for resharding from the beginning, or use consistent hashing.

**The bar for sharding is high.** A single PostgreSQL instance, properly tuned with read replicas, can handle terabytes of data and tens of thousands of queries per second. Most applications never need to shard. Sharding before hitting genuine scale adds enormous complexity with no benefit.

---

## When to shard

**Shard when:**
- The dataset genuinely exceeds what a single node can hold
- Write throughput exceeds what a single primary node can handle (read replicas help with reads but not writes)
- You need to place data in specific geographic regions for compliance or latency reasons

**Don't shard when:**
- You're solving a read problem (use read replicas instead — much simpler)
- Your data fits comfortably on a single large instance
- You haven't yet exhausted vertical scaling options (bigger instances are often available and far simpler)
- You're doing it speculatively for "future scale"

---

## The one thing to remember

> **Sharding solves the problem of data and write volume exceeding a single node's capacity — but it does so at the cost of significant operational complexity and the loss of cross-shard ACID transactions.** The shard key is the most consequential design decision in a sharded system: a poor choice leads to hot shards, cross-shard scatter-gathers, and expensive resharding. Choose the key that aligns with your most frequent access patterns and distributes data evenly — or pay the consequences in production.

---

*← Previous: **[Denormalisation: Trading Storage for Speed](/denormalisation-trading-storage-for-speed)** — Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to...*

*→ Next: **[Data Partitioning: Choosing How to Divide Your Data](/data-partitioning-choosing-how-to-divide-your-data)** — Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, m...*

