# Network Partitions: The Failure Mode You Can't Design Away

> **Series:** System Design · Distributed Systems — Pillar 8 of 8

## Systems Design

| # | Post | What it covers |
| --- | --- | --- |
| 00 | [Distributed Systems: What Happens When Machines Disagree](/distributed-systems-what-happens-when-machines-disagree) | Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar. |
| 01 | **Network Partitions: The Failure Mode You Can't Design Away** ← you are here | Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice. |
| 02 | [Split-Brain: When Two Nodes Both Think They're the Leader](/split-brain-when-two-nodes-both-think-theyre-the-leader) | Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it. |
| 03 | [Heartbeats: How Nodes Know Their Peers Are Alive](/heartbeats-how-nodes-know-their-peers-are-alive) | Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work. |
| 04 | [Leader Election: Agreeing on Who's in Charge](/leader-election-agreeing-on-whos-in-charge) | Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve. |
| 05 | [Consensus Algorithms: Agreeing on a Value Across Failures](/consensus-algorithms-agreeing-on-a-value-across-failures) | Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used. |
| 06 | [Quorum: How Many Nodes Must Agree?](/quorum-how-many-nodes-must-agree) | Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB. |
| 07 | [Paxos: The Algorithm That Started It All](/paxos-the-algorithm-that-started-it-all) | Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production. |
| 08 | [Raft: Consensus Made Understandable](/raft-consensus-made-understandable) | Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV. |
| 09 | [Gossip Protocol: Decentralised Cluster Communication](/gossip-protocol-decentralised-cluster-communication) | Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production. |
| 10 | [Logical Clocks: When Physical Time Isn't Enough](/logical-clocks-when-physical-time-isnt-enough) | Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works. |
| 11 | [Lamport Timestamps: Ordering Events Without a Global Clock](/lamport-timestamps-ordering-events-without-a-global-clock) | Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you. |
| 12 | [Vector Clocks: Knowing When Events Are Truly Concurrent](/vector-clocks-knowing-when-events-are-truly-concurrent) | Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations. |
| 13 | [Distributed Transactions: When One Machine Isn't Enough](/distributed-transactions-when-one-machine-isnt-enough) | Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative. |
| 14 | [Two-Phase Commit: Coordinating a Distributed Decision](/two-phase-commit-coordinating-a-distributed-decision) | 2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems. |
| 15 | [Three-Phase Commit: Solving 2PC's Blocking Problem](/three-phase-commit-solving-2pcs-blocking-problem) | 3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production. |
| 16 | [Delivery Semantics: What Does "Delivered" Actually Mean?](/delivery-semantics-what-does-delivered-actually-mean) | Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate. |
| 17 | [Change Data Capture: Streaming Your Database in Real Time](/change-data-capture-streaming-your-database-in-real-time) | CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it. |
| 18 | [Erasure Coding: Fault Tolerance Without Full Replication](/erasure-coding-fault-tolerance-without-full-replication) | Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication. |
| 19 | [Merkle Trees: Efficiently Finding What's Different](/merkle-trees-efficiently-finding-whats-different) | Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy. |
| 20 | [Observability: Understanding Your System at Runtime](/observability-understanding-your-system-at-runtime) | Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together. |
| 21 | [Distributed Systems: Wrap-Up](/distributed-systems-wrap-up) | A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series. |

* * *

# Network Partitions: The Failure Mode You Can't Design Away

## The problem

Three data centres. A fibre cable gets accidentally cut during construction work. For 47 seconds, the machines in data centre A can't reach the machines in data centres B and C. The machines in A are fine. The machines in B and C are fine. The network between them is not.

During those 47 seconds, every distributed system spanning the partition faces the same question: what do we do?

Do we keep accepting requests and risk serving inconsistent data? Do we refuse requests and guarantee consistency at the cost of availability? There is no option that avoids all harm — the partition forces a choice.

* * *

## The core idea

A network partition is a failure in which a subset of nodes in a distributed system cannot communicate with another subset, even though both subsets may be individually healthy. Partitions are not hypothetical — they happen regularly due to network hardware failure, misconfiguration, routing changes, or even software bugs in network stacks. Any distributed system must have an explicit policy for what it does during a partition.

* * *

## The analogy: a committee split across two rooms

A committee of seven members normally meets in one room and votes on decisions. A fire alarm evacuates them into two separate rooms with no communication between them. Both rooms want to continue working.

If both rooms make decisions independently, they may contradict each other — Committee A votes to approve project X while Committee B votes to reject it. The organisation is now in an inconsistent state.

If both rooms wait for the fire to be resolved before deciding anything, no work gets done during the alarm — but the decisions that are eventually made will be consistent.

This is the partition choice, exactly: serve potentially inconsistent results (continue working), or stop and wait for the partition to heal (preserve consistency).

* * *

## How network partitions work (interactive diagram)

%%[partition-cap-choice-widget] 

### What causes a partition

*   A network switch, router, or cable fails
    
*   A firewall misconfiguration drops traffic between subnets
    
*   A data centre's upstream connection is severed (backhoe through a fibre conduit)
    
*   A software bug causes a node to stop responding to heartbeats (appears partitioned even if the network is fine)
    
*   A cloud provider has a zone failure that isolates one availability zone from others
    

### The CAP theorem revisited

The CAP theorem (covered in Foundations, Pillar 1) states that in the presence of a network partition, a distributed system must choose between:

*   **Consistency (C):** every read receives the most recent write or an error
    
*   **Availability (A):** every request receives a response (not necessarily the most recent write)
    

You cannot have both during a partition. This isn't a design choice you can engineer your way out of — it's a mathematical consequence of the fact that two nodes that can't communicate cannot simultaneously guarantee both properties.

**CP systems** (prefer consistency over availability during a partition): refuse requests from the partitioned minority, or stop accepting writes entirely, until the partition heals. ZooKeeper, etcd, HBase.

**AP systems** (prefer availability over consistency during a partition): continue serving requests on both sides of the partition, accepting that the two sides may diverge. Cassandra, DynamoDB, CouchDB.

### PACELC: a more nuanced model

%%[partition-pacelc-widget] 

CAP only addresses behaviour during a partition. PACELC extends it: even when the system is not partitioned (normal operation), there's a tradeoff between latency and consistency.

```plaintext
PACELC:
  During a Partition: choose between Availability and Consistency
  Else (no partition): choose between Latency and Consistency
```

A system that requires synchronous replication before acknowledging a write is consistent but slow. A system that acknowledges writes before replication is fast but may briefly serve stale reads.

Most production systems fall somewhere between the extremes — they tune consistency and latency tradeoffs via replication factors, consistency levels, and quorum sizes.

### Partition behaviour: the URL shortener example

During the 47-second partition:

**Redis Cluster (AP-leaning):** slots owned by nodes on side A continue serving reads and writes. Slots owned by nodes on side B do the same. If a key was updated on side A, side B serves the old value. After healing, last-write-wins resolves conflicts — some writes are lost.

**PostgreSQL primary-replica (CP-leaning):** the primary is on side A. Replicas on side B lose contact with the primary. The synchronous replica on B cannot be promoted without quorum from A (using Patroni with proper fencing). Writes continue to the primary on side A. B-side replicas serve stale reads until the partition heals.

**Cassandra (AP, tunable):** with `CONSISTENCY = QUORUM` (requires 2 of 3 replicas), nodes on side A (2 nodes) can still satisfy quorum reads and writes. Side B (1 node) cannot satisfy quorum and returns an error. With `CONSISTENCY = ANY`, both sides accept writes — divergence occurs.

**Redirect service instances:** stateless — they route to whichever cache/database they can reach. A-side instances serve from Redis and PostgreSQL primary (consistent). B-side instances may serve from replicas (slightly stale). After healing, the system self-corrects.

### Partition healing

%%[partition-healing-widget] 

When connectivity is restored, systems must reconcile diverged state:

**Last-write-wins (LWW):** the most recent timestamp wins. Simple, but requires clocks to be synchronised (which they never perfectly are in distributed systems). Cassandra and DynamoDB use LWW.

**Vector clocks / version vectors:** detect concurrent writes and surface conflicts to the application for resolution. Riak and some Dynamo implementations use this approach.

**Read repair:** when a read request is served, the coordinator checks all replicas and updates the ones that have stale values. Cassandra does this transparently.

**Anti-entropy processes:** background processes periodically compare replica state using Merkle trees and synchronise diverged data. Covered in post 19.

* * *

## Tradeoffs

**There is no partition-tolerant system that avoids the CP/AP choice.** The choice is not whether to be affected by partitions — every distributed system is — but what the system does during one. The right choice depends on the use case:

*   **Financial transactions, coordination systems, configuration stores** → CP: a wrong answer is worse than no answer
    
*   **Shopping carts, social media counters, DNS, URL shortener redirects** → AP: an approximate answer is better than an error
    

**Partition frequency vs severity.** Short partitions (seconds) that heal quickly have different consequences than long partitions (minutes or hours). A system that queues writes during a short partition and flushes when it heals may be effectively both consistent and available for typical partition durations.

* * *

## The one thing to remember

> **Network partitions are not hypothetical — they happen regularly in any system spanning multiple machines or data centres.** Every distributed system must have an explicit policy: serve potentially stale data (AP), or refuse requests until the partition heals (CP). This is not a flaw — it's a fundamental property of systems that run across unreliable networks. The CAP theorem doesn't tell you which to choose; it tells you that you must choose, and that the choice has consequences.

* * *

*← Previous:* [***Distributed Systems: What Happens When Machines Disagree***](/distributed-systems-what-happens-when-machines-disagree) *— Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and ob...*

*→ Next:* [***Split-Brain***](/split-brain-when-two-nodes-both-think-theyre-the-leader) *— what happens when a partition causes two nodes to simultaneously believe they are the leader.*