Distributed Systems: What Happens When Machines Disagree

Series: System Design · Distributed Systems — Pillar 8 of 8
Systems Design
| # | Post | What it covers |
|---|---|---|
| 00 | Distributed Systems: What Happens When Machines Disagree ← you are here | Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar. |
| 01 | Network Partitions: The Failure Mode You Can't Design Away | Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice. |
| 02 | Split-Brain: When Two Nodes Both Think They're the Leader | Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it. |
| 03 | Heartbeats: How Nodes Know Their Peers Are Alive | Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work. |
| 04 | Leader Election: Agreeing on Who's in Charge | Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve. |
| 05 | Consensus Algorithms: Agreeing on a Value Across Failures | Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used. |
| 06 | Quorum: How Many Nodes Must Agree? | Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB. |
| 07 | Paxos: The Algorithm That Started It All | Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production. |
| 08 | Raft: Consensus Made Understandable | Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV. |
| 09 | Gossip Protocol: Decentralised Cluster Communication | Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production. |
| 10 | Logical Clocks: When Physical Time Isn't Enough | Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works. |
| 11 | Lamport Timestamps: Ordering Events Without a Global Clock | Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you. |
| 12 | Vector Clocks: Knowing When Events Are Truly Concurrent | Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations. |
| 13 | Distributed Transactions: When One Machine Isn't Enough | Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative. |
| 14 | Two-Phase Commit: Coordinating a Distributed Decision | 2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems. |
| 15 | Three-Phase Commit: Solving 2PC's Blocking Problem | 3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production. |
| 16 | Delivery Semantics: What Does "Delivered" Actually Mean? | Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate. |
| 17 | Change Data Capture: Streaming Your Database in Real Time | CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it. |
| 18 | Erasure Coding: Fault Tolerance Without Full Replication | Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication. |
| 19 | Merkle Trees: Efficiently Finding What's Different | Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy. |
| 20 | Observability: Understanding Your System at Runtime | Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together. |
| 21 | Distributed Systems: Wrap-Up | A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series. |
Distributed Systems: What Happens When Machines Disagree
The scenario
Your URL shortener runs across three data centres. The redirect service has thirty instances. The database has a primary and two replicas. There are six Cassandra nodes in the click events cluster. Redis runs as a six-node cluster.
On a Tuesday afternoon, a network switch in one data centre fails. For eleven seconds, the machines in that data centre can't reach the machines in the other two. Eleven seconds. During those eleven seconds:
- Which Redis nodes are the authoritative primaries?
- Do the Cassandra nodes that can't communicate with each other still accept writes?
- Does the PostgreSQL replica that was promoted to primary during the partition stay primary after the partition heals, or does the old primary try to reclaim its role?
- Two instances of the redirect service both tried to write an analytics record for the same click event at the same time — which one wins?
- A request was acknowledged before the partition; did the acknowledgement actually reach the client?
None of these questions are easy. Each one has an answer that involves careful tradeoffs between availability, consistency, and partition tolerance — the CAP theorem from Pillar 1. This final pillar is about the mechanisms that answer these questions in production systems.
Twenty concepts across six themes:
- Failure detection — how nodes know that other nodes have failed: Network Partitions, Split-Brain, Heartbeats
- Coordination — how distributed systems elect leaders and reach agreement: Leader Election, Consensus Algorithms, Quorum, Paxos, Raft
- Communication at scale — how nodes propagate information without centralised coordination: Gossip Protocol
- Time and ordering — how distributed systems reason about causality when clocks disagree: Logical Clocks, Lamport Timestamps, Vector Clocks
- Atomicity across machines — how distributed transactions work and why they're hard: Distributed Transactions, Two-Phase Commit, Three-Phase Commit, Delivery Semantics
- Data integrity and observability — how data is validated, replicated efficiently, and monitored: Change Data Capture, Erasure Coding, Merkle Trees, Observability
TL;DR: Network partitions are not hypothetical — they happen regularly, and systems must decide whether to stay available (serve potentially stale data) or stay consistent (refuse requests until partition heals). Consensus algorithms (Raft, Paxos) solve the problem of multiple nodes agreeing on a value despite failures, at the cost of latency and quorum requirements. Distributed transactions are expensive; most systems accept eventual consistency using delivery semantics and idempotency instead of guaranteeing atomicity. Observability — structured logs, metrics, and distributed traces — is not optional in a distributed system; it's how you know what the system is doing when you can't reason about it from first principles.
What this pillar covers
Failure detection
Network Partitions — the inevitable reality that network links fail, routers drop packets, and data centres become unreachable. How systems behave during a partition defines their consistency and availability properties.
Split-Brain — the dangerous scenario where a partition causes two nodes to simultaneously believe they are the leader or primary, both accepting writes independently. How systems detect and prevent it.
Heartbeats — the mechanism by which nodes signal liveness to each other. The basis of failure detection in every distributed database, coordinator, and orchestration system.
Coordination
Leader Election — the process by which a cluster of nodes agrees on exactly one node to act as the primary or coordinator. What happens when the leader fails and a new one must be chosen.
Consensus Algorithms — the family of algorithms (Paxos, Raft, Zab) that allow a group of nodes to agree on a value despite a minority of failures. The foundation of replicated state machines.
Quorum — the minimum number of nodes that must agree for an operation to succeed. How choosing the right quorum size balances availability against consistency.
Paxos — the foundational consensus algorithm. Notoriously difficult to understand but the basis for many production systems (Google Chubby, Zookeeper, CockroachDB).
Raft — the consensus algorithm designed for understandability. Powers etcd, CockroachDB, TiKV. The most important consensus algorithm to understand in practice.
Communication at scale
Gossip Protocol — a decentralised communication mechanism where nodes randomly exchange information with peers, eventually propagating updates to the entire cluster. Used by Cassandra, Redis Cluster, and Consul for cluster membership and state propagation.
Time and ordering
Logical Clocks — the conceptual foundation: physical clocks in distributed systems can't be trusted to agree, so logical clocks track causal relationships between events instead.
Lamport Timestamps — the simplest logical clock. Assigns a monotonically increasing integer to each event such that if A causes B, A's timestamp is less than B's.
Vector Clocks — extend Lamport timestamps to detect concurrent events (events with no causal relationship). Used by Dynamo, Riak, and distributed version control.
Atomicity across machines
Distributed Transactions — the problem of atomically updating data across multiple nodes or services. Why it's hard, what guarantees different approaches provide, and when to avoid them entirely.
Two-Phase Commit (2PC) — the classical protocol for distributed atomicity. A coordinator asks all participants to prepare, then commits or aborts based on their responses. Reliable but blocks on coordinator failure.
Three-Phase Commit (3PC) — extends 2PC to avoid blocking on coordinator failure by adding a pre-commit phase. Safer but rarely used in practice due to network assumptions.
Delivery Semantics — at-most-once, at-least-once, and exactly-once delivery: what each guarantees, what it costs, and why exactly-once is harder than it sounds.
Data integrity and observability
Change Data Capture (CDC) — streaming changes from a database's write-ahead log to downstream consumers. The mechanism behind real-time data pipelines, event sourcing, and the Outbox pattern's relay.
Erasure Coding — a method for storing data redundantly across multiple nodes using mathematical encoding, such that the data can be reconstructed from any subset of the encoded pieces. More storage-efficient than replication.
Merkle Trees — a tree of hash values where each parent is a hash of its children. Used to efficiently compare two large datasets and identify which portions differ, without comparing every record.
Observability — structured logging, metrics, and distributed tracing. How distributed systems are understood and debugged at runtime. Without observability, a distributed system is a black box.
The URL shortener, finally distributed
This is the final pillar. The URL shortener now runs across multiple data centres with every pattern from Pillars 1–7 applied. Pillar 8 reveals what happens in the gaps — the eleven-second partition, the primary that failed over but came back, the two workers that processed the same message twice, the Cassandra replica that was 30 seconds behind when the primary went down.
Every mechanism in this pillar exists because distributed systems fail in ways that single machines don't. Understanding these mechanisms is what separates engineers who can reason about distributed system behaviour from those who can only observe it after the fact.
→ Next: Network Partitions — what happens when nodes can't communicate




