# Raft: Consensus Made Understandable

> **Series:** System Design · Distributed Systems — Pillar 8 of 8

## Systems Design

| # | Post | What it covers |
| --- | --- | --- |
| 00 | [Distributed Systems: What Happens When Machines Disagree](/distributed-systems-what-happens-when-machines-disagree) | Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar. |
| 01 | [Network Partitions: The Failure Mode You Can't Design Away](/network-partitions-the-failure-mode-you-cant-design-away) | Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice. |
| 02 | [Split-Brain: When Two Nodes Both Think They're the Leader](/split-brain-when-two-nodes-both-think-theyre-the-leader) | Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it. |
| 03 | [Heartbeats: How Nodes Know Their Peers Are Alive](/heartbeats-how-nodes-know-their-peers-are-alive) | Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work. |
| 04 | [Leader Election: Agreeing on Who's in Charge](/leader-election-agreeing-on-whos-in-charge) | Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve. |
| 05 | [Consensus Algorithms: Agreeing on a Value Across Failures](/consensus-algorithms-agreeing-on-a-value-across-failures) | Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used. |
| 06 | [Quorum: How Many Nodes Must Agree?](/quorum-how-many-nodes-must-agree) | Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB. |
| 07 | [Paxos: The Algorithm That Started It All](/paxos-the-algorithm-that-started-it-all) | Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production. |
| 08 | **Raft: Consensus Made Understandable** ← you are here | Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV. |
| 09 | [Gossip Protocol: Decentralised Cluster Communication](/gossip-protocol-decentralised-cluster-communication) | Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production. |
| 10 | [Logical Clocks: When Physical Time Isn't Enough](/logical-clocks-when-physical-time-isnt-enough) | Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works. |
| 11 | [Lamport Timestamps: Ordering Events Without a Global Clock](/lamport-timestamps-ordering-events-without-a-global-clock) | Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you. |
| 12 | [Vector Clocks: Knowing When Events Are Truly Concurrent](/vector-clocks-knowing-when-events-are-truly-concurrent) | Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations. |
| 13 | [Distributed Transactions: When One Machine Isn't Enough](/distributed-transactions-when-one-machine-isnt-enough) | Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative. |
| 14 | [Two-Phase Commit: Coordinating a Distributed Decision](/two-phase-commit-coordinating-a-distributed-decision) | 2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems. |
| 15 | [Three-Phase Commit: Solving 2PC's Blocking Problem](/three-phase-commit-solving-2pcs-blocking-problem) | 3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production. |
| 16 | [Delivery Semantics: What Does "Delivered" Actually Mean?](/delivery-semantics-what-does-delivered-actually-mean) | Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate. |
| 17 | [Change Data Capture: Streaming Your Database in Real Time](/change-data-capture-streaming-your-database-in-real-time) | CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it. |
| 18 | [Erasure Coding: Fault Tolerance Without Full Replication](/erasure-coding-fault-tolerance-without-full-replication) | Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication. |
| 19 | [Merkle Trees: Efficiently Finding What's Different](/merkle-trees-efficiently-finding-whats-different) | Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy. |
| 20 | [Observability: Understanding Your System at Runtime](/observability-understanding-your-system-at-runtime) | Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together. |
| 21 | [Distributed Systems: Wrap-Up](/distributed-systems-wrap-up) | A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series. |

* * *

## The problem

Paxos is correct. It is also notoriously difficult to understand, implement correctly, and explain to engineers who need to reason about the systems built on it. In their 2014 paper, "In Search of an Understandable Consensus Algorithm," Diego Ongaro and John Ousterhout made the case plainly and backed it with a user study: of 43 students who learned both algorithms, 33 answered questions about Raft better than questions about Paxos. They designed Raft with one primary goal: understandability.

The result is an algorithm that provides the same safety guarantees as Paxos with a structure that engineers can explain, implement, and debug. Raft is now among the most widely used consensus algorithms in production: etcd, CockroachDB, TiKV, and Consul all use it. Kubernetes depends on etcd, and many of the distributed databases shipped in the last decade are built on Raft.

* * *

## The core idea

Raft decomposes consensus into three relatively independent subproblems: **leader election** (choosing one node to coordinate), **log replication** (the leader accepts entries and replicates them to followers), and **safety** (ensuring no two nodes ever commit different values at the same log index). Each subproblem is tractable independently; together they provide a complete consensus protocol.

* * *

## The analogy: a parliamentary legislature with a speaker

A parliament has a Speaker (the leader) who controls the floor. Only the Speaker can introduce legislation (log entries). When the Speaker proposes a bill, members vote on it. If a majority passes it, it becomes law (committed). If the Speaker is absent, parliament elects a new Speaker before proceeding.

The legislature's rule: all laws are passed in the order the Speaker introduced them. No law is rescinded once passed. If the Speaker changes, the new Speaker inherits all previously passed laws and introduces new ones from where the old Speaker left off.

This is Raft's model, exactly.

* * *

## How Raft works

### Terms

Raft uses **terms**, monotonically increasing logical time units, to distinguish legitimate leaders from stale ones. Each election starts a new term, and every message carries the sender's term. A node that receives a message with a higher term immediately updates its own term and reverts to follower; a node that receives a message with a lower term rejects it. Together these rules prevent an old leader from reasserting authority after recovering from a partition.
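
The term rule is small enough to sketch directly. This is a minimal illustration rather than code from any Raft library: `Node` and `observe_term` are hypothetical names, and rejecting messages from lower terms follows the Raft paper.

```python
from dataclasses import dataclass


@dataclass
class Node:
    current_term: int = 0
    role: str = "follower"  # "follower", "candidate", or "leader"


def observe_term(node: Node, message_term: int) -> bool:
    """Apply the term rule to an incoming message.

    Returns True if the message should be processed, False if it is stale.
    """
    if message_term > node.current_term:
        # A newer term exists: adopt it and step down, whatever our role was.
        node.current_term = message_term
        node.role = "follower"
        return True
    # Equal terms are processed normally; lower terms come from a stale sender.
    return message_term == node.current_term


# An old leader from term 3 reconnects and hears from a term-5 leader.
old_leader = Node(current_term=3, role="leader")
accepted = observe_term(old_leader, message_term=5)
print(accepted, old_leader.role, old_leader.current_term)  # True follower 5

# A late message from term 3 now arrives and is ignored.
print(observe_term(old_leader, message_term=3))  # False
```

The same check runs on every RPC request and response, which is why a partitioned leader steps down as soon as it hears from any node that has moved on.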

### Leader election

All nodes start as **followers**. A follower that doesn't receive a heartbeat from the leader within `electionTimeout` becomes a **candidate** and starts an election:

1.  Increments its term
    
2.  Votes for itself
    
3.  Sends `RequestVote` RPCs to all other nodes
    

A node grants its vote if it hasn't already voted for a different candidate in this term AND the candidate's log is at least as up-to-date as its own (a higher last-entry term, or the same last-entry term and a log at least as long).
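
The voting rule can be sketched as a single function on the voter's side. This is an illustrative sketch under simplifying assumptions (no persistence, no networking); `Voter`, `log_is_up_to_date`, and `grant_vote` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    current_term: int
    voted_for: Optional[str]  # who this node voted for in current_term, if anyone
    last_log_term: int        # term of the last entry in this node's log
    last_log_index: int       # index of the last entry in this node's log


def log_is_up_to_date(voter: Voter, cand_last_term: int, cand_last_index: int) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != voter.last_log_term:
        return cand_last_term > voter.last_log_term
    return cand_last_index >= voter.last_log_index


def grant_vote(voter: Voter, candidate_id: str, term: int,
               cand_last_term: int, cand_last_index: int) -> bool:
    if term < voter.current_term:
        return False                   # stale candidate
    if term > voter.current_term:
        voter.current_term = term      # new term: any earlier vote no longer counts
        voter.voted_for = None
    if voter.voted_for not in (None, candidate_id):
        return False                   # already voted for someone else this term
    if not log_is_up_to_date(voter, cand_last_term, cand_last_index):
        return False                   # candidate is missing entries this voter has
    voter.voted_for = candidate_id
    return True


voter = Voter(current_term=4, voted_for=None, last_log_term=2, last_log_index=4)
# Same last term but a shorter log: refused.
print(grant_vote(voter, "A", term=5, cand_last_term=2, cand_last_index=3))  # False
# Higher last term wins even with a shorter log: granted.
print(grant_vote(voter, "B", term=5, cand_last_term=3, cand_last_index=1))  # True
# The vote for term 5 is already spent: refused.
print(grant_vote(voter, "C", term=5, cand_last_term=3, cand_last_index=9))  # False
```

Note that the comparison looks at the term of the last entry first and only falls back to log length on a tie. A long log full of old-term entries loses to a short log with a newer one.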

The candidate wins if it receives votes from a majority (⌊N/2⌋ + 1 of N nodes, including itself). It immediately begins sending heartbeats to prevent new elections.
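
Seen from the candidate's side, the election steps and the majority check look roughly like this. The sketch asks peers one at a time for clarity, whereas real implementations send the RPCs in parallel; `run_election`, `request_vote`, and `make_rpc` are hypothetical stand-ins for illustration.

```python
from typing import Callable, Iterable, Tuple


def run_election(candidate_id: str, current_term: int, peers: Iterable[str],
                 request_vote: Callable[[str, int, str], bool]) -> Tuple[int, bool]:
    """Run one election round; return (new_term, won)."""
    peers = list(peers)
    term = current_term + 1              # step 1: increment the term
    votes = 1                            # step 2: vote for itself
    for peer in peers:                   # step 3: ask every other node
        if request_vote(peer, term, candidate_id):
            votes += 1
    cluster_size = len(peers) + 1        # peers plus the candidate
    needed = cluster_size // 2 + 1       # floor(N/2) + 1
    return term, votes >= needed


def make_rpc(granting_peers):
    """Fake transport for the example: only peers in granting_peers say yes."""
    return lambda peer, term, candidate_id: peer in granting_peers


peers = ["B", "C", "D", "E"]
print(run_election("A", 7, peers, make_rpc({"B", "C"})))  # (8, True): 3 of 5 votes
print(run_election("A", 7, peers, make_rpc({"B"})))       # (8, False): 2 of 5 votes
```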

If no candidate wins (split vote), the election times out and a new election starts with a higher term. Randomised election timeouts (each node picks a random delay before starting a candidacy) make split votes rare, so elections usually converge quickly.
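
A small sketch shows why randomisation helps: with independently drawn timeouts, one node almost always times out noticeably before the others and can win before anyone else stands. The 150 to 300 ms range is the example range suggested in the Raft paper and is an assumption here; the function names are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def pick_election_timeout(rng: random.Random,
                          low_ms: float = 150.0, high_ms: float = 300.0) -> float:
    """Draw a fresh timeout; a node re-draws this every time it resets its timer."""
    return rng.uniform(low_ms, high_ms)


def first_candidate(node_ids: List[str],
                    rng: random.Random) -> Tuple[str, Dict[str, float]]:
    """Return the node whose timer fires first, plus every node's timeout."""
    timeouts = {node: pick_election_timeout(rng) for node in node_ids}
    earliest = min(timeouts, key=timeouts.get)
    return earliest, timeouts


rng = random.Random(42)  # seeded only so the sketch is repeatable
candidate, timeouts = first_candidate(["A", "B", "C"], rng)
for node, timeout in sorted(timeouts.items(), key=lambda item: item[1]):
    print(f"{node}: {timeout:.0f} ms")
print("first to stand:", candidate)
```

If two timers do fire close together and the vote splits, each node draws a new random timeout for the next round, so a repeat collision is unlikely.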

### Log replication

The leader accepts client commands and appends them to its log as new entries. Each entry has an index, term number, and the command.

The leader sends `AppendEntries` RPCs to all followers with the new entry. Followers append the entry if it's consistent with their log (the entry before it matches). The leader waits for a majority of nodes to acknowledge the entry.

Once a majority has acknowledged the entry, it is **committed**: it will survive any future leader changes. The leader applies the entry to its state machine and returns the result to the client.

```plaintext
Log:
Index: 1    2    3    4    5
Term:  1    1    2    2    2
Cmd:   w1   w2   w3   w4   w5 (uncommitted)
                      ↑
            commit index = 4 (majority acknowledged through here)
```

Followers learn the commit index via subsequent `AppendEntries` RPCs and apply committed entries to their state machines.
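
One way a leader can work out the commit index is to track, per follower, the highest index known to be stored there, and take the highest index present on a majority. The sketch below assumes that bookkeeping (called `match_index`, as in the Raft paper). It also includes the paper's extra condition, not covered in the text above, that a leader only commits by counting replicas for entries from its own term. Function names are hypothetical.

```python
from typing import Dict, List


def majority_match_index(match_index: Dict[str, int], leader_last_index: int) -> int:
    """Highest log index stored on a majority of nodes, counting the leader."""
    stored = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    majority = len(stored) // 2 + 1
    # The majority-th largest value is held by at least `majority` nodes.
    return stored[majority - 1]


def advance_commit_index(commit_index: int, log_terms: List[int], current_term: int,
                         match_index: Dict[str, int]) -> int:
    """log_terms[i - 1] is the term of the entry at (1-based) index i."""
    candidate = majority_match_index(match_index, len(log_terms))
    # Only entries from the leader's own term are committed by counting replicas;
    # earlier entries become committed along with them.
    if candidate > commit_index and log_terms[candidate - 1] == current_term:
        return candidate
    return commit_index


# Five-node cluster: the leader holds 5 entries, followers have acknowledged fewer.
log_terms = [1, 1, 2, 2, 2]
match_index = {"B": 4, "C": 4, "D": 2, "E": 1}
print(advance_commit_index(2, log_terms, current_term=2, match_index=match_index))  # 4
```

Here the leader, B, and C all hold entry 4, which is three of five nodes, so the commit index moves to 4 while entry 5 stays uncommitted.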

### Log consistency guarantee

Raft maintains a critical invariant, which the Raft paper calls the Log Matching Property: **if two logs contain an entry with the same index and term, the logs are identical in all entries up through that index.** This is enforced by the consistency check in `AppendEntries`: each request carries the index and term of the entry immediately preceding the new ones, and the follower accepts only if its own log has an entry at that index with that term. If not, it rejects the append, and the leader retries with progressively earlier entries until the logs agree.
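
On the follower, the consistency check comes down to a few lines. The sketch below is a simplified, in-memory version: it ignores terms on the RPC itself, persistence, and the commit index, and `Entry` and `append_entries` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    term: int
    command: str


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower side. Indexes are 1-based; prev_index 0 means 'start of the log'."""
    # Consistency check: we must hold the entry just before the new ones.
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index <= len(log):
            if log[index - 1].term == entry.term:
                continue             # already have this entry
            del log[index - 1:]      # conflict: drop it and everything after it
        log.append(entry)
    return True


follower = [Entry(1, "w1"), Entry(1, "w2"), Entry(2, "x")]

# The leader believes the follower has its index-3 entry (term 3): rejected.
print(append_entries(follower, prev_index=3, prev_term=3, entries=[Entry(3, "w4")]))  # False

# The leader backs up one entry; index 2 matches, so the stale "x" is replaced.
print(append_entries(follower, prev_index=2, prev_term=1,
                     entries=[Entry(3, "w3"), Entry(3, "w4")]))  # True
print([e.command for e in follower])  # ['w1', 'w2', 'w3', 'w4']
```

Because every accepted append has passed this check, agreement on one entry implies agreement on everything before it, which is the invariant stated above.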

### Safety: the state machine safety property

A key safety property: **Raft never allows two different committed entries at the same log index.** This follows from:

1.  At most one leader per term (each node votes once per term, and winning takes a majority)
    
2.  A leader never overwrites or deletes entries in its own log; it only appends
    
3.  The vote restriction: a candidate can only win if its log is at least as up-to-date as the log of every node in the majority that votes for it. Any committed entry is stored on a majority, and any two majorities overlap, so the winner has all committed entries
    

These three properties together guarantee no two nodes ever commit different values at the same index.

### Leader changes and log reconciliation

When a new leader is elected, some followers' logs may be longer than its own or may conflict with it, because they hold entries a previous leader appended but never replicated to a majority. Raft's approach: the new leader's log is authoritative. Followers overwrite any entries that conflict with the leader's log. This is safe because those entries were never committed: committed entries survive leader changes; uncommitted ones may not.

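For each follower, the leader works backwards to the latest index where the two logs agree (each rejected `AppendEntries` makes it retry one entry earlier), then sends its entries from that point on. The follower deletes its divergent tail and appends the leader's entries.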
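
The reconciliation loop can be simulated in-process. This sketch assumes the `nextIndex` back-off described in the Raft paper and collapses each RPC into a function call. For brevity, the follower replaces its whole tail after the agreed point, which gives the same result here because the leader sends everything it has from that point on. All names are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def follower_accepts(follower_log: List[Entry], prev_index: int, prev_term: int) -> bool:
    """The AppendEntries consistency check, with 1-based indexes."""
    if prev_index == 0:
        return True
    return prev_index <= len(follower_log) and follower_log[prev_index - 1][0] == prev_term


def reconcile(leader_log: List[Entry], follower_log: List[Entry]) -> int:
    """Make follower_log match leader_log; return how many attempts it took."""
    next_index = len(leader_log) + 1     # optimistic: assume the follower is caught up
    attempts = 0
    while True:
        attempts += 1
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if follower_accepts(follower_log, prev_index, prev_term):
            del follower_log[prev_index:]                 # drop the divergent tail
            follower_log.extend(leader_log[prev_index:])  # copy the leader's entries
            return attempts
        next_index -= 1                  # rejected: retry one entry earlier


leader = [(1, "w1"), (1, "w2"), (3, "w3")]
follower = [(1, "w1"), (1, "w2"), (2, "x"), (2, "y")]  # "x", "y" never committed
print(reconcile(leader, follower))  # 2
print(follower == leader)           # True
```

Stepping back one entry per rejection is the simplest strategy. It always terminates, because at index 0 there is nothing left to disagree about.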

* * *

## Raft in etcd: the Kubernetes brain

Kubernetes stores all cluster state in etcd — pod specs, service definitions, configuration, secrets. etcd uses Raft for its replicated log. Every etcd write goes through Raft's commit process: proposed to the leader, replicated to followers, committed when a majority acknowledge.

An etcd cluster of 3 nodes tolerates 1 failure; 5 nodes tolerate 2. The API server is the only Kubernetes component that talks to etcd directly; the scheduler and controller manager read and write cluster state through it. If etcd loses quorum (fewer than a majority of its nodes remain reachable), Kubernetes cannot make decisions: scheduling stops and deployments freeze.
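
The failure-tolerance numbers follow directly from the majority rule, and the arithmetic also shows why clusters are usually sized with an odd number of nodes. A quick sketch, with hypothetical helper names:

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return cluster_size - quorum(cluster_size)


for n in (3, 4, 5):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 nodes: quorum 2, tolerates 1 failure(s)
# 4 nodes: quorum 3, tolerates 1 failure(s)
# 5 nodes: quorum 3, tolerates 2 failure(s)
```

Going from 3 to 4 nodes raises the quorum without raising the number of failures the cluster can survive.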

This is the price of strong consistency: etcd will not accept a write it cannot replicate to a majority, so it becomes unavailable if enough nodes fail.

* * *

## Tradeoffs

**Leader is the bottleneck.** All writes route through the leader. For high-write systems, a single Raft group's throughput is limited by one node's capacity. Systems like CockroachDB and TiKV shard data across many independent Raft groups (one per key range) to parallelise write throughput.
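
The sharding idea can be pictured as a lookup from key to the Raft group that owns its range. The sketch below is purely illustrative and is not how CockroachDB or TiKV implement routing; the range boundaries, group names, and `group_for_key` are all made up for the example.

```python
import bisect

# Hypothetical layout: group i owns keys in [RANGE_STARTS[i], RANGE_STARTS[i + 1]).
RANGE_STARTS = ["", "g", "n", "t"]  # sorted start keys; "" covers the lowest keys
GROUP_IDS = ["raft-group-1", "raft-group-2", "raft-group-3", "raft-group-4"]


def group_for_key(key: str) -> str:
    """Find the Raft group whose key range contains `key`."""
    position = bisect.bisect_right(RANGE_STARTS, key) - 1
    return GROUP_IDS[position]


for key in ("apple", "grape", "melon", "zebra"):
    print(key, "->", group_for_key(key))
# apple -> raft-group-1
# grape -> raft-group-2
# melon -> raft-group-2
# zebra -> raft-group-4
```

Each group elects its own leader and replicates its own log, so writes to different ranges can proceed on different leaders in parallel.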

**Quorum requirement limits availability.** A Raft group of 3 requires 2 nodes to be reachable. If one node is cut off (or too slow to respond within the timeouts) and another crashes, the cluster loses quorum and stops accepting writes. Running one node in each of 3 availability zones is a common pattern: losing any single zone still leaves a quorum.

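**Linearisable reads require care.** A read served by a follower can be stale, because the follower may not yet have applied the latest committed entries. A deposed leader that has not yet noticed the new term can serve stale reads too. Linearisable reads therefore go through the leader, which either confirms it still holds a majority before answering or relies on a time-bound read lease.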

* * *

## The one thing to remember

> **Raft achieves consensus through an elected leader that replicates log entries to a majority of followers before committing.** Safety comes from three rules: only one leader per term, leaders never overwrite committed entries, and candidates must have an up-to-date log to be elected. Raft is the most important consensus algorithm to understand in practice — it directly powers Kubernetes (via etcd), CockroachDB, TiKV, and Consul. When you interact with any of these systems, you're interacting with Raft's guarantees.

* * *

*← Previous:* [***Paxos***](/paxos-the-algorithm-that-started-it-all) *— the foundational consensus algorithm, notoriously difficult to understand but the basis of many production systems.*

*→ Next:* [***Gossip Protocol***](/gossip-protocol-decentralised-cluster-communication) *— decentralised cluster membership and state propagation without a leader or consensus.*
