# Gossip Protocol: Decentralised Cluster Communication

> **Series:** System Design · Distributed Systems — Pillar 8 of 8

## Systems Design

| # | Post | What it covers |
| --- | --- | --- |
| 00 | [Distributed Systems: What Happens When Machines Disagree](/distributed-systems-what-happens-when-machines-disagree) | Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar. |
| 01 | [Network Partitions: The Failure Mode You Can't Design Away](/network-partitions-the-failure-mode-you-cant-design-away) | Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice. |
| 02 | [Split-Brain: When Two Nodes Both Think They're the Leader](/split-brain-when-two-nodes-both-think-theyre-the-leader) | Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it. |
| 03 | [Heartbeats: How Nodes Know Their Peers Are Alive](/heartbeats-how-nodes-know-their-peers-are-alive) | Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work. |
| 04 | [Leader Election: Agreeing on Who's in Charge](/leader-election-agreeing-on-whos-in-charge) | Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve. |
| 05 | [Consensus Algorithms: Agreeing on a Value Across Failures](/consensus-algorithms-agreeing-on-a-value-across-failures) | Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used. |
| 06 | [Quorum: How Many Nodes Must Agree?](/quorum-how-many-nodes-must-agree) | Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB. |
| 07 | [Paxos: The Algorithm That Started It All](/paxos-the-algorithm-that-started-it-all) | Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production. |
| 08 | [Raft: Consensus Made Understandable](/raft-consensus-made-understandable) | Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV. |
| 09 | **Gossip Protocol: Decentralised Cluster Communication** ← you are here | Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production. |
| 10 | [Logical Clocks: When Physical Time Isn't Enough](/logical-clocks-when-physical-time-isnt-enough) | Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works. |
| 11 | [Lamport Timestamps: Ordering Events Without a Global Clock](/lamport-timestamps-ordering-events-without-a-global-clock) | Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you. |
| 12 | [Vector Clocks: Knowing When Events Are Truly Concurrent](/vector-clocks-knowing-when-events-are-truly-concurrent) | Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations. |
| 13 | [Distributed Transactions: When One Machine Isn't Enough](/distributed-transactions-when-one-machine-isnt-enough) | Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative. |
| 14 | [Two-Phase Commit: Coordinating a Distributed Decision](/two-phase-commit-coordinating-a-distributed-decision) | 2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems. |
| 15 | [Three-Phase Commit: Solving 2PC's Blocking Problem](/three-phase-commit-solving-2pcs-blocking-problem) | 3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production. |
| 16 | [Delivery Semantics: What Does "Delivered" Actually Mean?](/delivery-semantics-what-does-delivered-actually-mean) | Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate. |
| 17 | [Change Data Capture: Streaming Your Database in Real Time](/change-data-capture-streaming-your-database-in-real-time) | CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it. |
| 18 | [Erasure Coding: Fault Tolerance Without Full Replication](/erasure-coding-fault-tolerance-without-full-replication) | Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication. |
| 19 | [Merkle Trees: Efficiently Finding What's Different](/merkle-trees-efficiently-finding-whats-different) | Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy. |
| 20 | [Observability: Understanding Your System at Runtime](/observability-understanding-your-system-at-runtime) | Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together. |
| 21 | [Distributed Systems: Wrap-Up](/distributed-systems-wrap-up) | A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series. |

* * *

# Gossip Protocol: Decentralised Cluster Communication

## The problem

Your Cassandra cluster has 100 nodes. Each node needs to know the health status of every other node — which are alive, which are dead, which are handling which token ranges. With 100 nodes, that's each node needing updates from 99 others.

A centralised approach: one coordinator node receives status from all nodes and broadcasts updates. The coordinator is a single point of failure and a bottleneck for a 100-node cluster.

A direct broadcast: each node broadcasts its status to all 99 others. With 100 nodes sending 99 messages each per second, that's 9,900 messages per second — fine at 100 nodes, but quadratically worse as the cluster grows. A 1,000-node cluster would generate nearly a million messages per second just for status updates.

Gossip protocols solve this with a different model: each node only talks to a few random peers, but information spreads exponentially fast across the cluster through the chain of random communications.

* * *

## The core idea

A gossip protocol (also called epidemic protocol) propagates information through a cluster by having each node periodically select a small number of random peers and exchange information with them. Like a rumour spreading through a social network, information reaches every node in O(log N) rounds — logarithmically fast — without any central coordinator and without each node communicating with all others directly.

* * *

## The analogy: rumour spreading through a school

A rumour starts with one student. Every few minutes, each student who knows the rumour tells a randomly selected other student. After one round, 2 students know. After two rounds, ~4 know. After three rounds, ~8. After about 7 rounds (log₂(100)), essentially everyone knows.

No one is in charge. No one coordinates who tells whom. The rumour spreads naturally through random pairwise contact. This is gossip: decentralised, scalable, resilient. If a few students are absent, the rumour still spreads — it just takes one extra round to route around them. Check our interactive diagram below:



* * *

## How gossip works

### The basic mechanism

Every T seconds (typically 1 second), each node:

1.  Selects 1–3 random peers from its known cluster membership
    
2.  Sends its current state vector to those peers
    
3.  Receives their state in return
    
4.  Updates its local view of cluster state with anything newer than what it already has
    

```plaintext
Node A gossips with Node B:
  A sends: {A: (alive, generation=5, version=42), C: (alive, gen=3, ver=19)}
  B sends: {B: (alive, gen=4, ver=31), C: (alive, gen=3, ver=20), D: (suspected, gen=2)}

After exchange:
  A now knows: B is at version 31, C is at version 20 (newer than 19), D is suspected
  B now knows: A is at version 42
```

The **version number** (or generation+version pair) identifies the currency of each node's information about every other node. Higher version = more recent.

### Convergence rate

In a cluster of N nodes with each node gossiping with k peers per round, new information reaches all nodes within O(log\_k(N)) rounds. For k=3 and N=100: log₃(100) ≈ 4.2 rounds. With 1-second gossip intervals, the entire cluster learns new state within ~5 seconds.

This is remarkably efficient: 100 nodes, each sending 3 messages per second = 300 total messages per second for cluster-wide state propagation — constant, regardless of cluster size growth to moderate scales.

### Failure detection via gossip

Cassandra uses gossip for failure detection in combination with heartbeats:

*   Each node gossips its own heartbeat counter (incrementing every second) to its peers
    
*   If a peer's heartbeat counter hasn't increased in the expected time, its "phi" suspicion level rises (the phi accrual detector from post 03)
    
*   When φ exceeds a threshold, the node is marked as suspected, then down
    

Marking a node down propagates through gossip: node A marks B as down, gossips to C and D, who update their views and gossip to E and F. Within seconds, the whole cluster considers B down.

### What gossip is used for

**Cluster membership:** which nodes are in the cluster, which are healthy, which are joining or leaving. Cassandra, Redis Cluster, Consul.

**Ring topology in consistent hashing:** when a node joins or leaves the Cassandra cluster, its token assignments propagate via gossip. Within seconds, all nodes know the updated ring.

**Data anti-entropy (Cassandra):** Cassandra periodically gossips about Merkle tree hashes of its data to identify and repair inconsistencies between replicas (covered in post 19).

**Service health in Consul:** Consul agents gossip health check results. A failing service's unhealthiness propagates to all agents within seconds without any central coordinator.

* * *

## Tradeoffs

**Eventual consistency of the cluster view.** Gossip doesn't provide strong consistency — there's always a brief window where different nodes have different views of cluster state. A node that was just marked down may still be in some nodes' "alive" view for a few seconds. This is acceptable for membership and health propagation; it's not suitable for consensus or coordination decisions (use Raft/Paxos for those).

**Bandwidth scales with state size, not cluster size.** Each gossip message carries the state of the sender's view of the cluster. If each node stores metadata about N other nodes, gossip messages grow with N. For very large clusters (thousands of nodes), this can become expensive. Solutions: cap the amount of state included per gossip round, or use hierarchical gossip (gossip within a zone, then between zones).

**No total ordering.** Gossip delivers information eventually but not in any guaranteed order. If two events happen simultaneously on different nodes, different nodes may learn about them in different orders. This is fine for membership changes; it's not acceptable for ordered log replication.

* * *

## The one thing to remember

> **Gossip protocols spread information across a cluster in O(log N) rounds by having each node exchange state with a few random peers — no central coordinator, no single point of failure.** Each individual exchange is cheap; the collective effect is rapid, resilient propagation. Cassandra, Redis Cluster, and Consul all use gossip for cluster membership because it scales naturally, tolerates node failures, and requires no dedicated coordination infrastructure. The cost: eventual consistency of the cluster view, not instantaneous agreement.

* * *

*← Previous:* [***Raft***](/raft-consensus-made-understandable) *— the consensus algorithm designed for understandability; the most important one to know in practice.*

*→ Next:* [***Logical Clocks***](/logical-clocks-when-physical-time-isnt-enough) *— in distributed systems, physical clocks can't be trusted; logical clocks capture causal relationships between events instead.*
