Cloud Tuned

Docker & Kubernetes: What They Are, Why They Matter, and How to Get Started

Cloud Tuned — Sun, 16 Mar 2025 23:30:00 GMT

Introduction to Rancher: Wrangling Kubernetes Clusters at Scale

Cloud Tuned — Sat, 15 Mar 2025 23:30:00 GMT

Managing one Kubernetes cluster is a challenge. Managing a dozen of them — across different clouds, data centres, and edge locations — can feel like herding cattle. That's exactly the problem Rancher was built to solve.

What Is Rancher?

Rancher is a free, open-source tool created by Rancher Labs (acquired by SUSE in 2020) that lets you manage multiple Kubernetes clusters from a single place. If Kubernetes orchestrates and deploys containers, then Rancher does the same thing — one level up — for Kubernetes clusters themselves.

It supports clusters running in public clouds (AWS EKS, Azure AKS, Google GKE), on-premises data centres, hybrid environments, and even Internet of Things (IoT) devices. Rancher works with any Kubernetes distribution certified by the Cloud Native Computing Foundation (CNCF), which covers virtually every standard distribution.

The name isn't an accident. There's a well-known saying in cloud infrastructure: "servers are cattle, not pets." When you've got a lot of cattle, you need a rancher to manage the herd.

Why Use Rancher?

The central value proposition is centralised multi-cluster management, but it goes deeper than that:

Multi-cloud and hybrid cloud support — manage all your clusters to a single standard, regardless of where they live.
CNCF compliance — broad compatibility across the Kubernetes ecosystem.
Deep integration with managed Kubernetes services — Rancher can interact directly with cloud provider APIs alongside the Kubernetes API, so you keep the benefits of managed platforms while using Rancher.
Security at scale — implement access controls and compliance policies across every cluster from one console, without micromanaging each one individually.
CI/CD automation — connect Git repositories for automated, Git-based deployments to multiple clusters at once.

Core Features

Rancher's feature set breaks down into four main areas:

1. Cluster Explorer

A GUI console that gives you full visibility into your Kubernetes objects — namespaces, nodes, workloads (CronJobs, DaemonSets, Deployments, StatefulSets, Pods), service discovery, storage objects, and RBAC roles — all in one place. If it's a Kubernetes object, Cluster Explorer will surface it.

2. Continuous Delivery

Rancher's built-in CD tooling connects directly to Git repositories and automates deployments to specific clusters or cluster groups. You can manage Git-based workflows, define workspaces, and keep your delivery pipelines aligned across your entire infrastructure.

3. Apps & Marketplace

Built on Helm charts (packages of templated YAML manifests for deploying complex applications on Kubernetes), the marketplace comes pre-loaded with catalogues for common services like Prometheus, Longhorn, and NGINX. You can also import third-party catalogues or build your own internal catalogue, making it easy for teams to deploy standardised services without reinventing the wheel each time.

4. Security

Rancher integrates with Kubernetes-native RBAC so you can manage user access across all clusters from one console. It supports pod security policies at an organisational level (Kubernetes removed the PodSecurityPolicy API in v1.25 in favour of Pod Security Admission, so the exact mechanism depends on your cluster version), and authenticates with a wide range of third-party identity providers, including Active Directory, GitHub, Okta, and Keycloak, so you can plug Rancher into whatever your organisation already uses.

The Rancher Ecosystem

Rancher Labs built more than just Rancher. A few closely related tools are worth knowing about:

Rancher Kubernetes Engine (RKE)

RKE is a CNCF-certified Kubernetes distribution that runs entirely inside Docker containers. Rather than manually installing and configuring Kubernetes — a notoriously complex process — RKE wraps that complexity into a straightforward setup that can run on any operating system capable of running Docker, whether bare-metal or virtualised servers.

Key use cases include:

Running Kubernetes in data centres where managed cloud services aren't available
Avoiding vendor lock-in by keeping clusters portable across cloud providers
Maintaining control over underlying infrastructure that managed services typically abstract away
Strict security environments — RKE2 (also called RKE Government) adds compliance with CIS benchmarks and FIPS requirements

K3s

K3s is a lightweight Kubernetes distribution purpose-built for IoT and edge computing. The name is a nod to Kubernetes (abbreviated K8s) — K3s is intentionally smaller.

The entire K3s binary is under 40MB and includes all dependencies needed to get a cluster running. It's optimised for low-compute environments, making it ideal for:

Resource-constrained hardware
Remote sites with limited connectivity (satellite offices, oil rigs)
IoT devices
Running production-grade Kubernetes on something as modest as a Raspberry Pi

K3s was donated by Rancher Labs to the CNCF in 2020 and is now an official CNCF project.

Longhorn

Longhorn is a cloud-native distributed block storage solution for Kubernetes. It tackles one of the trickier challenges in Kubernetes: persistent storage for stateful workloads.

Kubernetes nodes are designed to be ephemeral — they can fail at any moment without taking the cluster down. That's great for stateless workloads, but if you need data to persist, storing it only on a node that might disappear is a problem. Longhorn addresses this by replicating data across multiple nodes, so if one goes down, the data remains available.

Its advantages include cloud-native integration (backing up data to object storage, cross-availability zone failover), resilience (backup clusters separate from primary), and ease of use (one-click installation, live upgrades without downtime).

A Quick Tour of the Console

In practice, Rancher's interface is straightforward. From the main dashboard you can see all clusters under management at a glance. The Cluster Manager view gives you high-level cluster health and metadata, while Cluster Explorer lets you drill into the granular details of every Kubernetes object.

Continuous Delivery is accessible from the same navigation, where you paste in a Git repo URL, define the branches to watch, and configure deployment targets. Apps & Marketplace sits alongside it, offering a searchable catalogue of Helm charts ready to deploy with a few clicks. Security settings — authentication providers, user roles, and pod security policies — live under the Global section of the Cluster Manager.

Use Cases at a Glance

Scenario	How Rancher Helps
Large enterprise with multiple teams	Central shared services team manages all clusters; app teams keep their own
Multi-cloud or hybrid cloud	Unified management and standardisation across all environments
Strict compliance requirements	Organisation-wide pod security policies without manual cluster-by-cluster work
Edge and IoT deployments	K3s runs lightweight clusters on low-power, low-connectivity devices
GitOps and CI/CD	Continuous Delivery connects Git repos to automated cluster deployments

Getting Started

Because Rancher is open source, there's no licensing cost to get started. You can pull the Docker image and have an instance running locally in minutes. The official Rancher documentation is a solid next step, as is the Rancher community Slack for connecting with other users.

If you want to go deeper on the Kubernetes side, resources like the Kubernetes Deep Dive course or the Certified Kubernetes Administrator (CKA) exam preparation materials pair well with this foundation.

Rancher doesn't try to replace Kubernetes — it makes managing it at scale genuinely tractable. Whether you're running two clusters or twenty, across one cloud or five, it's a tool worth having in your infrastructure toolkit.

Networking Fundamentals: A Beginner's Guide to How the Internet Actually Works

Cloud Tuned — Fri, 14 Mar 2025 23:30:00 GMT

You use computer networks every single day — to send emails, stream videos, join video calls, and browse the web. But what's actually happening behind the scenes? This guide breaks down the core concepts of computer networking from the ground up, no prior knowledge required.

What Is a Computer Network?

At its most basic, a computer network is nothing more than two computers connected together and communicating. That's it. Everything else — the internet, your home Wi-Fi, your office network — is just a scaled-up version of that simple idea.

Networks exist because they make our digital lives enormously more practical:

Exchanging data — sharing photos, videos, and files between devices
Sharing resources — one printer serving an entire office, rather than one per desk
Communication — email, instant messaging, video calls, and everything in between

A Brief History

Computer networking has come a long way in a short time:

Late 1960s — The Pentagon's ARPANET project connected its first four research sites in 1969; the first networked email followed in 1971
1970s–80s — The Ethernet standard and TCP/IP protocols were created, establishing the rules for how data should be transmitted and how computers should talk to each other
1980s — The Domain Name System (DNS) was invented to help the growing number of computers find each other; dial-up networking and AOL brought email to millions of homes
1990s — The World Wide Web arrived, along with web browsers
2000s — Broadband replaced dial-up; Amazon Web Services launched in 2006, kicking off the cloud era
Today — Cloud computing means companies no longer need to buy expensive equipment worldwide — they simply deploy their applications to the cloud and they're available everywhere

Types of Networks

Not all networks are the same size or scope. The two most common types are:

LAN (Local Area Network): A network that connects devices within a small area: your home, a school, or an office building. Your home Wi-Fi is a LAN.

WAN (Wide Area Network): A network that connects smaller networks (like LANs) across large geographic distances, such as between cities or countries. The internet itself is the world's largest WAN, made up of countless public and private networks all connected together.

Network Topologies: How Devices Are Arranged

The topology of a network describes how devices relate to each other — how they're physically cabled together and how data flows between them. Here are the main ones:

Topology	How It Works	Pros	Cons
Ring	Each device connects to two neighbours in a circle	Simple structure	If one device fails, the loop breaks
Point-to-Point	One device connected directly to one other	Fast communication	Only two devices
Mesh	Every device connects to every other device	Very redundant and fault-tolerant	Expensive due to cabling
Star	All devices connect to a central hub	Easy to manage	The hub is a single point of failure
Bus	All devices on one single cable	Works well for small networks	The main cable is a single point of failure
Tree	Devices arranged in branches	Flexible and scalable	If a branch's top device fails, everything below goes offline
Hybrid	Mix of two or more topologies	Very flexible for growth	Complex and expensive

In practice, most modern networks use a hybrid approach.

The Building Blocks of a Network

Every computer network is made up of five core components:

Devices — Computers, tablets, phones, gaming consoles. Each network interface has a hardware identifier called a MAC address, which other devices on the same local network use to deliver traffic to it.
Connections — How devices physically connect: Ethernet cables, coax cable, or wirelessly.
Switches — Network equipment that connects the various devices on your local network to each other.
Routers — Network equipment that directs traffic between networks — from your home LAN out to the wider internet, for example.
Servers — Computers out on the internet that serve files, web pages, and applications.

The OSI Model: A Map of How Networks Communicate

The Open Systems Interconnection (OSI) Model is a theoretical framework that describes what happens to data as it travels from a hardware connection all the way up to an application on your screen — or vice versa.

It was developed in the 1980s to solve a real problem: how do devices from different manufacturers, running different software, communicate on the same network? The OSI model provides a common set of standards so that any device, from any vendor, can participate in a network.

It has seven layers, from the bottom (hardware) to the top (user-facing applications):

Layer	Name	What It Does	Example
7	Application	How your app interfaces with the network	HTTP, FTP, DNS
6	Presentation	Formats data so both sides can understand it	JPEG, MPEG, ASCII↔Binary
5	Session	Starts, manages, and stops communication sessions; keeps app data separate	NFS, SQL
4	Transport	Packages your message and ensures full delivery	TCP, UDP
3	Network	Routes data across LANs and WANs	IP addresses, Routers
2	Data Link	Packages data into frames for physical transmission; checks integrity	Network cards, Switches, ARP
1	Physical	The actual transmission of bits — electrical signals, radio waves, cables	Ethernet cables, Bluetooth, Wi-Fi

Layers 5–7 are called the upper layers (application-focused); layers 1–4 are the lower layers (focused on moving the data, from end-to-end transport down to the physical hardware).

Why does this matter in practice? When network engineers troubleshoot problems, they often refer to OSI layers. "It's a Layer 1 problem" means someone should check whether the cable is plugged in. "It's a Layer 3 problem" means the issue is with IP addressing or routing.

TCP vs UDP — What's the Difference?

Both are transport layer protocols (Layer 4), but they work differently:

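TCP is connection-oriented: the two sides first agree to talk using a three-way handshake (SYN → SYN-ACK → ACK), then exchange data. Every segment is acknowledged and lost segments are retransmitted, so data arrives complete and in order, or the connection reports an error.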
UDP is fire-and-forget: data is sent and may or may not arrive. Used for things like live video streaming where speed matters more than perfection.

IP Addresses: Your Device's Mailing Address

An IP address is a unique number that identifies a device on a network. Think of it like a mailing address — it's how other computers find and connect to yours.

A full network address, as an application sees it, has three parts:

Network ID — like the city and street in a postal address; tells routers which network to send data to
Host ID — like the building number; identifies the specific device
Port — like the door of the building; tells traffic which application or service to go to. Strictly, the port belongs to the transport layer rather than to the IP address itself, and is written after it (for example 192.168.2.14:80)
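
As a rough illustration, Python's standard ipaddress module can split an address into these parts. The /24 prefix length and the port number are assumptions chosen for the example; a real network may use a different prefix.

```python
import ipaddress

# Assumed example: host 192.168.2.14 on a /24 network, web service on port 80.
endpoint = "192.168.2.14:80"
host_text, port_text = endpoint.rsplit(":", 1)
port = int(port_text)

iface = ipaddress.ip_interface(f"{host_text}/24")
network_id = iface.network.network_address             # the network part
host_id = int(iface.ip) & int(iface.network.hostmask)  # the host part

print(f"network={network_id} host={host_id} port={port}")
```

The prefix length decides where the network part ends and the host part begins, which is why the same address can belong to different networks under different masks.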

IPv4 vs IPv6

IPv4 (e.g. 192.168.2.14) has been around since the 1980s. It uses 32-bit addresses, giving roughly 4.3 billion possible addresses. That pool is effectively exhausted as more devices come online: IANA allocated its last free address blocks in 2011.

IPv6 (e.g. 2001:0db8:85a3:0000:0000:8a2e:0370:7334) uses 128-bit addresses, providing about 3.4 × 10^38 (340 trillion trillion trillion) addresses. It was ratified as a full Internet Standard (RFC 8200) in 2017 and is where networking is headed.
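
The address-space figures can be checked with a few lines of Python. This is only a sketch using the standard ipaddress module; the last lines also show the shorter form in which the example IPv6 address is usually written.

```python
import ipaddress

ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses   # 2**32
ipv6_total = ipaddress.ip_network("::/0").num_addresses        # 2**128

print(f"IPv4 addresses: {ipv4_total:,}")
print(f"IPv6 addresses: {ipv6_total:.2e}")

# Leading zeros and one run of all-zero groups can be dropped.
addr = ipaddress.ip_address("2001:0db8:85a3:0000:0000:8a2e:0370:7334")
short_form = addr.compressed
print(short_form)
```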

NAT — Network Address Translation

At home, your ISP gives you a single public IP address. But you likely have many devices — phone, laptop, TV, tablet. NAT is what makes this work: it routes traffic from your single public IP to the correct device on your private network. Think of it like a mail clerk at a large office building — all mail arrives at one point, and the clerk delivers it to the right person.
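
The mail-clerk idea can be sketched as a small translation table. This is a toy model, not how any particular router implements NAT: the public address, the starting port, and the function names are all hypothetical.

```python
import itertools

PUBLIC_IP = "203.0.113.5"             # assumed public address (documentation range)
_next_port = itertools.count(40000)   # assumed first port used for translations
nat_table = {}                        # public port -> (private ip, private port)

def outbound(private_ip, private_port):
    """Rewrite an outgoing connection so it appears to come from the public IP."""
    public_port = next(_next_port)
    nat_table[public_port] = (private_ip, private_port)
    return PUBLIC_IP, public_port

def inbound(public_port):
    """Map a reply arriving at the public IP back to the private device."""
    return nat_table.get(public_port)

laptop = outbound("192.168.2.14", 51000)
phone = outbound("192.168.2.20", 51000)
print(laptop, phone)
print(inbound(laptop[1]), inbound(phone[1]))
```

Both devices use the same private port, yet the router can tell their replies apart because each connection received its own public port.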

Ports: Different Doors for Different Traffic

Port numbers run from 0 to 65,535 (65,536 values in total). Ports 0 to 1,023 are reserved for well-known services:

Port	Service
25	SMTP (email)
53	DNS
80	HTTP (web pages)
443	HTTPS (secure web pages)

Subnetting

Subnetting is dividing a large network into smaller, more manageable networks using a subnet mask, which marks how many leading bits of an address identify the network. Smaller networks are easier to manage, use IP addresses more efficiently, and generally perform better because broadcast traffic stays inside each subnet.
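
A short sketch of subnetting with Python's standard ipaddress module. The 192.168.0.0/24 network and the choice of four /26 subnets are assumptions made for the example.

```python
import ipaddress

network = ipaddress.ip_network("192.168.0.0/24")   # assumed example network
subnets = list(network.subnets(new_prefix=26))     # split into four /26 subnets

for subnet in subnets:
    usable = subnet.num_addresses - 2   # minus the network and broadcast addresses
    print(subnet, subnet.netmask, usable)

device = ipaddress.ip_address("192.168.0.130")
home = next(s for s in subnets if device in s)
print(f"{device} belongs to {home}")
```

Each extra bit in the mask halves the subnet size, so moving from /24 to /26 turns one network of 256 addresses into four of 64.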

DNS: The Phone Book of the Internet

Remembering IP addresses for every website you visit would be impossible. DNS (Domain Name System) solves this by translating human-friendly domain names (like google.com) into IP addresses that computers use.

How a DNS Query Works

When you type www.example.com into your browser, here's what happens — faster than you can blink:

Your device asks a Recursive Name Server (usually provided by your ISP): "What's the IP for www.example.com?"
The recursive server doesn't know, so it asks a Root Name Server: "Who handles .com domains?"
The root server points it to the TLD (Top-Level Domain) Server for .com
The TLD server points it to the Authoritative Name Server for example.com
The authoritative server knows the answer and returns the IP address
The recursive server passes that IP back to your device — and remembers it for a while in case you need it again

The whole process takes milliseconds. Because resolvers, operating systems, and browsers cache answers, the full lookup only runs when no cached answer is available.
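
The lookup steps can be mimicked with a toy resolver. Everything here is made up for illustration: the dictionaries stand in for real name servers, the server names are hypothetical, and the IP comes from a documentation range. A real resolver speaks the DNS protocol over the network and honours record TTLs.

```python
# Toy in-memory stand-ins for the name server hierarchy (all data is made up).
ROOT = {"com": "tld-com"}
TLD = {"tld-com": {"example.com": "ns-example"}}
AUTHORITATIVE = {"ns-example": {"www.example.com": "192.0.2.10"}}

cache = {}

def resolve(name):
    """Return (ip, steps), walking root -> TLD -> authoritative on a cache miss."""
    if name in cache:
        return cache[name], ["cache"]
    labels = name.split(".")
    tld_server = ROOT[labels[-1]]                          # root: who handles .com?
    auth_server = TLD[tld_server][".".join(labels[-2:])]   # TLD: who handles example.com?
    ip = AUTHORITATIVE[auth_server][name]                  # authoritative answer
    cache[name] = ip                                       # remember it for next time
    return ip, ["root", "tld", "authoritative"]

first = resolve("www.example.com")
second = resolve("www.example.com")
print(first)
print(second)
```

The second call is answered from the cache without touching the hierarchy, which is the shortcut the recursive server takes for repeat lookups.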

Domain Name Structure

A full domain name has several levels, read right to left:

www  .  example  .  com  .
 ↑         ↑        ↑    ↑
Sub-    Second-   Top-  Root
domain   level   level

Routing: How Data Gets From Here to There

Routing is the process of moving data from one network to another. It's done by routers, which forward data packets and decide the best path for them to take — like a GPS that constantly recalculates based on traffic conditions.

How Data Actually Travels

When you visit a website, your data doesn't travel as one big chunk. It gets broken into small packets, each stamped with:

A source address (your IP)
A destination address (the website's IP)
A sequence number so the destination can reassemble them in the right order, even if they arrive out of sequence

Routers pass these packets along — each one potentially choosing a different path — until they reach the destination.
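
A minimal sketch of the idea, assuming an 8-byte payload per packet and example addresses. Real packets carry binary headers and far larger payloads; the dictionary layout here is purely illustrative.

```python
import random

def packetize(message, src, dst, size=8):
    """Split a message into packets tagged with addresses and a sequence number."""
    return [
        {"src": src, "dst": dst, "seq": i, "data": message[off:off + size]}
        for i, off in enumerate(range(0, len(message), size))
    ]

def reassemble(packets):
    """Put packets back in sequence order and join their payloads."""
    ordered = sorted(packets, key=lambda p: p["seq"])
    return b"".join(p["data"] for p in ordered)

message = b"Hello from a packet-switched network!"
packets = packetize(message, "192.168.2.14", "203.0.113.80")
random.shuffle(packets)                  # simulate out-of-order arrival
print(reassemble(packets) == message)
```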

Routing Tables and Protocols

Every router maintains a routing table — essentially a map of which networks it knows about and how to reach them. Routers share this information with each other using routing protocols, a common language for communicating network information.
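
A routing table lookup can be sketched as follows. The table contents and next-hop names are hypothetical. When several entries match, the sketch picks the most specific one (the longest prefix), which is the rule IP routers commonly apply.

```python
import ipaddress

# Hypothetical routing table: destination network -> next hop.
ROUTES = {
    "0.0.0.0/0": "isp-gateway",          # default route, matches everything
    "10.0.0.0/8": "core-router",
    "10.1.0.0/16": "branch-router",
    "192.168.2.0/24": "local-switch",
}

def next_hop(destination):
    """Return the next hop of the matching route with the longest prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [
        ipaddress.ip_network(net) for net in ROUTES
        if addr in ipaddress.ip_network(net)
    ]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[str(best)]

for dest in ["10.1.2.3", "10.9.9.9", "192.168.2.14", "8.8.8.8"]:
    print(dest, "->", next_hop(dest))
```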

Static vs Dynamic Routes

Static routes — fixed paths, manually set by a network administrator; the data always takes the same route
Dynamic routes — flexible paths that change based on network conditions (congestion, outages, speed). Most internet traffic uses dynamic routing.

Useful Networking Commands

If you want to explore networking hands-on, here are some commands available on Linux (most have Windows or macOS equivalents):

Task	Modern Command	Legacy Command
Find your IP address	`ip address show`	`ifconfig`
View routing information	`ip route`	`route` or `netstat -rn`
View DNS configuration	`nmcli`	`cat /etc/resolv.conf`
Look up a domain's IP	`dig google.com`	`nslookup google.com`
Trace the route to a host	`traceroute google.com`	`traceroute google.com`
Check if a host is reachable	`ping google.com`	`ping google.com`

Note on ping: Many servers filter out ping responses for security reasons, so no response doesn't necessarily mean the host is down.

Putting It All Together

Here's a quick mental model of how everything connects when you browse a website:

You type example.com → DNS translates it to an IP address
Your data is broken into packets with source/destination addresses
Your home router sends them out via NAT through your single public IP
Routers across the internet pass the packets along, each choosing the best route
The packets arrive at the destination server (possibly out of order) and are reassembled using their sequence numbers
The server responds and the whole process happens in reverse

Where to Go From Here

Now that you have the foundations, some great next topics to explore are:

Subnetting Fundamentals — go deeper into dividing networks
Routing Fundamentals — understand how routers make their decisions in detail
Linux Networking — hands-on configuration and troubleshooting
Cloud Networking — how AWS, Azure, and Google Cloud implement virtual networks

Networking is one of those topics where the basics unlock everything else in IT and cloud computing. Once you understand how data moves, the rest starts to make a lot more sense.

Distributed Systems: Wrap-Up

Cloud Tuned — Fri, 14 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design

#	Post	What it covers
00	Distributed Systems: What Happens When Machines Disagree	Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar.
01	Network Partitions: The Failure Mode You Can't Design Away	Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice.
02	Split-Brain: When Two Nodes Both Think They're the Leader	Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it.
03	Heartbeats: How Nodes Know Their Peers Are Alive	Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work.
04	Leader Election: Agreeing on Who's in Charge	Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve.
05	Consensus Algorithms: Agreeing on a Value Across Failures	Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used.
06	Quorum: How Many Nodes Must Agree?	Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB.
07	Paxos: The Algorithm That Started It All	Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production.
08	Raft: Consensus Made Understandable	Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV.
09	Gossip Protocol: Decentralised Cluster Communication	Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production.
10	Logical Clocks: When Physical Time Isn't Enough	Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works.
11	Lamport Timestamps: Ordering Events Without a Global Clock	Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you.
12	Vector Clocks: Knowing When Events Are Truly Concurrent	Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations.
13	Distributed Transactions: When One Machine Isn't Enough	Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative.
14	Two-Phase Commit: Coordinating a Distributed Decision	2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems.
15	Three-Phase Commit: Solving 2PC's Blocking Problem	3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production.
16	Delivery Semantics: What Does "Delivered" Actually Mean?	Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate.
17	Change Data Capture: Streaming Your Database in Real Time	CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it.
18	Erasure Coding: Fault Tolerance Without Full Replication	Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication.
19	Merkle Trees: Efficiently Finding What's Different	Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy.
20	Observability: Understanding Your System at Runtime	Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together.
21	Distributed Systems: Wrap-Up ← you are here	A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series.

Twenty concepts in the final pillar. This post ties them together and closes the series with a complete picture of the URL shortener's architecture across all eight pillars.

The one thing to remember from each post

Network Partitions — partitions are inevitable. Every distributed system must have an explicit policy: stay available and accept potential inconsistency (AP), or refuse requests until the partition heals (CP). CAP doesn't tell you which to choose — it tells you the choice is unavoidable.

Split-Brain — two nodes simultaneously acting as primary cause data corruption. Prevention requires fencing (STONITH): ensure the old primary cannot write before promoting a new one. Never run critical databases in a two-node configuration without a tiebreaker.

Heartbeats — nodes detect peer failures through absence of periodic signals. Too short a timeout causes false positives; too long delays recovery. Phi accrual failure detectors adapt to observed network behaviour, reducing false positives under load.

Leader Election — Raft's term-based election: increment term, campaign for votes, win with a majority, prevent old leaders from reasserting via term comparison. The election gap (old leader dead to new leader established) is the fundamental cost of leader-based coordination.

Consensus Algorithms — Paxos, Raft, Zab allow distributed nodes to agree on a value despite a minority of failures. Safety (all decide the same thing) is unconditional; liveness (a decision is eventually made) requires a functioning majority.

Quorum — R + W > N guarantees consistency: the read quorum and write quorum overlap. Increase W and R for consistency; decrease for availability. Cassandra's tunable consistency levels let you choose per operation.
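
The overlap claim can be checked by brute force for small clusters. This sketch enumerates every possible read set and write set; the function name is illustrative only.

```python
from itertools import combinations

def quorums_overlap(n, r, w):
    """True when every read set of size r shares a node with every write set of size w."""
    nodes = range(n)
    return all(
        set(reads) & set(writes)
        for reads in combinations(nodes, r)
        for writes in combinations(nodes, w)
    )

for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 1, 5), (5, 2, 3)]:
    print(f"N={n} R={r} W={w}  R+W>N: {r + w > n}  overlap: {quorums_overlap(n, r, w)}")
```

In every case the two columns agree: a read is guaranteed to touch at least one node holding the latest write exactly when R + W > N.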

Paxos — prepare (gather promises not to accept older proposals) then accept (commit the value if promises hold). The foundational consensus algorithm. Correct but hard to implement; the inspiration for Raft.

Raft — leader election + log replication + safety. A candidate must win majority votes with an up-to-date log. The leader replicates entries to followers; commits when a majority acknowledge. Term numbers prevent old leaders from reclaiming authority.

Gossip Protocol — each node randomly selects a few peers and exchanges state every second. Information propagates exponentially fast (O(log N) rounds) with no central coordinator. Used by Cassandra, Redis Cluster, and Consul for cluster membership.
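
A rough simulation of how quickly a rumour spreads. It is a simplified push-only model with an assumed fanout of 2 and a fixed random seed; real implementations also pull state from peers and handle failures.

```python
import random

def gossip_rounds(n_nodes, fanout=2, seed=0):
    """Count rounds until every node has heard a rumour started by node 0."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        # Every informed node pushes the rumour to `fanout` random peers.
        for _ in list(informed):
            informed.update(rng.sample(range(n_nodes), fanout))
        rounds += 1
    return rounds

rounds_100 = gossip_rounds(100)
rounds_1000 = gossip_rounds(1000)
print(rounds_100, rounds_1000)
```

Growing the cluster tenfold adds only a few rounds, which is the logarithmic behaviour described above.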

Logical Clocks — physical clocks drift and can't establish causal order between events on different machines. Logical clocks track the happened-before relation instead.

Lamport Timestamps — monotonically increasing counters: increment on events, take max+1 on message receive. If A → B then ts(A) < ts(B). Can't detect concurrency.
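
A minimal sketch of the two rules, using a hypothetical LamportClock class:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send: increment the counter."""
        self.time += 1
        return self.time

    def receive(self, sent_time):
        """Message receive: move past both the local and the sender's counter."""
        self.time = max(self.time, sent_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()                      # a: local event, ts = 1
sent = a.tick()               # a: sends a message, ts = 2
b.tick()                      # b: unrelated local event, ts = 1
received = b.receive(sent)    # b: receives it, ts = max(1, 2) + 1 = 3
print(sent, received)
```

The send happened before the receive, and the timestamps reflect that. The reverse inference does not hold: a smaller timestamp alone does not prove one event caused another.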

Vector Clocks — one counter per process. Comparing two vector clocks reveals whether one happened-before the other or they're concurrent (each has a higher counter in some dimension). The basis of conflict detection in Dynamo and Riak.
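
The comparison rule can be sketched with plain dictionaries mapping process names to counters. The function name and return labels are illustrative.

```python
def compare(vc_a, vc_b):
    """Return 'before', 'after', 'equal', or 'concurrent' for two vector clocks."""
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened-before b
    if b_le_a:
        return "after"        # b happened-before a
    return "concurrent"       # each side is ahead in some dimension

print(compare({"p1": 1, "p2": 0}, {"p1": 2, "p2": 1}))
print(compare({"p1": 2, "p2": 0}, {"p1": 1, "p2": 1}))
```

The second pair is the conflict case: neither write saw the other, so the system has to merge them or ask the application to resolve the conflict.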

Distributed Transactions — atomicity across multiple nodes is expensive and fragile. Most cross-service coordination that appears to require distributed transactions can be handled with Saga patterns instead. Use CockroachDB or Spanner when you genuinely need distributed ACID.

Two-Phase Commit — prepare (all vote, hold locks) then commit/abort (coordinator sends decision). Guarantees atomicity but blocks if coordinator fails after votes but before commit message. The foundation of XA transactions.
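
The decision rule can be sketched in a few lines. This toy coordinator models only the voting logic; locks, timeouts, durable logging, and the coordinator-failure case are deliberately left out, and the participant names are hypothetical.

```python
def two_phase_commit(participants):
    """participants maps a name to a callable that returns True to vote yes."""
    # Phase 1 (prepare): collect every vote; real participants hold locks here.
    votes = {name: vote() for name, vote in participants.items()}
    # Phase 2 (commit/abort): commit only if the vote was unanimous.
    decision = "commit" if all(votes.values()) else "abort"
    return decision, votes

ok = two_phase_commit({"orders": lambda: True, "payments": lambda: True})
failed = two_phase_commit({"orders": lambda: True, "payments": lambda: False})
print(ok[0], failed[0])
```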

Three-Phase Commit — adds a pre-commit phase to allow independent commit decisions if coordinator fails. Works under crash failures; unsafe under network partitions. Rarely used in practice.

Delivery Semantics — at-most-once (fast, lossy), at-least-once (reliable, possible duplicates), exactly-once (expensive, achievable within Kafka-to-Kafka). At-least-once with idempotent consumers is the practical choice for almost everything.
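
A minimal sketch of an idempotent consumer, assuming each message carries a unique id. The in-memory set and the account data are placeholders; a real consumer would keep processed ids in durable storage, ideally updated in the same transaction as the side effect.

```python
processed_ids = set()        # placeholder for durable storage
balances = {"alice": 0}

def handle(message):
    """Apply a message once, even if the broker delivers it more than once."""
    if message["id"] in processed_ids:
        return False                         # duplicate delivery: skip
    balances[message["account"]] += message["amount"]
    processed_ids.add(message["id"])
    return True

deliveries = [
    {"id": "m1", "account": "alice", "amount": 10},
    {"id": "m2", "account": "alice", "amount": 5},
    {"id": "m1", "account": "alice", "amount": 10},   # redelivery of m1
]
results = [handle(m) for m in deliveries]
print(results, balances)
```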

Change Data Capture — read the database's WAL and stream every change as an event. The application writes once (to the database); CDC handles fan-out to Elasticsearch, data warehouses, cache invalidation, and the Outbox relay. Debezium is the standard tool.

Erasure Coding — split data into k data chunks and n−k parity chunks; any k of n reconstruct the data. 10+4 Reed-Solomon tolerates 4 node failures with 40% storage overhead versus 200% for 3× replication. Best for cold, large, infrequently accessed data.
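
The overhead figures follow from simple arithmetic, sketched below. The second half shows the simplest possible erasure code, a single XOR parity chunk. It is not Reed-Solomon and tolerates only one lost chunk, but the recovery idea is the same.

```python
def storage_overhead(data_chunks, parity_chunks):
    """Extra storage as a fraction of the original data size."""
    return parity_chunks / data_chunks

rs_10_4 = storage_overhead(10, 4)         # 10 data + 4 parity chunks
replication_3x = storage_overhead(1, 2)   # 1 original + 2 extra copies
print(f"RS 10+4: {rs_10_4:.0%}, 3x replication: {replication_3x:.0%}")

def xor_bytes(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

# Single-parity toy code: 2 data chunks + 1 parity chunk, any 2 of 3 suffice.
d1, d2 = b"abcd", b"wxyz"
parity = xor_bytes(d1, d2)
recovered_d1 = xor_bytes(parity, d2)      # rebuild d1 after losing it
print(recovered_d1 == d1)
```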

Merkle Trees — a tree of hashes where each parent is a hash of its children. Comparing root hashes proves identical content; different roots identify diverged subtrees in O(k log n) comparisons. Used by Cassandra for anti-entropy repair, Git for change detection, and Bitcoin for lightweight transaction verification.
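
A small sketch of computing a Merkle root with SHA-256. It assumes a non-empty list of leaves and duplicates the last hash when a level has an odd count, which is one common convention (Bitcoin uses it); other systems pad or promote the odd node differently.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Hash the leaves, then hash pairs upward until one root remains."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                  # odd count: duplicate the last hash
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"row1", b"row2", b"row3", b"row4"]
replica_b = [b"row1", b"row2", b"rowX", b"row4"]
print(merkle_root(replica_a) == merkle_root(list(replica_a)))   # identical data
print(merkle_root(replica_a) == merkle_root(replica_b))         # one row differs
```

When the roots differ, comparing the two child hashes at each level narrows the search to the subtree containing the changed row, without transferring the rows themselves.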

Observability — structured logs (what happened), metrics (how is the system behaving), and distributed traces (where did the time go). Together they make a distributed black box understandable at runtime. Not optional in microservices.

How the pillars connect

Pillar 1 (Foundations) established the theoretical limits: CAP theorem, consistency models, ACID vs BASE. Every decision in Pillars 4–8 is a concrete application of those tradeoffs.

Pillar 2 (Networking) established the transport: DNS, CDN, HTTP/TLS. The infrastructure that every distributed system rides on.

Pillar 3 (APIs) established communication patterns: REST, gRPC, WebSockets, webhooks. The interface between services and clients.

Pillar 4 (Data & Storage) established what to store and where: PostgreSQL, Cassandra, Redis, S3, Elasticsearch, vector databases. Every data decision in Pillar 8 references a system introduced in Pillar 4.

Pillar 5 (Caching) established the performance layer: cache-aside, eviction, distributed cache, stampede prevention. The reason most systems don't feel slow despite touching many data stores.

Pillar 6 (Scalability) established the infrastructure layer: load balancing, rate limiting, compression, checksums, probabilistic data structures. The plumbing between the internet and application code.

Pillar 7 (Architecture) established the system structure: how services are decomposed, how they communicate asynchronously, how resilience is built in, how data flows through pipelines. The blueprint that Pillar 8 fills in with correctness guarantees.

Pillar 8 (Distributed Systems) established what happens under failure: network partitions, split-brain, consensus, causal ordering, distributed transactions, data integrity, and observability. The foundation that makes all the other pillars work correctly under adversarial conditions.

The complete URL shortener architecture

═══════════════════════════════════════════════════════════════════
PILLAR 2: Networking
  DNS: Route 53 with geo-routing → nearest edge
  CDN: Cloudflare (Anycast, TLS termination, popular redirect caching)
═══════════════════════════════════════════════════════════════════

PILLAR 6: Scalability
  Rate limiting: Token bucket per API key (Redis counters, global)
  Load balancer: AWS ALB, least-connections algorithm
  Compression: Brotli for all text responses

PILLAR 7: Architecture (entry layer)
  BFF: Mobile BFF + Web BFF + Public API BFF
  Reverse proxy: Nginx (TLS termination, routing, compression)

═══════════════════════════════════════════════════════════════════
SERVICES
═══════════════════════════════════════════════════════════════════

Redirect Service (Go, latency-critical):
  PILLAR 5: Redis cache (consistent hashing, LFU eviction, SWR)
  PILLAR 4: PostgreSQL on cache miss (index scan on short_code)
  PILLAR 6: Bloom filter for unique visitor dedup (RedisBloom)
  PILLAR 6: HyperLogLog for unique visitor counts (Redis PFADD)
  PILLAR 8: Circuit breaker on PostgreSQL calls
  PILLAR 8: Distributed traces (OpenTelemetry → Honeycomb)

Link Service (Rails, CRUD + workflows):
  PILLAR 4: PostgreSQL (primary + 2 read replicas)
  PILLAR 7: Saga pattern for subscription upgrade workflow
  PILLAR 7: Outbox pattern (PostgreSQL outbox → Debezium CDC → Kafka)
  PILLAR 8: CDC via Debezium → link.created events to Kafka
  PILLAR 8: Circuit breakers on all external calls

Analytics Service (Python, stream + batch):
  PILLAR 4: Cassandra (click events, wide-column, TWCS compaction)
  PILLAR 4: TimescaleDB (metrics aggregates)
  PILLAR 7: CQRS: write to Cassandra, read from pre-aggregated views
  PILLAR 7: Stream processing: Kafka → Flink (real-time counters)
  PILLAR 7: Batch processing: nightly Spark job (historical aggregates)
  PILLAR 8: Gossip-based cluster membership (Cassandra ring)
  PILLAR 8: Quorum reads (LOCAL_QUORUM for dashboard queries)

User / Billing / Notification Services:
  PILLAR 4: PostgreSQL (each service owns its schema)
  PILLAR 7: Event-driven reactions to Kafka events
  PILLAR 7: Serverless (Lambda) for scheduled jobs

═══════════════════════════════════════════════════════════════════
DATA INFRASTRUCTURE
═══════════════════════════════════════════════════════════════════

PILLAR 4: Data stores
  PostgreSQL primary (EBS block storage, db.r5.8xlarge)
  PgBouncer (connection pool, transaction mode, 40 connections)
  Redis Cluster (3 primaries + 3 replicas, consistent hashing)
  Cassandra Cluster (6 nodes, gossip membership, phi accrual detection)
  InfluxDB (infrastructure metrics, time-window compaction)
  Elasticsearch (full-text search, Debezium CDC sync)
  Pinecone (semantic embeddings, HNSW ANN search)
  S3 (user uploads, exports; erasure-coded durability internally)

PILLAR 8: Distributed systems layer
  Raft consensus: etcd (Kubernetes state), CockroachDB (billing transactions)
  Gossip: Cassandra ring membership, Redis Cluster slot state
  Anti-entropy: Cassandra repair via Merkle tree comparison
  CDC: Debezium watching PostgreSQL WAL → Kafka
  Observability:
    Logs: structured JSON → Loki → Grafana
    Metrics: Prometheus → Grafana (p99 latency, error rate, saturation)
    Traces: OpenTelemetry → Honeycomb (distributed trace per request)

PILLAR 8: Failure handling
  Split-brain prevention: Patroni with EC2 instance termination fencing
  Heartbeats: 10s interval, phi accrual failure detection (Cassandra)
  Leader election: Raft (etcd), Patroni (PostgreSQL HA)
  Delivery semantics: at-least-once + idempotent consumers everywhere
  Partition policy:
    PostgreSQL: CP (Patroni requires quorum for promotion)
    Cassandra: AP tunable (LOCAL_QUORUM for reads, ONE for click writes)
    Redis: AP (last-write-wins on partition heal)

PILLAR 7 → 8 bridges:
  Saga + Outbox + CDC: subscription upgrades are saga-coordinated;
    each step uses outbox pattern for reliable event publishing;
    Debezium picks up outbox rows without polling
  Circuit breaker + Observability: every tripped circuit breaker
    emits a metric and a trace span, surfaced immediately in Grafana

What this series covered

Eight pillars. Roughly 120 concepts. One URL shortener that started as a single Rails app on a single server and ended as a distributed, multi-region, observable, fault-tolerant platform.

The series deliberately follows the same thread: every concept was introduced when the running example created a problem that concept solves. Consistent hashing appeared because the Redis cluster needed to add nodes without invalidating the cache. The Outbox pattern appeared because the Saga needed reliable event publishing. Merkle trees appeared because Cassandra needed to repair replicas that diverged during a partition.

Systems design isn't a collection of unrelated tools. It's a set of problems that naturally occur as systems grow, and a set of patterns that address each problem with known tradeoffs. The goal of this series was to build that mental model — not a checklist, but a way of reasoning about what a system needs and why.

Thank you for reading

This is the final post in the series. If you found it useful, the best thing you can do is share the specific posts that helped you — the ones that clicked something into place, answered a question you'd had for a while, or gave you language for something you'd been doing instinctively.

The series starts here: Pillar 1 — Foundations

← Previous: Observability — structured logs, metrics, and distributed tracing: the tools that make distributed systems understandable at runtime.

This is the final post in the System Design series. Start from the beginning: Pillar 1 — Foundations

Observability: Understanding Your System at Runtime

Cloud Tuned — Thu, 13 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design

#	Post	What it covers
00	Distributed Systems: What Happens When Machines Disagree	Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar.
01	Network Partitions: The Failure Mode You Can't Design Away	Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice.
02	Split-Brain: When Two Nodes Both Think They're the Leader	Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it.
03	Heartbeats: How Nodes Know Their Peers Are Alive	Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work.
04	Leader Election: Agreeing on Who's in Charge	Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve.
05	Consensus Algorithms: Agreeing on a Value Across Failures	Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used.
06	Quorum: How Many Nodes Must Agree?	Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB.
07	Paxos: The Algorithm That Started It All	Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production.
08	Raft: Consensus Made Understandable	Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV.
09	Gossip Protocol: Decentralised Cluster Communication	Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production.
10	Logical Clocks: When Physical Time Isn't Enough	Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works.
11	Lamport Timestamps: Ordering Events Without a Global Clock	Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you.
12	Vector Clocks: Knowing When Events Are Truly Concurrent	Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations.
13	Distributed Transactions: When One Machine Isn't Enough	Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative.
14	Two-Phase Commit: Coordinating a Distributed Decision	2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems.
15	Three-Phase Commit: Solving 2PC's Blocking Problem	3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production.
16	Delivery Semantics: What Does "Delivered" Actually Mean?	Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate.
17	Change Data Capture: Streaming Your Database in Real Time	CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it.
18	Erasure Coding: Fault Tolerance Without Full Replication	Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication.
19	Merkle Trees: Efficiently Finding What's Different	Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy.
20	Observability: Understanding Your System at Runtime ← you are here	Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together.
21	Distributed Systems: Wrap-Up	A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series.

Observability: Understanding Your System at Runtime

The problem

A user reports that redirects for their links are slow — sometimes 3–4 seconds instead of the normal 10ms. You look at the application dashboard: everything green. No errors logged. CPU, memory, and database query times all look normal.

But the user is experiencing real slowness. Somewhere in the system, something is wrong. Without instrumentation, you're debugging blind: the system is a black box, and the only signal you have is "users are unhappy."

This is the problem observability solves. An observable system tells you what it's doing at runtime — not just whether it's up, but why specific requests are slow, where in the call chain the latency is, what the error rate is per service, and which specific dependency is causing the problem.

The core idea

Observability is the property of a system that allows you to understand its internal state from its external outputs. In practice, observability consists of three types of telemetry data — logs, metrics, and traces — each answering a different question: what happened, how is the system behaving, and where did the time go.

The analogy: a pilot's cockpit instruments

An observable system is an aircraft with a full instrument panel. The pilot doesn't need to step outside to check if the engine is running: the tachometer shows RPM. They don't need to feel the altitude: the altimeter shows it. They don't need to guess whether the aircraft is drifting off course: the course deviation indicator shows it.

Each instrument answers a specific question. No single instrument tells the whole story — but together they give the pilot a complete picture of the aircraft's state.

An unobservable system is flying by feel in clouds: you know you're moving, you know roughly where you started, but you have no clear picture of where you are or what's about to go wrong.

The three pillars

Logs: what happened

Logs are timestamped, structured records of individual events in the system. They answer: "what exactly happened, and when?"

Structured logging (JSON) is the modern standard. Plain text logs are machine-readable only with complex parsing; JSON logs can be queried directly:

{
  "timestamp": "2025-06-01T14:00:00.123Z",
  "level": "INFO",
  "service": "redirect-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Redirect completed",
  "short_code": "x7Kp2",
  "destination": "https://example.com",
  "duration_ms": 8,
  "cache_hit": true,
  "user_agent": "Mozilla/5.0..."
}

Every log entry carries the trace ID (to correlate with a distributed trace) and span ID (to identify the specific operation within the trace).
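
As a rough illustration of how a service might emit entries shaped like the one above, here is a minimal sketch using only Python's standard logging module. The JsonFormatter class, the "context" key passed through extra=, and the hard-coded service name are assumptions made for this example, not part of any particular logging library.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": "redirect-service",  # assumed service name
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} lands on the record.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger() -> logging.Logger:
    logger = logging.getLogger("redirect-service")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info(
    "Redirect completed",
    extra={"context": {
        "trace_id": "abc123",
        "span_id": "def456",
        "short_code": "x7Kp2",
        "duration_ms": 8,
        "cache_hit": True,
    }},
)
```

Passing the trace and span IDs as structured fields, rather than interpolating them into the message string, is what lets the log store filter on them directly.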

Log aggregation: services ship logs to a central store (Elasticsearch/Kibana, Datadog, Loki/Grafana). Engineers query across all services by time, service, trace ID, user, error message, etc.

Key design principles: log at appropriate levels (DEBUG for development, INFO for normal operations, WARN for recoverable issues, ERROR for failures); include context (IDs, user references, relevant parameters); never log PII in production.

Metrics: how is the system behaving

Metrics are numeric measurements of system behaviour over time. They answer: "is the system healthy, and what's its current performance?"

The four Golden Signals (from Google's Site Reliability Engineering book):

Latency: how long requests take (p50, p95, p99)
Traffic: how many requests per second
Errors: error rate (% of requests failing)
Saturation: how full the system is (CPU %, memory %, queue depth)
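
To see why latency is reported as percentiles rather than an average, here is a small sketch using the nearest-rank method. The sample latencies are made up for illustration (98 fast requests and 2 slow ones), and percentile is a helper written for this example.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [8] * 98 + [3200, 3400]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"p50={p50}ms p95={p95}ms p99={p99}ms mean={mean:.2f}ms")
```

With this data the median and p95 both sit at 8 ms and the mean is under 100 ms, while p99 exposes the multi-second requests. That is the "dashboard is green but a user is suffering" situation from the opening problem.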

RED method (for services): Rate, Errors, Duration — the minimum viable metrics for any service.

USE method (for infrastructure): Utilisation, Saturation, Errors — for CPUs, memory, disks, networks.

Prometheus + Grafana is the standard open-source stack. Services expose a /metrics endpoint in Prometheus format; Prometheus scrapes it; Grafana visualises it.

# Prometheus metrics exposed by the redirect service
http_requests_total{method="GET",path="/r",status="200"} 1423904
http_request_duration_seconds_bucket{le="0.01"} 1390000
http_request_duration_seconds_bucket{le="0.1"} 1423500
redirect_cache_hits_total 1401234
redirect_cache_misses_total 22670
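
As a worked example of what these raw numbers say, the sketch below parses the sample above and derives a cache hit ratio and a latency breakdown. The parse_samples helper is written for this example and handles only the simple lines shown here; it also assumes the histogram buckets cover the same requests as http_requests_total. In a real setup these ratios would normally be computed by the metrics backend over a time window, not from lifetime counters.

```python
def parse_samples(text):
    """Parse 'series value' lines into a dict keyed by the full series name."""
    samples = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


EXPOSITION = """
# Prometheus metrics exposed by the redirect service
http_requests_total{method="GET",path="/r",status="200"} 1423904
http_request_duration_seconds_bucket{le="0.01"} 1390000
http_request_duration_seconds_bucket{le="0.1"} 1423500
redirect_cache_hits_total 1401234
redirect_cache_misses_total 22670
"""

s = parse_samples(EXPOSITION)
total = s['http_requests_total{method="GET",path="/r",status="200"}']
hits = s["redirect_cache_hits_total"]
misses = s["redirect_cache_misses_total"]

# Histogram buckets are cumulative: le="0.01" counts every request at or under 10 ms.
cache_hit_ratio = hits / (hits + misses)
under_10ms = s['http_request_duration_seconds_bucket{le="0.01"}'] / total
over_100ms = total - s['http_request_duration_seconds_bucket{le="0.1"}']

print(f"cache hit ratio: {cache_hit_ratio:.2%}")
print(f"requests at or under 10 ms: {under_10ms:.2%}")
print(f"requests slower than 100 ms: {over_100ms:.0f}")
```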

Alerting: alert rules fire when metrics cross thresholds. p99 latency > 100ms → PagerDuty alert. Error rate > 1% → alert. Queue depth > 1000 → alert.

Distributed tracing: where did the time go?

A trace records the journey of a single request through all the services it touched. It answers: "for this specific request, which service took how long, and where exactly was the time spent?"

Without tracing, a request that touches 5 services produces 5 separate log streams. Correlating them manually to understand which service was slow is laborious and error-prone.

With tracing, the entire request's journey is captured in one trace:

Trace: abc123 (total: 3247ms)
  [Redirect Service: 3247ms]
    ├─ DNS lookup: 1ms
    ├─ [Redis Cache: 8ms] → MISS
    └─ [PostgreSQL Query: 3230ms] ← THE SLOW PART
         └─ query: SELECT destination FROM links WHERE short_code='x7Kp2'
              indexes used: none (missing index on short_code)
              rows scanned: 50,000,000

This trace immediately identifies the problem: no index on short_code. Without tracing, this 3-second slowness would be a mystery — the application logs show "query executed," the database logs show "slow query," but connecting them requires manual correlation.
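
The same reasoning can be automated. The sketch below models the trace above as a tree of spans and picks the span with the largest self time (its own duration minus its children's). The Span class is a hypothetical structure for this example, not the OpenTelemetry data model, and it assumes child spans run one after another without overlapping.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        """Time spent in this span excluding its (sequential) children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span):
    """Yield the span and all of its descendants."""
    yield span
    for child in span.children:
        yield from walk(child)


trace = Span("Redirect Service", 3247, [
    Span("DNS lookup", 1),
    Span("Redis Cache", 8),
    Span("PostgreSQL Query", 3230),
])

bottleneck = max(walk(trace), key=Span.self_time_ms)
print(f"{bottleneck.name}: {bottleneck.self_time_ms()}ms of {trace.duration_ms}ms")
```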

OpenTelemetry: the standard for instrumentation. Services emit spans (individual operations within a trace) in OpenTelemetry format. Backends (Jaeger, Honeycomb, Datadog) store and visualise traces.

Trace propagation: when Service A calls Service B, it passes the trace ID and its current span ID in the W3C Trace Context traceparent HTTP header (schematically, traceparent: 00-<trace-id>-<parent-span-id>-<flags>). Service B creates a child span under the parent span. The entire distributed call chain is captured.

# Request headers from Service A to Service B
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
                  ↑ trace ID                    ↑ parent span ID
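
A minimal sketch of what Service B might do with that header: parse it, keep the trace ID, and mint a new span ID for its own work before calling further downstream. The function names are hypothetical, and the sketch only handles the four-field layout shown above; production services would normally rely on an OpenTelemetry SDK's propagator instead of hand-rolled parsing.

```python
import re
import secrets

# version - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError("all-zero trace or span ID is invalid")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def outgoing_traceparent(incoming: str) -> tuple[str, str]:
    """Return (header for downstream calls, this service's new span ID)."""
    ctx = parse_traceparent(incoming)
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    header = f"{ctx['version']}-{ctx['trace_id']}-{span_id}-{ctx['flags']}"
    return header, span_id


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
header, span_id = outgoing_traceparent(incoming)
print(ctx["trace_id"], "parent:", ctx["parent_id"], "new span:", span_id)
```

The trace ID stays constant across every hop, which is what allows the backend to stitch the spans from all services into one trace.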

The relationship between pillars

The three pillars answer different questions but are most powerful combined:

An alert fires (metric): redirect p99 latency > 500ms
  → Open traces for the time period → find the slow traces
  → Identify which span is slow (e.g., PostgreSQL query: 3.2s)
  → Check logs for that trace ID → confirm the exact query, parameters, and context
  → Root cause identified in minutes, not hours

Each pillar narrows the investigation. Metrics surface the symptom. Traces locate the service and operation. Logs provide the specific context.


Tradeoffs

Sampling. High-traffic services can't record 100% of traces because the storage cost is too high. Most tracing systems sample: record 1% or 10% of traces. This means rare slow requests may not be captured. Tail-based sampling (decide after a trace completes, and always keep traces with high latency or errors) mitigates this, at the cost of buffering each trace until it finishes.
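
One way to picture such a sampling policy is the sketch below: always keep traces that errored or were slow, and keep a fixed fraction of the rest. The keep_trace function, the 500 ms threshold, and the 1% base rate are illustrative assumptions. Hashing the trace ID, rather than calling a random number generator, means every service reaches the same decision for the same trace.

```python
import hashlib


def keep_trace(trace_id: str, duration_ms: float, had_error: bool,
               base_rate: float = 0.01, slow_ms: float = 500) -> bool:
    """Decide whether to store a completed trace."""
    # Interesting traces are always kept.
    if had_error or duration_ms >= slow_ms:
        return True
    # Map the trace ID to a stable number in [0, 1) and compare to the rate.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < base_rate


# 10,000 hypothetical fast, successful traces: roughly 1% should be kept.
kept = sum(keep_trace(f"trace-{i}", duration_ms=8, had_error=False)
           for i in range(10_000))
print(f"kept {kept} of 10000 ordinary traces")
print("slow trace kept:", keep_trace("abc123", duration_ms=3247, had_error=False))
```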

Cardinality explosion. Metrics with high-cardinality labels (one label value per user ID, per URL, per session) create millions of time series — most systems can't handle this. Keep metric label cardinality low; use logs or traces for high-cardinality data.
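
The arithmetic behind the explosion is simple multiplication: in the worst case, the number of time series for one metric is the product of the distinct values of each label. The label counts below are made-up figures for illustration.

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Worst-case number of time series for one metric."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-duration metric.
low = {"method": 4, "path": 10, "status": 6}
high = {**low, "user_id": 100_000}

print(f"bounded labels:   {series_count(low):,} series")
print(f"adding user_id:   {series_count(high):,} series")
```

Adding one unbounded label turns a few hundred series into tens of millions, which is why per-user detail belongs in logs and traces.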

Operational overhead. Running a log aggregation stack (Elasticsearch), a metrics system (Prometheus + Grafana), and a tracing backend (Jaeger) is three separate systems to maintain. Managed services (Datadog, Honeycomb, New Relic) trade cost for operational simplicity.

Instrumentation discipline. Observability requires intentional instrumentation — it doesn't happen automatically. Every service must emit structured logs, expose metrics, and propagate trace headers. This is an ongoing engineering effort, not a one-time setup.

The one thing to remember

Observability is not monitoring — it's the property of a system that lets you ask arbitrary questions about its behaviour at runtime. Logs answer what happened. Metrics answer how the system is performing. Traces answer where time was spent in a specific request. Together, they turn a distributed black box into a system you can reason about during incidents. In a microservices system, observability is not optional — without it, production debugging is archaeology.

← Previous: Merkle Trees — efficiently comparing large datasets across distributed nodes to find which parts have diverged.

→ Next: Distributed Systems — Wrap-up — tying together all 20 concepts in this final pillar and the complete series.

Merkle Trees: Efficiently Finding What's Different

Cloud Tuned — Wed, 12 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Merkle Trees: Efficiently Finding What's Different

The problem

Your Cassandra cluster has two replicas of the same partition. After a network partition healed, you're not sure they're in sync — some writes may have gone to only one replica during the partition. You need to identify which records differ so you can repair the inconsistency.

The brute-force approach: send every row from Replica A to Replica B, compare each one. For a partition with 100 million rows, that's 100 million rows transferred over the network — enormous I/O just to find a handful of diverged records.

A smarter approach: compare summaries of the data rather than the data itself. If the summary of all 100 million rows is the same on both replicas, no comparison needed. If they differ, recursively drill into subsections until you find the specific rows that diverged.

This is the Merkle tree.

The core idea

A Merkle tree is a binary tree where each leaf node contains the hash of a data block, and each non-leaf node contains the hash of its children's hashes. The root hash is a single value that represents the entire dataset: if any data block changes, the change propagates up through the tree and alters the root hash. Two datasets with the same root hash have identical content (assuming a collision-resistant hash function); different root hashes mean something differs, and the tree structure reveals exactly where.

The analogy: a library's section catalogue

A librarian checks whether two library branches have identical collections. Rather than comparing every book:

Compare the hash of the entire "Science Fiction" section. Same? No need to check individual SF books.
Compare the hash of "Classic SF". Different → drill deeper.
Compare "Classic SF, A–L". Same. Compare "Classic SF, M–Z". Different → drill deeper.
Identify the specific shelf, then the specific book that differs.

Each level narrows the search. Instead of comparing every book (O(n)), you do O(log n) comparisons to find each differing item.

How Merkle trees work

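Construction

Hash each data block to produce the leaves, then hash each pair of adjacent hashes to produce their parent, and repeat until a single root hash remains.

Two datasets with the same root hash are guaranteed identical (assuming collision-resistant hashes). Two datasets with different root hashes differ in at least one block.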
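
A minimal sketch of building a root hash, assuming SHA-256 and the convention of duplicating the last hash when a level has an odd number of nodes (one common choice; real systems differ in how they pad). The merkle_root function and the sample rows are illustrative.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: list) -> bytes:
    """Hash the blocks into leaves, then pair hashes upward until one remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [sha256(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad odd levels by repeating the last hash
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


replica_a = [b"row-1", b"row-2", b"row-3", b"row-4"]
replica_b = [b"row-1", b"row-2", b"row-3 (stale)", b"row-4"]

print("A:", merkle_root(replica_a).hex()[:16])
print("B:", merkle_root(replica_b).hex()[:16])
print("in sync:", merkle_root(replica_a) == merkle_root(replica_b))
```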

Efficient comparison

To find which blocks differ between two Merkle trees:

Compare root hashes. Same → done (all data identical). Different → proceed.
Compare the two children of the root. Find which subtree(s) differ.
Recursively compare the differing subtrees.
Continue until leaf nodes — the specific differing blocks are identified.

In a balanced binary tree with n leaves, this takes O(k log n) comparisons, where k is the number of differing blocks. For a dataset with 100 million blocks and only 100 differing, this is dramatically fewer comparisons than comparing all 100 million.
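
The comparison procedure can be sketched as a recursive descent that only follows mismatched subtrees. This is a simplified illustration, assuming both trees are built over the same number of blocks and that the count is a power of two; build_levels and diff are names invented for this example.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(blocks: list) -> list:
    """Return tree levels, leaves first and root last (block count must be a power of two)."""
    n = len(blocks)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch needs a power-of-two number of blocks")
    level = [h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(a: list, b: list, depth=None, index: int = 0) -> list:
    """Return indices of differing leaf blocks, skipping subtrees whose hashes match."""
    if depth is None:
        depth = len(a) - 1  # start at the root level
    if a[depth][index] == b[depth][index]:
        return []  # identical subtree: nothing below needs checking
    if depth == 0:
        return [index]
    return (diff(a, b, depth - 1, 2 * index)
            + diff(a, b, depth - 1, 2 * index + 1))


blocks_a = [f"row-{i}".encode() for i in range(8)]
blocks_b = list(blocks_a)
blocks_b[5] = b"row-5 (diverged)"

tree_a, tree_b = build_levels(blocks_a), build_levels(blocks_b)
print("differing blocks:", diff(tree_a, tree_b))
```

With eight blocks and one divergence, the descent inspects one path from root to leaf plus the matching siblings along the way, and never touches the other half of the tree.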

Where Merkle trees are used

Cassandra anti-entropy repair

Cassandra's nodetool repair command uses Merkle trees to synchronise replicas. For a given token range:

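Each replica builds a Merkle tree of its data for that range
Each replica sends its tree to the node coordinating the repair (over Cassandra's internode messaging, not gossip)
The coordinator compares the trees to find the leaf ranges whose hashes differ
Only the data in those differing ranges is streamed between replicas and reconciled

Without Merkle trees, repair would require comparing every row, which is too expensive for multi-terabyte datasets. With Merkle trees, only the divergent ranges are identified and repaired. Because each leaf covers a range of rows rather than a single row, some rows that already match may be streamed along with the ones that differ.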

Git

Git's internal object model is a Merkle tree. Every file is stored as a "blob" object (content + hash). Every directory is a "tree" object containing hashes of its children (files and subdirectories). Every commit is a "commit" object containing the hash of the root tree.

If two commits have different root tree hashes, something in the codebase changed. Finding what changed means traversing the tree and descending only into subtrees (directories) whose hashes differ, so the work is roughly proportional to the number of changed paths and their depth rather than to the total number of files.
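
To make the "content + hash" idea concrete, the sketch below computes a blob's object ID the way Git does for repositories using the default SHA-1 object format: hash a short header ("blob", the content length, a NUL byte) followed by the content. The git_blob_id function name is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID of a blob in a SHA-1 Git repository."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))             # ID of the empty blob
print(git_blob_id(b"version 1\n"))
print(git_blob_id(b"version 2\n"))  # any change in content gives a new ID
```

Because a tree object stores the IDs of its children, changing one file changes its blob ID, which changes every tree on the path up to the root, which is the Merkle property described above.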

Bitcoin blockchain

Each block in the Bitcoin blockchain contains a Merkle root of all transactions in that block. To verify that a specific transaction is included in a block, you need only log(n) hashes (the "Merkle proof") rather than all n transactions. Light clients (mobile wallets) use this to verify transactions without downloading the full blockchain.
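
Here is a simplified sketch of a Merkle proof: the prover supplies one sibling hash per level, and the verifier recomputes the path to the root. It uses single SHA-256 over arbitrary byte strings and repeats the last hash on odd levels; Bitcoin itself uses double SHA-256 over transaction IDs, so treat this as an illustration of the idea and not as Bitcoin-compatible code. All function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(items: list) -> list:
    """Tree levels from leaves to root; odd levels are padded with their last hash."""
    level = [h(item) for item in items]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2 == 1:
            level = level + [level[-1]]
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def make_proof(levels: list, index: int) -> list:
    """Collect (sibling hash, sibling_is_left) for each level below the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1  # neighbour within the pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(item: bytes, proof: list, root: bytes) -> bool:
    """Recompute the root from the item and its proof."""
    acc = h(item)
    for sibling, sibling_is_left in proof:
        acc = h(sibling + acc) if sibling_is_left else h(acc + sibling)
    return acc == root


txs = [f"tx-{i}".encode() for i in range(5)]
levels = build_levels(txs)
root = levels[-1][0]
proof = make_proof(levels, 4)

print("proof length:", len(proof))
print("tx-4 included:", verify(b"tx-4", proof, root))
print("forged tx included:", verify(b"tx-forged", proof, root))
```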

Dynamo-style replication synchronisation

Amazon's Dynamo design, and stores modelled on it such as Cassandra and Riak, use Merkle trees (hash trees) to compare replica state and identify diverged segments for repair.

Tradeoffs

Tree construction cost. Building a Merkle tree over a large dataset requires hashing every data block. For 1 TB of data, this is significant I/O and CPU; Cassandra's repair process is resource-intensive and is best scheduled during low-traffic periods.

Freshness vs accuracy. The tree is a snapshot at construction time. If data is changing rapidly while the tree is being built, the tree may be immediately stale. Cassandra limits repair scope to control this.

Depth vs granularity. A deeper tree finds differences with more precision (smaller data blocks per leaf) but requires more comparison steps. The right tree depth depends on the ratio of total data to expected number of differences.
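
A quick back-of-the-envelope calculation shows the tradeoff. It assumes a full binary tree with 2^depth leaves and rows spread evenly across them; the 100 million row figure reuses the example from earlier in this post, and leaf_granularity is a helper written for this illustration.

```python
def leaf_granularity(total_rows: int, depth: int) -> tuple:
    """Return (number of leaves, average rows covered by each leaf)."""
    leaves = 2 ** depth
    return leaves, total_rows / leaves


for depth in (10, 15, 20):
    leaves, rows = leaf_granularity(100_000_000, depth)
    print(f"depth {depth}: {leaves:,} leaves, about {rows:,.0f} rows per leaf")
```

A mismatch at a leaf only says that something in that leaf's rows differs, so a shallow tree means re-sending far more data per detected difference, while a deep tree costs more memory and more hashes to build and exchange.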

The one thing to remember

A Merkle tree makes comparing large datasets efficient: the root hash proves identity; differing subtrees narrow the search. Instead of comparing every record (O(n)), you compare O(log n) hashes per differing block to identify exactly which data blocks differ. This is how Cassandra finds inconsistent replicas without transferring terabytes of data, how Git tracks repository changes, and how Bitcoin verifies transactions without a full blockchain download.

← Previous: Erasure Coding — storing data redundantly using mathematics rather than full replication.

→ Next: Observability — structured logs, metrics, and distributed tracing: the tools that make distributed systems understandable at runtime.

Erasure Coding: Fault Tolerance Without Full Replication

Cloud Tuned — Tue, 11 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design

#	Post	What it covers
00	Distributed Systems: What Happens When Machines Disagree	Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar.
01	Network Partitions: The Failure Mode You Can't Design Away	Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice.
02	Split-Brain: When Two Nodes Both Think They're the Leader	Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it.
03	Heartbeats: How Nodes Know Their Peers Are Alive	Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work.
04	Leader Election: Agreeing on Who's in Charge	Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve.
05	Consensus Algorithms: Agreeing on a Value Across Failures	Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used.
06	Quorum: How Many Nodes Must Agree?	Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB.
07	Paxos: The Algorithm That Started It All	Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production.
08	Raft: Consensus Made Understandable	Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV.
09	Gossip Protocol: Decentralised Cluster Communication	Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production.
10	Logical Clocks: When Physical Time Isn't Enough	Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works.
11	Lamport Timestamps: Ordering Events Without a Global Clock	Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you.
12	Vector Clocks: Knowing When Events Are Truly Concurrent	Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations.
13	Distributed Transactions: When One Machine Isn't Enough	Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative.
14	Two-Phase Commit: Coordinating a Distributed Decision	2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems.
15	Three-Phase Commit: Solving 2PC's Blocking Problem	3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production.
16	Delivery Semantics: What Does "Delivered" Actually Mean?	Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate.
17	Change Data Capture: Streaming Your Database in Real Time	CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it.
18	Erasure Coding: Fault Tolerance Without Full Replication ← you are here	Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication.
19	Merkle Trees: Efficiently Finding What's Different	Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy.
20	Observability: Understanding Your System at Runtime	Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together.
21	Distributed Systems: Wrap-Up	A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series.

Erasure Coding: Fault Tolerance Without Full Replication

The problem

Amazon S3 is designed for eleven nines (99.999999999%) of object durability: the probability of losing any given object in a year is vanishingly small. If S3 achieved this through 3× replication (three full copies of every object), storing 1 PB of data would require 3 PB of raw storage. At S3's scale (hundreds of exabytes), that is an enormous storage overhead.

Is there a way to achieve the same durability guarantee with less storage overhead?

Yes: erasure coding.

The core idea

Erasure coding divides data into k data chunks, encodes them into n total chunks (n > k) using a mathematical transformation, and distributes the n chunks across n nodes. Any k of the n chunks are sufficient to reconstruct the original data. The system can tolerate losing up to n − k chunks (nodes) without data loss.

The overhead is n/k rather than a full replication factor. A common configuration is 10+4 (k=10, n=14): the original data is encoded into 14 chunks, any 10 are sufficient for recovery, and 4 nodes can fail without data loss. Storage overhead is 14/10 = 1.4×, versus 3× for triple replication, which survives the loss of only 2 copies.
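The arithmetic above can be checked with a short sketch. It assumes each chunk (or replica) is lost independently with the same probability p, which real clusters only approximate; the value p = 0.01 is a hypothetical figure chosen for illustration, and replication is modelled as a 1-of-3 code.

```python
from math import comb


def storage_overhead(k: int, n: int) -> float:
    """Raw bytes stored per byte of user data for a k-of-n code."""
    return n / k


def loss_probability(k: int, n: int, p: float) -> float:
    """Probability that more than n - k of the n chunks are lost,
    assuming each chunk is lost independently with probability p."""
    tolerated = n - k
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(tolerated + 1, n + 1)
    )


if __name__ == "__main__":
    p = 0.01  # hypothetical per-chunk loss probability
    schemes = [("3x replication", 1, 3), ("10+4 erasure coding", 10, 14)]
    for name, k, n in schemes:
        print(
            f"{name}: overhead {storage_overhead(k, n):.1f}x, "
            f"tolerates {n - k} losses, "
            f"P(data loss) = {loss_probability(k, n, p):.2e}"
        )
```

Under these assumptions the 10+4 layout stores less than half the raw bytes of triple replication while being less likely to lose data, because it takes five simultaneous losses rather than three to destroy an object.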

The analogy: a torn photograph reconstructed from fragments

A photograph is cut into 14 pieces and distributed to 14 people. The rule: any 10 of the 14 pieces are sufficient to reconstruct the complete photograph. Even if 4 people lose their pieces, the photograph can still be recovered from the remaining 10.

The encoding adds some redundancy (14 pieces where 10 would hold the whole image), but far less than keeping 3 complete photographs.

How erasure coding works

Reed-Solomon codes

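Reed-Solomon is the most common erasure coding scheme. The original data is treated as a polynomial of degree less than k over a finite field; the k data chunks are the polynomial's coefficients. The n total chunks are evaluations of the polynomial at n distinct points. Because a polynomial of degree less than k is uniquely determined by any k of its values, any k evaluations are sufficient to recover the original polynomial (and thus the data) through polynomial interpolation. Practical implementations usually use a systematic form, in which the k data chunks are stored unchanged and only the n − k parity chunks are computed.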

Original data: 10 chunks of 100 MB each = 1 GB

Encoding (10+4 Reed-Solomon):
  Compute 4 additional "parity" chunks using polynomial math
  Result: 14 chunks × 100 MB = 1.4 GB stored (1.4× overhead)
  
  Any 10 of 14 chunks reconstruct the original
  System tolerates loss of any 4 chunks (nodes)
  
vs 3× replication:
  3 complete copies × 1 GB = 3 GB stored (3× overhead)
  Tolerates loss of any 2 copies
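The polynomial view can be made concrete with a toy encoder. This is a minimal sketch, not a production codec: it works over the prime field GF(257) for readability (real implementations use GF(2^8) arithmetic), it uses the non-systematic "data as coefficients" form described above, and the names encode and decode are illustrative. It needs Python 3.8+ for the modular inverse via pow.

```python
from itertools import combinations

P = 257  # small prime modulus; every data symbol must be < P


def encode(data, n):
    """Treat `data` (k symbols, lowest-degree coefficient first) as a
    polynomial and return n chunks as (x, f(x)) pairs for x = 1..n."""
    def f(x):
        acc = 0
        for coeff in reversed(data):  # Horner's rule
            acc = (acc * x + coeff) % P
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]


def decode(chunks, k):
    """Recover the k coefficients from any k chunks by Lagrange interpolation."""
    pts = chunks[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(pts):
        basis = [1]  # coefficients of prod_{j != i} (x - xj)
        denom = 1    # prod_{j != i} (xi - xj)
        for j, (xj, _) in enumerate(pts):
            if j == i:
                continue
            grown = [0] * (len(basis) + 1)
            for d, c in enumerate(basis):  # multiply basis by (x - xj)
                grown[d] = (grown[d] - c * xj) % P
                grown[d + 1] = (grown[d + 1] + c) % P
            basis = grown
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, -1, P) % P
        for d in range(k):
            coeffs[d] = (coeffs[d] + scale * basis[d]) % P
    return coeffs


if __name__ == "__main__":
    data = [72, 101, 108, 108, 111, 33, 0, 255, 7, 42]  # k = 10 symbols
    chunks = encode(data, 14)                           # n = 14 chunks
    survivors = chunks[4:]                              # lose any 4 chunks
    print(decode(survivors, 10) == data)
```

Dropping any four of the fourteen chunks leaves ten points, which is exactly enough to pin down a polynomial with ten coefficients.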

The reconstruction process

If a node fails and its chunk is lost, the system must reconstruct it:

Read any k surviving chunks
Apply the inverse Reed-Solomon transform (polynomial interpolation)
Recover the original data
Re-encode the missing chunk
Write the recovered chunk to a replacement node

This is more computationally expensive than simply copying a replica — reconstruction requires reading k chunks and computing the inverse transform. For large objects, reconstruction is I/O and CPU intensive.
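The I/O cost of a repair is simple arithmetic. The sketch below reuses the 100 MB chunk size from the earlier 10+4 example; the function name is illustrative only.

```python
def repair_read_mb(chunk_mb: float, chunks_read: int) -> float:
    """Data that must be read to rebuild one lost chunk of `chunk_mb` megabytes."""
    return chunk_mb * chunks_read


# 10+4 erasure coding: read k = 10 surviving chunks to rebuild one.
ec_read = repair_read_mb(100, chunks_read=10)

# Replication: copy the lost block from one surviving replica.
replica_read = repair_read_mb(100, chunks_read=1)

print(f"erasure coding reads {ec_read:.0f} MB, replication reads {replica_read:.0f} MB")
print(f"repair amplification: {ec_read / replica_read:.0f}x")
```

The repair amplification equals k, which is why wide codes (large k) save storage but make each node failure more expensive to heal.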

Where erasure coding is used

Amazon S3: uses erasure coding internally (specific scheme not disclosed). Objects are split across multiple storage nodes; the system can recover from multiple simultaneous node failures.

HDFS (Hadoop): HDFS 3.x supports erasure coding via EC policies. A common configuration is 6+3 (6 data blocks, 3 parity blocks). For cold data that's rarely accessed, erasure coding reduces storage from 3× to 1.5× while tolerating the loss of 3 blocks, rather than the 2 replicas that 3× replication can lose.

Ceph: supports erasure-coded pools alongside replicated pools. Cold data goes to EC pools for storage efficiency.

Meta's Tectonic: Meta's internal distributed filesystem uses erasure coding for the vast majority of cold data storage.

RAID 5/6 (local disks): the familiar RAID levels are forms of erasure coding. RAID 5 is k+1 (one disk's worth of parity, tolerating one disk failure) and RAID 6 is k+2 (two disks' worth of parity, tolerating two failures).
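The k+1 case is small enough to show in full, because single parity is just XOR. The sketch below assumes equal-length chunks and uses illustrative function names; it mirrors the idea behind RAID 5 parity rather than any particular RAID implementation.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(chunks: list[bytes]) -> bytes:
    """Parity chunk for a k+1 layout: the XOR of all k data chunks."""
    return reduce(xor_bytes, chunks)


def rebuild(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing data chunk from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


if __name__ == "__main__":
    chunks = [b"abcd", b"efgh", b"ijkl"]          # k = 3 data chunks
    parity = make_parity(chunks)                  # 1 parity chunk
    lost = rebuild([chunks[0], chunks[2]], parity)
    print(lost == chunks[1])
```

XOR works because every surviving chunk cancels itself out of the parity, leaving only the missing one. Tolerating two or more losses needs a stronger code such as Reed-Solomon.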

Tradeoffs

Storage efficiency vs recovery cost. Erasure coding is more storage-efficient than replication. The cost: recovery requires reading k chunks (more I/O than copying one replica) and computing the inverse transform (more CPU). For hot data accessed frequently, the extra cost of reads that hit a missing chunk, on top of the multi-node read path, usually makes erasure coding a poor fit.

Write overhead. Writing a new object requires computing all n chunks (encoding), not just writing identical copies. Encoding is fast (Reed-Solomon implementations are heavily optimised), but it's more work than a simple replicate-and-write.

Latency. Reading an erasure-coded object requires reading from k nodes simultaneously. Network round-trips to k nodes add latency compared to reading from the nearest replica.

Best for cold/archive data. Erasure coding's efficiency advantage is most valuable for large, infrequently accessed datasets (backups, logs, cold storage). For frequently accessed hot data, replication's simpler read path (read from nearest replica, no reconstruction) usually wins.

The one thing to remember

Erasure coding achieves fault tolerance with less storage overhead than replication by splitting data into k data chunks and n−k parity chunks, such that any k of the n total chunks reconstruct the original. A 10+4 configuration tolerates 4 node failures with only 40% storage overhead — versus 200% overhead for 3× replication. The cost is more complex reads and writes, and expensive reconstruction when chunks are lost. Erasure coding is the right choice for large, cold datasets where storage cost matters more than read performance.

← Previous: Change Data Capture — streaming database changes in real time by reading the write-ahead log.

→ Next: Merkle Trees — efficiently comparing large datasets across distributed nodes to find which parts have diverged.

Change Data Capture: Streaming Your Database in Real Time

Cloud Tuned — Mon, 10 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design


The problem

Your URL shortener's links table in PostgreSQL is the source of truth. Multiple downstream systems need to stay current:

Elasticsearch must index new and updated links for search
The analytics data warehouse must receive link metadata changes
The Outbox pattern relay must publish events from the outbox table
A Redis cache must be invalidated when link destinations change

The naive approach: the application that writes to PostgreSQL also writes to each downstream system. This is dual-write — we've seen its problems (post 10, Outbox pattern). The write to PostgreSQL and the writes to downstream systems are not atomic; if any downstream write fails, systems diverge.

The better approach: instead of the application writing to multiple places, let the database itself be the single source of truth and stream every change to an event bus. Downstream systems consume from the event bus. This is Change Data Capture.

The core idea

Change Data Capture (CDC) reads the database's internal change log (the Write-Ahead Log for PostgreSQL, the binary log for MySQL) and streams every INSERT, UPDATE, and DELETE as an event to downstream consumers. The application writes only to the database. The CDC pipeline handles the fan-out.

The analogy: a stenographer recording every decision

A committee meeting makes decisions verbally. Rather than have the chairperson personally notify every stakeholder, a stenographer records every statement in real time. Stakeholders who need to know about decisions subscribe to the stenographer's transcript.

The committee (application) speaks once. The stenographer (CDC) captures it all. Stakeholders (downstream systems) get the full record without the committee knowing they exist.

How CDC works

The Write-Ahead Log (WAL)

PostgreSQL (like most relational databases) maintains a WAL: a sequential, append-only log in which every change is recorded before it is applied to the data files. The WAL's primary purpose is durability: if the database crashes, it replays the WAL on recovery.

CDC reads the WAL and converts its binary records into structured change events:

WAL entry: INSERT into links (id='x7Kp2', dest='https://example.com', user_id=123)
CDC event:
  {
    "op": "c",              // c=create, u=update, d=delete
    "source": {"table": "links", "lsn": 291840},
    "before": null,         // no previous state on insert
    "after": {
      "id": "x7Kp2",
      "dest": "https://example.com",
      "user_id": 123,
      "created_at": "2025-06-01T14:00:00Z"
    }
  }

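For an UPDATE, both before and after are populated. For a DELETE, before is the deleted row and after is null. In PostgreSQL, how much of the old row appears in before depends on the table's REPLICA IDENTITY setting: with the default, at most the primary key columns are included, and REPLICA IDENTITY FULL is needed for the complete old row.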
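A consumer of these events can be very small. The sketch below keeps an in-memory copy of the links table in step with the stream; it assumes events shaped like the example above (op, before, after) and handles only the three operations listed there. The function name apply_change is illustrative.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event to a dict keyed by row id."""
    op = event["op"]
    if op in ("c", "u"):
        row = event["after"]
        replica[row["id"]] = row           # insert or overwrite with the new state
    elif op == "d":
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unhandled op: {op!r}")


if __name__ == "__main__":
    replica = {}
    events = [
        {"op": "c", "before": None,
         "after": {"id": "x7Kp2", "dest": "https://example.com"}},
        {"op": "u", "before": {"id": "x7Kp2"},
         "after": {"id": "x7Kp2", "dest": "https://example.org"}},
    ]
    for event in events:
        apply_change(replica, event)
    print(replica["x7Kp2"]["dest"])
```

Because each event carries the full new row state, applying the same event twice leaves the replica unchanged, which suits at-least-once delivery.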

Debezium

Debezium is the dominant open-source CDC platform. It runs as a Kafka Connect connector: it reads the PostgreSQL WAL (or MySQL binlog, MongoDB oplog, etc.) and publishes change events to Kafka topics.

PostgreSQL WAL
  → Debezium PostgreSQL connector (Kafka Connect)
  → Kafka topic: postgres.public.links
  → Consumers:
      Elasticsearch connector (indexes new/changed links)
      Analytics connector (writes to data warehouse)
      Outbox relay connector (publishes Outbox messages)
      Cache invalidation service (deletes Redis keys on change)

Logical replication slots (PostgreSQL)

For CDC to work, PostgreSQL must preserve WAL entries until the CDC consumer has read them. This is done via a logical replication slot: a named cursor in the WAL that tracks how far the CDC consumer has read. PostgreSQL retains WAL up to the oldest slot's position.

-- Create a replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

Important: an unused replication slot causes WAL to accumulate indefinitely — PostgreSQL won't clean it up. A stalled CDC pipeline with a replication slot can fill the database's disk. Monitor replication slot lag.
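Slot lag is the distance between the server's current WAL position and the position a slot still needs. PostgreSQL prints those positions (for example from pg_current_wal_lsn() and the restart_lsn column of pg_replication_slots) as two hexadecimal numbers separated by a slash. The sketch below converts that text form to a byte offset and flags slots that are too far behind; the helper names, the sample LSN values, and the threshold are illustrative, and fetching the values from the database is left out.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(current_wal_lsn: str, slot_lsn: str) -> int:
    """Bytes of WAL the server must retain on behalf of a slot."""
    return lsn_to_int(current_wal_lsn) - lsn_to_int(slot_lsn)


def slots_over_threshold(slots: dict[str, str], current_wal_lsn: str,
                         max_lag_bytes: int) -> list[str]:
    """Names of slots whose retained WAL exceeds max_lag_bytes."""
    return sorted(
        name for name, lsn in slots.items()
        if slot_lag_bytes(current_wal_lsn, lsn) > max_lag_bytes
    )


if __name__ == "__main__":
    slots = {"debezium_slot": "0/3000000", "healthy_slot": "0/5FFFF00"}
    print(slots_over_threshold(slots, "0/6000000", max_lag_bytes=16 * 1024 * 1024))
```

In SQL the same difference can be computed server-side with pg_wal_lsn_diff; the Python version is useful when the values have already been scraped into a monitoring system.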

What CDC enables

Real-time search index updates. Instead of a nightly Elasticsearch reindex, every link INSERT or UPDATE propagates to Elasticsearch within seconds. Search results reflect the latest data.

Cache invalidation. When a link's destination is updated, a CDC consumer deletes the corresponding Redis cache entry. The next redirect hits the database for the fresh value — no TTL wait for stale data.

Event sourcing without application changes. The database's change log becomes an event stream. Downstream systems can replay from any point in history.

Data warehouse synchronisation. The ELT pipeline (Pillar 7, post 18) can use CDC for near-real-time data warehouse updates instead of nightly batch ETL jobs.

Outbox pattern relay. The most reliable Outbox implementation uses CDC: Debezium watches the outbox table and publishes messages to Kafka as rows are inserted. No polling loop, no additional database queries.
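As one illustration of these uses, cache invalidation reduces to a small decision per event. The sketch below assumes the event shape shown earlier and a hypothetical cache key format of link:<id>; it returns the keys to delete rather than talking to Redis, so the decision logic stays separate from the cache client.

```python
def keys_to_invalidate(event: dict) -> list[str]:
    """Cache keys that a change event makes stale (hypothetical 'link:<id>' keys)."""
    op = event["op"]
    before = event.get("before")
    after = event.get("after")
    if op == "d" and before is not None:
        return [f"link:{before['id']}"]
    if op == "u" and after is not None:
        # If the old row is missing or partial, invalidate to be safe.
        if before is None or before.get("dest") != after.get("dest"):
            return [f"link:{after['id']}"]
    return []  # inserts and updates that leave dest unchanged need no invalidation


if __name__ == "__main__":
    event = {
        "op": "u",
        "before": {"id": "x7Kp2", "dest": "https://example.com"},
        "after": {"id": "x7Kp2", "dest": "https://example.org"},
    }
    print(keys_to_invalidate(event))
```

Deleting a key that is already absent is harmless, so this consumer is safe to run under at-least-once delivery.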

Tradeoffs

WAL format coupling. Debezium decodes changes from the WAL through logical decoding, so its events mirror the table schema. Major schema changes (dropping a column, changing a column type) can break the CDC pipeline or its consumers if the connector's schema configuration and the downstream code aren't updated in step.

Replication slot lag. A slow or stalled CDC consumer causes WAL retention to grow. PostgreSQL must keep WAL segments for the stalled consumer, potentially filling disk. Monitor pg_replication_slots and alert on lag.

Ordering and partitioning. Debezium publishes each table's events in WAL order to that table's topic, and Kafka preserves order only within a partition. Across tables (and therefore across topics), ordering is not guaranteed. Consumers that depend on cross-table ordering (applying a parent row insert before a child row insert) need careful topic and partition design, or must tolerate out-of-order arrival.

Not for all databases. CDC via the change log requires the database to expose that log for logical replication (PostgreSQL 9.4+ for logical decoding, with the built-in pgoutput plugin from version 10; MySQL 5.5+ binlog; MongoDB 3.6+ oplog). Older databases or some cloud-managed databases may not expose the log for CDC.

The one thing to remember

CDC turns your database into an event source by reading its internal change log — the WAL — and streaming every commit as a structured event. The application writes only to the database; CDC handles fan-out to all downstream consumers (search, cache, analytics, other services). This solves the dual-write problem: the database write and the downstream event are guaranteed to be consistent because the event is derived from the already-committed WAL, not written separately. Debezium is the standard tool for this in PostgreSQL/MySQL/MongoDB environments.

← Previous: Delivery Semantics — at-most-once, at-least-once, and exactly-once: what each guarantees and what it costs.

→ Next: Erasure Coding — storing data redundantly using mathematics rather than full replication.

Delivery Semantics: What Does "Delivered" Actually Mean?

Cloud Tuned — Sun, 09 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design


The problem

Your URL shortener's click pipeline publishes events to Kafka. The analytics consumer processes each event and increments a counter. What guarantees does the system provide?

If the consumer crashes after processing an event but before committing its Kafka offset, Kafka will re-deliver the event on restart. The counter gets incremented twice for one click — double-counting.

If the consumer commits its offset before processing, and then crashes during processing, the event is marked consumed but was never processed — the counter was never incremented. The click is lost.

Both outcomes are wrong. The question is: which wrongness are you willing to accept, and what does it cost to avoid it?

The core idea

Delivery semantics describe what a messaging system guarantees about how many times a message is delivered to a consumer: at most once (zero or one time), at least once (one or more times), or exactly once (neither lost nor duplicated). Each guarantee reflects a different tradeoff between performance, reliability, and implementation complexity.

The analogy: postal delivery policies

At-most-once: the postal service attempts delivery once. If no one is home, the letter is discarded — no second attempt. You might not get the letter (under-delivery), but you'll never get the same letter twice.

At-least-once: the postal service attempts delivery and tries again if unconfirmed. You will eventually receive the letter — possibly multiple copies if the first delivery wasn't confirmed but was received.

Exactly-once: the postal service guarantees you receive the letter exactly once. This requires tracking every delivery, detecting duplicates, and deduplicating before handing you the envelope. Much more work for the postal service.

At-Most-Once

A message is delivered zero or one times. If something fails, the message may be lost — but will never be processed twice.

How: the producer sends a message and doesn't retry on failure. The consumer acknowledges before processing — if it crashes after acknowledging but before processing, the message is gone.

Producer → sends event → does not retry on failure
Consumer → acks message → processes event
         ↑
         Ack before processing: if crash here, message lost

Use cases: metrics, telemetry, non-critical logs. Losing an occasional event is acceptable; duplicate processing would corrupt aggregates.

Cost: possible message loss.

At-Least-Once

A message is delivered one or more times. The consumer will process every message, but may process some more than once.

How: the producer retries until the broker acknowledges. The consumer processes before acknowledging — if it crashes after processing but before acknowledging, the message is re-delivered and re-processed.

Producer → sends event
  → timeout → retry (message sent again)
  → broker acks → stop retrying

Consumer → processes event
         → acks message
         ↑
         If crash here, message is re-delivered and re-processed
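The only difference between the two modes is the order of "acknowledge" and "process". A small simulation makes the consequence visible. It is a toy model under stated assumptions: a single consumer, a broker that redelivers from the last acknowledged position, and exactly one crash that happens between the two steps while handling a chosen message.

```python
def run_consumer(messages, ack_first: bool, crash_on: str):
    """Return the messages actually processed, in order, given one crash
    between the ack and process steps while handling `crash_on`."""
    processed = []
    committed = 0      # index of the next message the broker will deliver
    crashed = False
    while committed < len(messages):
        msg = messages[committed]
        try:
            if ack_first:                        # at-most-once ordering
                committed += 1                   # ack
                if msg == crash_on and not crashed:
                    crashed = True
                    raise RuntimeError("crash after ack, before processing")
                processed.append(msg)            # process
            else:                                # at-least-once ordering
                processed.append(msg)            # process
                if msg == crash_on and not crashed:
                    crashed = True
                    raise RuntimeError("crash after processing, before ack")
                committed += 1                   # ack
        except RuntimeError:
            continue   # restart and resume from the last committed position
    return processed


if __name__ == "__main__":
    clicks = ["a", "b", "c"]
    print(run_consumer(clicks, ack_first=True, crash_on="b"))   # "b" is lost
    print(run_consumer(clicks, ack_first=False, crash_on="b"))  # "b" appears twice
```

Acknowledging first loses the message that was in flight; processing first repeats it.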

The idempotency requirement: if a consumer might process the same message twice, it must be idempotent — processing the same message twice produces the same result as processing it once.

# `db` is assumed to be a database handle whose execute() takes positional parameters.

# Non-idempotent (wrong for at-least-once):
def process_click(event):
    db.execute("UPDATE links SET click_count = click_count + 1 WHERE id = ?", event.link_id)
    # If this runs twice: click_count is incremented twice

# Idempotent (correct):
def process_click(event):
    # id has a UNIQUE constraint, so ON CONFLICT skips a redelivered event silently
    db.execute("INSERT INTO click_events (id, link_id, ...) VALUES (?, ?, ...) "
               "ON CONFLICT (id) DO NOTHING",
               event.id, event.link_id, ...)
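The idempotent pattern can be exercised end to end with an in-memory SQLite database. This is a minimal sketch rather than the series' actual schema: the two-column click_events table and the helper names are assumptions, and INSERT OR IGNORE is SQLite's spelling of the "skip duplicates" insert.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory database with one row per click event, keyed by event id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE click_events (id TEXT PRIMARY KEY, link_id TEXT NOT NULL)"
    )
    return conn


def process_click(conn: sqlite3.Connection, event_id: str, link_id: str) -> None:
    """Record a click; a redelivered event with the same id is ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO click_events (id, link_id) VALUES (?, ?)",
        (event_id, link_id),
    )
    conn.commit()


def click_count(conn: sqlite3.Connection, link_id: str) -> int:
    """Derive the count from stored events instead of incrementing a counter."""
    row = conn.execute(
        "SELECT COUNT(*) FROM click_events WHERE link_id = ?", (link_id,)
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = make_db()
    process_click(conn, "evt-1", "x7Kp2")
    process_click(conn, "evt-1", "x7Kp2")   # redelivery of the same event
    process_click(conn, "evt-2", "x7Kp2")
    print(click_count(conn, "x7Kp2"))
```

The count is computed from distinct event rows, so a redelivered event cannot inflate it.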

Use cases: most event-driven workloads. The most common production choice.

Cost: consumers must be idempotent; some events are processed more than once (but with idempotency, the end result is correct).

Exactly-Once

A message is delivered exactly once — no loss, no duplicates. The hardest guarantee.

The problem: "exactly once" in the context of distributed systems requires coordinating the producer, broker, and consumer atomically. The consumer must atomically: process the event AND mark it as consumed — such that if either step fails, both are retried or rolled back together.

Kafka's exactly-once implementation:

Kafka 0.11+ provides exactly-once semantics within a Kafka-to-Kafka workflow:

Idempotent producer: each producer has a unique ID and sequence number per partition. Kafka deduplicates retried produces — the same sequence number from the same producer is committed only once.
Transactional API: a consumer reads from topic A, processes the event, and writes results to topic B — atomically, within a Kafka transaction. Either all three steps succeed (consume + process + produce) or none do.
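The idempotent-producer half of this can be modelled in a few lines. The class below is a deliberately simplified toy, not Kafka's broker logic: it tracks only the highest sequence number appended per producer for a single partition and drops anything at or below it, whereas a real broker also rejects gaps in the sequence and handles producer epochs.

```python
class PartitionLog:
    """Toy model of one partition that deduplicates retried produce requests."""

    def __init__(self):
        self.records = []
        self.last_seq = {}   # producer_id -> highest sequence number appended

    def append(self, producer_id: str, seq: int, value: str) -> bool:
        """Append `value` unless this (producer, sequence) was already written.
        Returns True if the record was appended, False if it was a duplicate."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True


if __name__ == "__main__":
    log = PartitionLog()
    log.append("producer-1", 0, "click-a")
    log.append("producer-1", 0, "click-a")   # retry after a lost acknowledgement
    log.append("producer-1", 1, "click-b")
    print(log.records)
```

The retry carries the same sequence number as the original attempt, so the log ends up with one copy even though the producer sent the record twice.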

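# Sketch in the style of the confluent-kafka Python client. `producer` (created
# with a transactional.id), `consumer`, and process() are assumed to exist.
producer.init_transactions()
producer.begin_transaction()
try:
    for record in consumer.consume(num_messages=100, timeout=1.0):
        result = process(record)
        producer.produce("output-topic", result)
    # Commit the consumed offsets inside the same transaction as the output.
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata())
    producer.commit_transaction()
except Exception:
    # Neither the output records nor the offsets become visible to readers.
    producer.abort_transaction()
    raise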

Kafka exactly-once is only within Kafka. If the consumer writes to a database (not another Kafka topic), exactly-once requires the database write and Kafka offset commit to be in the same transaction — which requires 2PC across Kafka and the database. This is either unsupported or impractical.

The practical approach for Kafka → Database: use at-least-once delivery with idempotent writes to the database (deduplication by event ID). The result is semantically exactly-once at the application level, without Kafka's transactional complexity.

Choosing delivery semantics

Can you tolerate message loss?
  Yes → At-most-once (fastest, simplest)

Can your consumers be idempotent?
  Yes → At-least-once with idempotent consumers (most common, practical)
  No → Must work toward exactly-once or idempotent redesign

Are all your writes Kafka-to-Kafka?
  Yes → Kafka exactly-once transactions
  No → At-least-once + idempotent writes to database

Tradeoffs

At-most-once: lowest latency and overhead. Wrong for any workload where losing events matters.

At-least-once: the practical default. Requires idempotent consumers, which is a good design principle regardless. Some messages are processed twice; with idempotent consumers the end result is unaffected.

Exactly-once: the highest overhead and complexity. Truly necessary only when the consumer cannot be made idempotent and duplicate processing has visible effects. Often "exactly-once at the application level through idempotency" is a better approach.

The one thing to remember

At-least-once delivery with idempotent consumers is the practical choice for almost all event-driven systems. Accept that messages will occasionally be delivered more than once; design consumers to handle duplicates gracefully (by event ID deduplication, INSERT OR IGNORE, or natural idempotency). Exactly-once is achievable within Kafka-to-Kafka pipelines but not across systems — for cross-system exactly-once semantics, idempotency at the application level is almost always the simpler and more robust approach.

← Previous: Three-Phase Commit — the protocol designed to eliminate 2PC's blocking problem, and why it's rarely used in practice despite solving it.

→ Next: Change Data Capture — streaming database changes in real time by reading the write-ahead log.

Three-Phase Commit: Solving 2PC's Blocking Problem

Cloud Tuned — Sat, 08 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design

#	Post	What it covers
00	Distributed Systems: What Happens When Machines Disagree	Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar.
01	Network Partitions: The Failure Mode You Can't Design Away	Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice.
02	Split-Brain: When Two Nodes Both Think They're the Leader	Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it.
03	Heartbeats: How Nodes Know Their Peers Are Alive	Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work.
04	Leader Election: Agreeing on Who's in Charge	Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve.
05	Consensus Algorithms: Agreeing on a Value Across Failures	Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used.
06	Quorum: How Many Nodes Must Agree?	Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB.
07	Paxos: The Algorithm That Started It All	Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production.
08	Raft: Consensus Made Understandable	Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV.
09	Gossip Protocol: Decentralised Cluster Communication	Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production.
10	Logical Clocks: When Physical Time Isn't Enough	Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works.
11	Lamport Timestamps: Ordering Events Without a Global Clock	Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you.
12	Vector Clocks: Knowing When Events Are Truly Concurrent	Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations.
13	Distributed Transactions: When One Machine Isn't Enough	Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative.
14	Two-Phase Commit: Coordinating a Distributed Decision	2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems.
15	Three-Phase Commit: Solving 2PC's Blocking Problem ← you are here	3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production.
16	Delivery Semantics: What Does "Delivered" Actually Mean?	Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate.
17	Change Data Capture: Streaming Your Database in Real Time	CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it.
18	Erasure Coding: Fault Tolerance Without Full Replication	Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication.
19	Merkle Trees: Efficiently Finding What's Different	Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy.
20	Observability: Understanding Your System at Runtime	Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together.
21	Distributed Systems: Wrap-Up	A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series.

The problem

Two-Phase Commit blocks when the coordinator crashes after participants vote Yes but before sending the commit message. Participants hold locks indefinitely, waiting for a coordinator that might take minutes to restart.

The root cause: in 2PC, once a participant votes Yes, it has no way to make a safe independent decision. It doesn't know whether the coordinator committed or aborted — and different independent decisions by different participants would violate atomicity.

Three-Phase Commit (3PC) adds an intermediate phase that gives participants enough information to make a safe independent decision if the coordinator fails — eliminating the blocking problem.

The core idea

3PC adds a pre-commit phase between prepare and commit. After collecting all Yes votes, the coordinator sends a Pre-Commit message before the final Commit. This gives participants two crucial properties: (1) they know all other participants have voted Yes, and (2) if the coordinator fails, they can complete the commit independently without risking disagreement, provided nodes fail only by crashing and the network does not partition.

The analogy: the wedding with a dress rehearsal

2PC: the officiant asks "do you?" privately, collects answers, then announces the result. Problem: if the officiant collapses mid-announcement, parties don't know if they're married.

3PC: adds a rehearsal. Before the public ceremony, the officiant gathers both parties and says "I've confirmed you both said yes, and we're about to proceed — are you both still ready?" Only if both confirm does the final ceremony happen. Now if the officiant collapses mid-ceremony, either party knows: (1) we both confirmed readiness, (2) we can safely complete the ceremony ourselves.

The pre-commit is the rehearsal confirmation.

How 3PC works

Phase 1: Prepare (same as 2PC)

Coordinator sends Prepare to all participants. Each votes Yes or No.

If any votes No → Coordinator sends Abort to all (same as 2PC).

Phase 2: Pre-Commit (new phase)

If all voted Yes → Coordinator sends Pre-Commit to all participants.

Each participant that receives Pre-Commit:

Records it durably (survives crash)
Acknowledges to coordinator
Now knows: all participants voted Yes

The key property: a participant in Pre-Commit state knows it's safe to commit if the coordinator fails — because it knows every other participant also voted Yes.

Phase 3: Commit

Coordinator receives acknowledgements from all participants → sends Commit.

Each participant commits and releases locks.
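
The three phases can be sketched as an in-memory simulation. This is a failure-free walk-through under simplifying assumptions: there is no network, no durable log, and no timeout handling, and the Participant class and run_3pc function are hypothetical names.

```python
from enum import Enum


class State(Enum):
    INIT = "init"
    PREPARED = "prepared"
    PRE_COMMITTED = "pre-committed"
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = State.INIT

    def on_prepare(self) -> bool:
        """Phase 1: vote Yes (move to PREPARED) or vote No (abort locally)."""
        self.state = State.PREPARED if self.can_commit else State.ABORTED
        return self.can_commit

    def on_pre_commit(self) -> None:
        """Phase 2: learn that every participant voted Yes."""
        self.state = State.PRE_COMMITTED

    def on_commit(self) -> None:
        """Phase 3: commit and release locks."""
        self.state = State.COMMITTED

    def on_abort(self) -> None:
        self.state = State.ABORTED


def run_3pc(participants: list[Participant]) -> State:
    """Drive all three phases with no failures and return the outcome."""
    votes = [p.on_prepare() for p in participants]
    if not all(votes):
        for p in participants:
            p.on_abort()
        return State.ABORTED
    for p in participants:   # Phase 2: Pre-Commit to everyone
        p.on_pre_commit()
    for p in participants:   # Phase 3: Commit after all acknowledgements
        p.on_commit()
    return State.COMMITTED


all_yes = [Participant("A"), Participant("B"), Participant("C")]
outcome_all_yes = run_3pc(all_yes)

one_no = [Participant("A"), Participant("B", can_commit=False)]
outcome_one_no = run_3pc(one_no)
```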

Handling coordinator failure

Coordinator fails in Phase 1 (before Pre-Commit): participants that haven't received Pre-Commit can safely abort. If no participant has received Pre-Commit, the coordinator cannot have sent Commit to anyone, so nothing has been committed and aborting is safe.

Coordinator fails in Phase 2 (after Pre-Commit to some, before all):

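Participants that received Pre-Commit know all voted Yes → they elect a new coordinator from among themselves and complete the commit
Participants that didn't receive Pre-Commit time out and abort
These two decisions must agree, and they only do if the new coordinator can reach every participant and bring it to Pre-Commit before committing. If a network partition cut some participants off from the Pre-Commit messages, the two groups decide differently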

This is where 3PC's limitation emerges.

The key limitation: network partitions

3PC assumes fail-stop failures (nodes crash cleanly) and no network partitions. In reality, network partitions are common (post 01). A partition can cause 3PC to violate atomicity:

Coordinator sends Pre-Commit to A and B, but not C (partition)
Coordinator then crashes

A and B elect a new coordinator among themselves
Both are in Pre-Commit state → they commit
C times out in Prepared state → aborts

Result: A and B committed; C aborted → INCONSISTENT
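
The scenario above can be replayed in a few lines. This is a sketch of the timeout rules only, assuming each participant decides alone once the coordinator goes silent; the on_timeout helper and the string state names are hypothetical.

```python
def on_timeout(state: str) -> str:
    """Decision a participant makes alone when the coordinator goes silent."""
    if state == "pre-committed":
        return "committed"  # it knows every participant voted Yes
    if state == "prepared":
        return "aborted"    # it never heard Pre-Commit, so it backs out
    return state


# Phase 1 finished: all three participants voted Yes.
states = {"A": "prepared", "B": "prepared", "C": "prepared"}

# Phase 2: the partition drops the Pre-Commit addressed to C.
for name in ("A", "B"):
    states[name] = "pre-committed"

# The coordinator crashes; every participant eventually times out.
final = {name: on_timeout(state) for name, state in states.items()}
atomic = len(set(final.values())) == 1
```

Here final maps A and B to committed and C to aborted, so atomic is False: the transaction has been applied on two nodes and rolled back on the third.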

Under a network partition, 3PC is unsafe: participants that received Pre-Commit commit, while participants that didn't abort — and they're making these independent decisions without knowing about each other's state.

This is why 3PC is almost never used in practice. The real world has network partitions. Under partitions, 3PC provides atomicity guarantees that are weaker than 2PC's (which at least blocks rather than making inconsistent decisions).

Why 3PC is rarely used

Network partitions make it unsafe. As described above — partition during Phase 2 can cause inconsistent commit/abort decisions.

Paxos-based commit is better. A coordinator that uses consensus (Paxos/Raft) to durably record its decision before sending it to participants solves the blocking problem correctly, even under partitions. This is what CockroachDB and Spanner do — they don't implement 3PC.

Saga patterns avoid the problem entirely. Most microservices don't need distributed atomicity at all if the system is designed with Sagas.

The one thing to remember

3PC eliminates 2PC's blocking problem by adding a pre-commit phase that tells participants "all have voted yes" before the final commit — allowing them to complete the commit independently if the coordinator fails. The protocol works under crash failures, but breaks under network partitions (the more realistic failure mode), making inconsistent commit/abort decisions when Pre-Commit is delivered to some participants but not others. This is why 3PC is a useful theoretical concept but is almost never implemented in production systems, which use consensus-based commit (CockroachDB, Spanner) or Saga patterns instead.

← Previous: Two-Phase Commit — the classical distributed transaction protocol in detail.

→ Next: Delivery Semantics — at-most-once, at-least-once, and exactly-once: what each guarantees and what it costs.

Two-Phase Commit: Coordinating a Distributed Decision

Cloud Tuned — Fri, 07 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

The problem

Three database nodes must atomically commit a transaction: either all three commit or all three abort. The coordinator must make this decision and ensure all participants do the same, even if some messages are delayed and any node can fail at any moment.

Sending "commit" to all three simultaneously doesn't work — if the message reaches two nodes but is lost to the third (network failure), two commit and one doesn't. The transaction is inconsistent.

You need a protocol that guarantees: if even one participant cannot commit (for any reason), none commit.

The core idea

Two-Phase Commit (2PC) coordinates a distributed transaction in two phases: prepare (each participant votes yes or no), then commit or abort (the coordinator sends one decision to every participant). Only after receiving a "yes" from every participant does the coordinator send "commit." Any "no" triggers an "abort" to all.

The analogy: a wedding ceremony with a final question

A wedding officiant (coordinator) asks each party: "Do you take this person to be your lawfully wedded spouse?" Each must say "yes" before the officiant declares them married.

Phase 1: The officiant asks each party privately whether they're sure. Each commits (in their heart) to saying yes, but says nothing final yet — they're in a "prepared" state.

Phase 2: If both said yes, the officiant declares them married (commit). If either hesitates or says no, the ceremony is called off (abort).

If the officiant collapses mid-ceremony after both parties have said "yes" but before the declaration — both parties are stuck in "prepared" state, unsure whether the marriage is legal. This is the 2PC blocking problem.

How 2PC works

Phase 1: Prepare

Coordinator → Prepare → Participant A
Coordinator → Prepare → Participant B
Coordinator → Prepare → Participant C

Each participant:
  - Evaluates whether it can commit (checks constraints, acquires row locks)
  - Writes a "prepared" record to its WAL (durable — will survive crash)
  - Sends Vote-Yes or Vote-No to coordinator

If Vote-Yes: participant is committed to committing if coordinator says so
             It will not abort unilaterally
If Vote-No:  participant cannot commit (constraint violation, lock conflict)

Phase 2: Commit or Abort

If all votes are Yes:
  Coordinator writes "commit" to its log (durable)
  Coordinator → Commit → all participants
  Each participant commits and releases locks

If any vote is No:
  Coordinator → Abort → all participants
  Each participant rolls back and releases locks
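
The two phases can be sketched as a failure-free in-memory simulation. Plain Python lists stand in for the durable logs, and the Participant class and two_phase_commit function are hypothetical names. Logging an abort decision as well as a commit decision is an assumption of this sketch.

```python
class Participant:
    """One resource manager in the transaction (illustrative only)."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.wal: list[str] = []  # stands in for the durable write-ahead log
        self.state = "init"

    def prepare(self) -> bool:
        """Phase 1: vote. A Yes vote is logged before it is sent."""
        if not self.can_commit:
            self.state = "aborted"
            return False
        self.wal.append("prepared")
        self.state = "prepared"
        return True

    def commit(self) -> None:
        self.wal.append("commit")
        self.state = "committed"

    def abort(self) -> None:
        self.wal.append("abort")
        self.state = "aborted"


def two_phase_commit(participants: list[Participant], log: list[str]) -> str:
    """Run both phases with no failures and return the coordinator's decision."""
    votes = [p.prepare() for p in participants]  # Phase 1: collect every vote
    decision = "commit" if all(votes) else "abort"
    log.append(decision)  # record the decision before sending it
    for p in participants:  # Phase 2: same decision to everyone
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision


happy = [Participant("A"), Participant("B"), Participant("C")]
happy_decision = two_phase_commit(happy, log=[])

unhappy = [Participant("A"), Participant("B", can_commit=False), Participant("C")]
unhappy_decision = two_phase_commit(unhappy, log=[])
```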

Why it's safe

Once a participant votes Yes, it is obligated to commit if the coordinator says commit — it cannot independently decide to abort. This obligation is what makes atomicity possible: the coordinator makes exactly one decision, and all prepared participants follow it.

The blocking problem

2PC has one critical flaw: it blocks if the coordinator fails after participants vote Yes but before the commit message is sent.

Participants A, B, C all vote Yes
Coordinator writes "commit" to log
Coordinator CRASHES before sending commit to anyone

Status:
  A, B, C: each holds locks, waiting for coordinator
  None knows whether to commit or abort
  They cannot make a safe independent decision:
    - Committing unilaterally might contradict a decision the coordinator
      will make to abort when it recovers
    - Aborting unilaterally might contradict a commit decision
  
  They must wait for coordinator to recover.
  Locks held. Blocking.

In practice, the coordinator typically recovers in seconds to minutes. But during that window, all locked rows are unavailable. For a busy OLTP database, this is a serious availability problem.

Mitigation: the coordinator writes its decision to durable storage before sending messages. On recovery, it can resume from its log and send the correct message to all participants. This reduces the blocking window to the coordinator's recovery time — but doesn't eliminate blocking entirely.

Coordinator failure edge cases

Coordinator fails before sending Prepare: the coordinator can abort the transaction on recovery, because no participant has voted and nothing was committed.

Coordinator fails after prepare but before commit: participants are stuck in "prepared" state holding locks until coordinator recovers. This is the blocking case.

Coordinator fails after sending some commits: some participants committed. On recovery, coordinator resends commit to the participants that hadn't acknowledged. All eventually commit — consistency is restored.
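
The recovery behaviour in these cases can be sketched as a function that replays the coordinator's log after a restart. It assumes that a missing decision record means abort (often called presumed abort), which is an assumption of this sketch. The recover function and the state names are hypothetical.

```python
def recover(
    coordinator_log: list[str], participant_states: dict[str, str]
) -> dict[str, str]:
    """Replay the coordinator's log after a restart and finish the transaction."""
    # No logged decision means no Commit was ever sent, so abort is safe.
    decision = "commit" if "commit" in coordinator_log else "abort"
    final_state = "committed" if decision == "commit" else "aborted"
    recovered = {}
    for name, state in participant_states.items():
        if state in ("init", "prepared"):
            recovered[name] = final_state  # (re)send the decision
        else:
            recovered[name] = state  # already finished, nothing to resend
    return recovered


# Crash after logging "commit" but before sending it to anyone (the blocking case).
case_blocked = recover(["commit"], {"A": "prepared", "B": "prepared", "C": "prepared"})

# Crash after Commit reached only A.
case_partial = recover(["commit"], {"A": "committed", "B": "prepared", "C": "prepared"})

# Crash before any decision was logged.
case_undecided = recover([], {"A": "prepared", "B": "init", "C": "prepared"})
```

In the first two cases every participant ends up committed; in the third every participant ends up aborted. In all three, the participants stay blocked until recover runs, which is the window the blocking problem describes.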

Where 2PC is used

XA transactions (MySQL, Oracle, and PostgreSQL through PREPARE TRANSACTION): the databases' built-in support for taking part in distributed transactions. Used in some enterprise systems where cross-database atomicity is required.

JTA (Java Transaction API): enterprise Java applications managing transactions across databases and message brokers.

CockroachDB / Spanner internals: these use a variant of 2PC, but with consensus replication (Raft in CockroachDB, Paxos in Spanner) providing the coordinator's durability guarantee, eliminating the blocking problem.

Avoided in microservices: the lock contention, blocking on coordinator failure, and latency make 2PC unsuitable for most microservices designs. The Saga pattern (Pillar 7, post 09) is almost always preferred.

The one thing to remember

2PC guarantees atomicity across distributed participants through a prepare-then-commit protocol: all must vote yes before any commit. The fundamental weakness is blocking: if the coordinator crashes after participants vote yes but before the commit message is sent, all participants hold locks and wait. This blocking behaviour — not correctness — is what makes 2PC impractical for most microservices designs, and why Saga patterns and distributed databases with native consensus (CockroachDB, Spanner) are preferred.

← Previous: Distributed Transactions — ensuring atomicity across multiple nodes or services when a single ACID transaction isn't possible.

→ Next: Three-Phase Commit — the protocol designed to eliminate 2PC's blocking problem, and why it's rarely used in practice despite solving it.

Distributed Transactions: When One Machine Isn't Enough

Cloud Tuned — Thu, 06 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

The problem

A user upgrades their URL shortener plan from Free to Pro. This must atomically:

Charge their credit card (Stripe API)
Update their plan in the User Service database
Increase their link quota in the Link Service database

In a single-database monolith, this is one transaction — either all three succeed or all three roll back. The database guarantees atomicity.

In a microservices architecture with three separate systems (Stripe, User DB, Link DB), there is no single transaction manager. If step 2 succeeds and step 3 fails, the user has paid for Pro but their quota wasn't increased. If step 1 succeeds and step 2 fails, the user was charged but their account shows Free tier. Both are bad.

This is the distributed transaction problem: ensuring atomicity (all-or-nothing) across multiple independent data stores or services.

The core idea

A distributed transaction provides ACID guarantees across multiple nodes or services. It's much harder than a local transaction because it requires coordinating the commit or abort decision across participants that can each independently fail or become unreachable. The fundamental difficulty: how do you commit atomically when the network might drop your commit message?

The analogy: signing a contract with multiple parties

A business deal requires signatures from Company A, Company B, and a lawyer as witness. The rule: the deal is finalised only when all three sign; if any refuses or is unreachable, the deal is null and void.

In person, this is easy — everyone signs in the same room simultaneously. Over distance, it's hard: Company A signs and sends to Company B, which signs and sends to the lawyer. What if Company B's courier is delayed? Company A has signed; Company B hasn't. The deal is in limbo. Who decides whether to proceed?

A distributed transaction coordinator is the resolution: it collects "ready to sign" confirmations from all parties before telling anyone to finalise.

When distributed transactions are needed vs avoidable

Genuinely needed:

Financial operations that must be consistent across multiple stores (charge + grant access must be atomic)
Cross-service invariants that must be maintained (inventory count in one service and order record in another)
Database migrations that span shards (moving a record from shard A to shard B must not leave it in both or neither)

Avoidable with eventual consistency:

Most microservices communication (Saga pattern + Outbox — Pillar 7, posts 09–10)
Analytics, notifications, search index updates
Any operation where "we'll fix inconsistency later" is acceptable

The key insight: most operations that seem to require distributed transactions actually don't, if you're willing to accept brief eventual inconsistency and design idempotent recovery. The Saga pattern (Pillar 7, post 09) handles most cross-service coordination without distributed transactions.

The approaches

2PC (Two-Phase Commit)

The classical protocol. A coordinator node:

Prepare phase: asks all participants "can you commit?" Each votes yes or no and locks its resources
Commit phase: if all vote yes, coordinator sends commit to all; if any votes no, sends abort to all

Guarantees atomicity. Problems: blocking on coordinator failure (covered in post 14), lock contention (participants hold locks from the prepare phase until the coordinator's decision arrives).

Saga pattern (Pillar 7, post 09)

Not a true distributed transaction — uses local transactions with compensating actions. Eventual consistency: the system may be inconsistent for a brief period. The right choice for most microservices cross-service operations.
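
A minimal sketch of the idea, applied to the plan-upgrade example from the start of this post. The three services are replaced by an in-memory dict, the step functions are hypothetical, and the third step is made to fail on purpose so the compensations run.

```python
from typing import Callable

# (step name, action, compensating action)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    journal: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _compensate = step
        try:
            action()
        except Exception:
            journal.append(f"{name}: failed")
            for done_name, _action, compensate in reversed(completed):
                compensate()
                journal.append(f"{done_name}: compensated")
            return journal
        journal.append(f"{name}: done")
        completed.append(step)
    return journal


account = {"charged": False, "plan": "free"}


def charge_card() -> None:
    account["charged"] = True


def refund_card() -> None:
    account["charged"] = False


def set_plan_pro() -> None:
    account["plan"] = "pro"


def set_plan_free() -> None:
    account["plan"] = "free"


def increase_quota() -> None:
    raise RuntimeError("Link Service unavailable")  # simulated failure


def nothing_to_undo() -> None:
    pass


journal = run_saga([
    ("charge card", charge_card, refund_card),
    ("update plan", set_plan_pro, set_plan_free),
    ("increase quota", increase_quota, nothing_to_undo),
])
```

Between the charge and the refund the account is briefly inconsistent, which is the eventual-consistency window the pattern accepts. A production saga would also persist the journal and make each compensation idempotent so it can be retried.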

Distributed databases with native support (CockroachDB, Spanner)

CockroachDB and Google Spanner provide distributed ACID transactions natively. They use consensus (Raft/Paxos) to ensure all shards agree on the transaction outcome. This is the correct tool when you need true distributed atomicity without implementing it yourself.

CockroachDB executes transactions across multiple shards using a protocol similar to 2PC, but with Raft providing the durability guarantee at each shard — eliminating the "coordinator crashes, transaction is stuck" problem of classic 2PC.

XA transactions (distributed database standard)

XA is a standard for distributed transactions used by traditional enterprise databases. Supported by MySQL, PostgreSQL, Oracle. Allows transactions that span multiple database servers with a two-phase commit.

In practice, XA is rarely used in modern microservices — the coordination overhead is high and it requires all participants to be XA-aware. Saga patterns are almost universally preferred.

The real cost of distributed transactions

Lock contention. In 2PC, all participants hold locks on affected rows from the prepare phase until the commit message arrives. If the coordinator is slow or crashes, locks are held indefinitely, blocking conflicting writes (and, depending on the isolation level, reads) to those rows.

Coordinator availability. The transaction coordinator is a single point of failure. If it crashes after participants voted "yes" but before sending commit, the transaction is stuck in "prepared" state until the coordinator recovers. This is the 2PC blocking problem (covered in post 14).

Latency. At minimum, a distributed transaction requires two round trips (prepare + commit) between the coordinator and all participants. For cross-region transactions (London, Singapore, São Paulo), this is 2 × cross-continent RTT = 300–600ms per transaction, assuming an RTT of 150–300ms.
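
The lower bound can be worked out directly: each of the two round trips is gated by the slowest participant, because the coordinator waits for every reply. The RTT values below are illustrative placeholders rather than measurements, and the function name is hypothetical.

```python
def min_commit_latency_ms(rtt_ms_by_participant: dict[str, float]) -> float:
    """Lower bound on 2PC latency: two round trips, each gated by the slowest link."""
    round_trips = 2  # prepare + commit
    return round_trips * max(rtt_ms_by_participant.values())


# Illustrative RTTs from a coordinator in London; not measurements.
best_case = min_commit_latency_ms({"singapore": 150.0, "sao_paulo": 150.0})
worst_case = min_commit_latency_ms({"singapore": 300.0, "sao_paulo": 250.0})
```

Processing time, log flushes, and lock waits come on top of this bound, so real transactions are slower.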

Operational complexity. Stuck transactions ("in-doubt" transactions) require manual intervention or automated recovery. Stuck 2PC transactions that have to be resolved by hand are a familiar problem for DBAs who run distributed databases.

The one thing to remember

Distributed transactions are expensive and fragile: they require coordination across multiple independent systems, hold locks during the coordination period, and block if the coordinator fails. Most cross-service operations that appear to need distributed transactions can be handled with the Saga pattern (eventual consistency with compensating actions) at lower cost and higher availability. Reserve distributed transactions for cases where eventual consistency genuinely isn't acceptable — and prefer databases (CockroachDB, Spanner) that implement them natively over rolling your own 2PC coordinator.

← Previous: Vector Clocks — extending Lamport timestamps to detect concurrency: when two events are causally independent, vector clocks make that explicit.

→ Next: Two-Phase Commit — the classical distributed transaction protocol in detail.

Vector Clocks: Knowing When Events Are Truly Concurrent

Cloud Tuned — Wed, 05 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design

#	Post	What it covers
00	Distributed Systems: What Happens When Machines Disagree	Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar.
01	Network Partitions: The Failure Mode You Can't Design Away	Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice.
02	Split-Brain: When Two Nodes Both Think They're the Leader	Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it.
03	Heartbeats: How Nodes Know Their Peers Are Alive	Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work.
04	Leader Election: Agreeing on Who's in Charge	Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve.
05	Consensus Algorithms: Agreeing on a Value Across Failures	Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used.
06	Quorum: How Many Nodes Must Agree?	Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB.
07	Paxos: The Algorithm That Started It All	Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production.
08	Raft: Consensus Made Understandable	Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV.
09	Gossip Protocol: Decentralised Cluster Communication	Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production.
10	Logical Clocks: When Physical Time Isn't Enough	Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works.
11	Lamport Timestamps: Ordering Events Without a Global Clock	Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you.
12	Vector Clocks: Knowing When Events Are Truly Concurrent ← you are here	Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations.
13	Distributed Transactions: When One Machine Isn't Enough	Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative.
14	Two-Phase Commit: Coordinating a Distributed Decision	2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems.
15	Three-Phase Commit: Solving 2PC's Blocking Problem	3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production.
16	Delivery Semantics: What Does "Delivered" Actually Mean?	Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate.
17	Change Data Capture: Streaming Your Database in Real Time	CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it.
18	Erasure Coding: Fault Tolerance Without Full Replication	Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication.
19	Merkle Trees: Efficiently Finding What's Different	Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy.
20	Observability: Understanding Your System at Runtime	Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together.
21	Distributed Systems: Wrap-Up	A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series.


The problem

Lamport timestamps tell you: if A → B, then ts(A) < ts(B). What they don't tell you: if ts(A) < ts(B), does that mean A caused B, or were they concurrent?

Consider two users simultaneously editing the same link's destination URL on different servers. Lamport timestamps will order them (one will have a lower timestamp than the other), but this ordering is arbitrary — neither write caused the other. They're concurrent. Silently picking the "higher timestamp" write and discarding the other loses data without the application knowing a conflict occurred.

Vector clocks solve this: they can distinguish between "A happened-before B" and "A and B are concurrent and both happened independently." This distinction is the basis of conflict detection in distributed databases.

The core idea

A vector clock is a list of counters, one per process, that tracks each process's knowledge of every other process's logical time. Comparing two vector clocks tells you definitively: A happened-before B, B happened-before A, or A and B are concurrent.

The analogy: tracking who heard what from whom

Three gossips — Alice, Bob, and Carol — each keep a tally of how many things each person has told them:

Alice's tally: [Alice: 3, Bob: 2, Carol: 1] — "I've made 3 statements, heard 2 from Bob, 1 from Carol"
Bob's tally: [Alice: 2, Bob: 4, Carol: 2]

When Alice shares her tally with Bob, Bob updates each entry to the maximum of their two tallies. Bob now knows everything Alice knows.

If Bob's tally for a dimension is higher than Alice's for the same dimension, Bob has information Alice doesn't. If each of them is higher in at least one dimension, their states are incomparable: they've heard different things, and each has newer information in some respect. That's concurrency.

How vector clocks work

Each process i maintains a vector V of length N (one entry per process). V[j] = the number of events process i knows about from process j.

On a local event at process i: V[i] += 1

On sending a message from process i: V[i] += 1, send message with current V

On receiving a message at process i with vector W: V[j] = max(V[j], W[j]) for all j, then V[i] += 1
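The three update rules translate almost directly into code. The sketch below assumes a fixed group of `n` processes identified by index; the class name `VectorClock` and its method names are illustrative choices, not taken from any particular library.

```python
from typing import List


class VectorClock:
    """Vector clock for process `pid` in a fixed group of `n` processes."""

    def __init__(self, pid: int, n: int) -> None:
        self.pid = pid
        self.v: List[int] = [0] * n

    def local_event(self) -> List[int]:
        self.v[self.pid] += 1
        return list(self.v)

    def send(self) -> List[int]:
        # Sending counts as an event; the returned copy travels with the message.
        self.v[self.pid] += 1
        return list(self.v)

    def receive(self, w: List[int]) -> List[int]:
        # Element-wise max merges the sender's knowledge, then count the receive.
        self.v = [max(a, b) for a, b in zip(self.v, w)]
        self.v[self.pid] += 1
        return list(self.v)


a, b = VectorClock(0, 3), VectorClock(1, 3)
msg = a.send()       # [1, 0, 0]
b.receive(msg)       # [1, 1, 0]
b.local_event()      # [1, 2, 0]
a.local_event()      # [2, 0, 0]
print(a.v, b.v)      # [2, 0, 0] [1, 2, 0]
```

Returning copies rather than the internal list keeps a message's clock from changing after it has been sent.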

Comparing vector clocks

Vector clock A happened-before B (A → B) iff: A[i] ≤ B[i] for all i, and A[j] < B[j] for at least one j.

Vectors A and B are concurrent (A ‖ B) iff: neither A → B nor B → A. This means A has a higher counter in some dimension and B has a higher counter in another.

Process A    Process B    Process C
V_A=[0,0,0]  V_B=[0,0,0]  V_C=[0,0,0]

A sends to B (the send is A's event): V_A=[1,0,0]
Message carries V=[1,0,0]

B receives:  V_B = max([0,0,0],[1,0,0]) = [1,0,0], then V_B[1] += 1 → [1,1,0]
B has event: V_B=[1,2,0]

A has event (concurrent, no communication): V_A=[2,0,0]

Now compare V_A=[2,0,0] and V_B=[1,2,0]:
  V_A[0]=2 > V_B[0]=1 → A has something B doesn't
  V_A[1]=0 < V_B[1]=2 → B has something A doesn't
→ CONCURRENT: neither happened-before the other ✓
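The comparison rule can be written as a short function. This is a sketch that assumes both vectors have the same length and the same process ordering; `happened_before` and `compare` are illustrative names.

```python
from typing import Sequence


def happened_before(a: Sequence[int], b: Sequence[int]) -> bool:
    """True iff a <= b in every entry and a < b in at least one."""
    pairs = list(zip(a, b))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def compare(a: Sequence[int], b: Sequence[int]) -> str:
    if happened_before(a, b):
        return "before"
    if happened_before(b, a):
        return "after"
    return "equal" if list(a) == list(b) else "concurrent"


print(compare([2, 0, 0], [1, 2, 0]))   # concurrent
print(compare([1, 0, 0], [1, 1, 0]))   # before
```

The first call uses the two final clocks from the trace; the second compares the message A sent with B's clock after receiving it.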

Application: Dynamo and conflict detection

Amazon Dynamo (and Riak, which is an open-source Dynamo implementation) uses version vectors (a variant of vector clocks) for conflict detection on writes.

When a client reads a key, Dynamo returns the value along with its version vector (a "context"). When the client writes back, it includes this context. Dynamo uses the context to determine if the write is a successor to the current value (no conflict) or concurrent with it (conflict).

Initial: key="x7Kp2", value="https://old.com", VC=[A:1, B:0]

Client 1 reads (gets VC=[A:1, B:0]), updates to "https://v2.com"
Writes with context VC=[A:1, B:0] → stored on Server A
  Server A: value="https://v2.com", VC=[A:2, B:0]

Client 2 (concurrently, read the old value) updates to "https://v3.com"
Writes with context VC=[A:1, B:0] → stored on Server B
  Server B: value="https://v3.com", VC=[A:1, B:1]

Reconciliation:
  [A:2, B:0] vs [A:1, B:1]: concurrent (the first has A:2 > A:1, the second has B:1 > B:0)
  → CONFLICT: surface both values to the application for resolution

In Dynamo, conflicting versions are returned to the next reader as a list ("siblings"). The application (or the client library) resolves the conflict and writes back a merged version.
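The reconciliation step can be sketched as follows. This is an illustration of the idea only, not Dynamo's or Riak's actual code: `bump`, `descends`, and `reconcile` are hypothetical helpers, and clocks are plain dictionaries keyed by server name.

```python
from typing import Dict, List, Tuple

Clock = Dict[str, int]
Version = Tuple[str, Clock]   # (value, version vector)


def bump(context: Clock, server: str) -> Clock:
    """Clock stored by `server` for a write made with `context`."""
    clock = dict(context)
    clock[server] = clock.get(server, 0) + 1
    return clock


def descends(a: Clock, b: Clock) -> bool:
    """True if clock a has seen everything clock b has (a >= b entry-wise)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions: List[Version]) -> List[Version]:
    """Drop versions strictly dominated by another; the rest are siblings."""
    siblings = []
    for i, (value, clock) in enumerate(versions):
        dominated = any(
            descends(other, clock) and other != clock
            for j, (_, other) in enumerate(versions)
            if j != i
        )
        if not dominated:
            siblings.append((value, clock))
    return siblings


ctx = {"A": 1, "B": 0}
on_a = ("https://v2.com", bump(ctx, "A"))   # {"A": 2, "B": 0}
on_b = ("https://v3.com", bump(ctx, "B"))   # {"A": 1, "B": 1}
siblings = reconcile([("https://old.com", ctx), on_a, on_b])
print([value for value, _ in siblings])
```

The old value is dropped because both new clocks descend from it, while the two new writes survive as siblings because neither clock descends from the other.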

The vector clock size problem

A vector clock has one entry per process. With 100 processes, each vector clock is 100 entries. At scale, this becomes expensive to store and transmit.

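The original Dynamo design bounds this by truncation: each (node, counter) pair carries a timestamp, and once the clock exceeds a size threshold the oldest pair is dropped. Riak keeps clocks small by tracking server (vnode) IDs rather than client IDs, and newer versions use dotted version vectors so that concurrent client writes handled by the same server are still detected as conflicts.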

Vector clocks vs Lamport timestamps

	Lamport Timestamps	Vector Clocks
Preserves causal order (A → B)	Yes	Yes
Detects concurrency	No	Yes
Size	1 integer	N integers (one per process)
Use case	Total order, simple causality tracking	Conflict detection, concurrent write detection

The one thing to remember

Vector clocks track per-process knowledge so that comparing two events tells you definitively whether one happened-before the other or they're concurrent. A happened-before B means A's every counter ≤ B's corresponding counter (and at least one is strictly less). Concurrent means each has a higher counter in at least one dimension — neither subsumes the other. This is the mechanism behind conflict detection in Dynamo and Riak: concurrent writes are surfaced to the application for resolution, not silently discarded.

← Previous: Lamport Timestamps — the simplest logical clock: assigning monotonic integers to events to capture causal order.

→ Next: Distributed Transactions — ensuring atomicity across multiple nodes or services when a single ACID transaction isn't possible.

Lamport Timestamps: Ordering Events Without a Global Clock

Cloud Tuned — Tue, 04 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design


The problem

Three services — Link Service, Analytics Service, and Billing Service — exchange messages. We need to determine the order in which events occurred to reconstruct causality during an incident.

Physical timestamps are unreliable (clocks drift). We need a logical clock: a mechanism that assigns order numbers to events such that if event A caused event B, A's number is less than B's.

The core idea

A Lamport timestamp is a monotonically increasing integer counter maintained per process. The rule: before any event, increment the counter. When sending a message, include the current counter. When receiving a message, take the max of the local counter and the received counter, then increment.

This ensures: if A happened-before B (A → B), then timestamp(A) < timestamp(B). The converse is not guaranteed: timestamp(A) < timestamp(B) does not imply A → B — it might just mean A happened on a process with a lower counter.

The analogy: version numbers in a document history

A shared document has a version counter. Every edit increments the version. When two users collaborate, the server takes the higher version and increments it — ensuring every merged version is higher than either contributor's version.

If version 47 preceded version 48, version 47's changes are causally "before" version 48. If two separate branches both produced version 47 independently, they're concurrent — the counter alone can't tell you this.

The algorithm

Each process maintains a local counter C.

On any local event: C = C + 1

On sending a message: C = C + 1, send message with timestamp C

On receiving a message with timestamp T: C = max(C, T) + 1

Process A (C_A):       Process B (C_B):       Process C (C_C):
C_A = 0                C_B = 0                C_C = 0

Event a1 (send to B): C_A = 1
Msg to B carries ts=1 →
                       Receive msg (ts=1):
                       C_B = max(0,1)+1 = 2
                       Event b1 (send to C): C_B = 3
                       Msg to C carries ts=3 →
                                              Receive msg (ts=3):
                                              C_C = max(0,3)+1 = 4
                                              Event c1: C_C = 5

Event a2: C_A = 2
(no message, concurrent with b1)

From this, we can order: a1(1) < b1(3) < c1(5). The total order also places a2(2) between a1 and b1, even though a2 is concurrent with b1 and has no causal relationship to it.
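The same trace can be replayed in code. This is a minimal sketch in which each send is itself the event that increments the counter; `LamportClock`, `tick`, and `receive` are illustrative names.

```python
class LamportClock:
    def __init__(self) -> None:
        self.c = 0

    def tick(self) -> int:
        """Local event or send: increment and return the timestamp to attach."""
        self.c += 1
        return self.c

    def receive(self, t: int) -> int:
        """Take the max of local and received counters, then increment."""
        self.c = max(self.c, t) + 1
        return self.c


A, B, C = LamportClock(), LamportClock(), LamportClock()
a1 = A.tick()            # 1, sent to B
b_recv = B.receive(a1)   # 2
b1 = B.tick()            # 3, sent to C
c_recv = C.receive(b1)   # 4
c1 = C.tick()            # 5
a2 = A.tick()            # 2, no communication with B or C
print(a1, b1, c1, a2)    # 1 3 5 2
```

Note that a2 and B's receive event both get timestamp 2, which is why a tie-breaker is needed to turn timestamps into a total order.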

What Lamport timestamps guarantee

If A → B, then ts(A) < ts(B). Always true. This is the useful property: causal ordering is preserved.

If ts(A) < ts(B), then A → B. Not guaranteed. A lower timestamp might just come from a different process, and two concurrent events on different processes can have any relationship between their timestamps. The only safe conclusion is that B did not happen-before A.

Total order from Lamport timestamps

Lamport timestamps create a total order: any two events can be compared (ts(A) < ts(B) or ts(A) > ts(B) or ts(A) = ts(B) → break ties by process ID). This total order is consistent with the causal partial order (no causal predecessor has a higher timestamp), but it adds arbitrary ordering for concurrent events.

This total order is used by:

Distributed mutex algorithms: agree on the order in which processes acquire a lock using Lamport timestamps
Event log ordering in debugging: total-order all events across services for incident reconstruction
Serialisable transaction ordering: CockroachDB uses Hybrid Logical Clocks (HLC) — a Lamport-like structure that combines logical and physical time — for transaction ordering
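The tie-breaking rule amounts to sorting on a (timestamp, process ID) pair. The sketch below uses events from the earlier trace; the tuple layout and the name `b_recv` for B's receive event are assumptions made for illustration.

```python
# (event name, Lamport timestamp, process ID)
events = [
    ("b1", 3, "B"),
    ("a2", 2, "A"),
    ("c1", 5, "C"),
    ("b_recv", 2, "B"),
    ("a1", 1, "A"),
]

# Sort by timestamp first, then by process ID to break ties deterministically.
total_order = [name for name, ts, pid in sorted(events, key=lambda e: (e[1], e[2]))]
print(total_order)   # ['a1', 'a2', 'b_recv', 'b1', 'c1']
```

Placing a2 before b_recv is an artefact of the tie-breaker: the two events are concurrent, and ordering process IDs the other way would reverse them without violating causality.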

Limitations

Can't detect concurrency. The most significant limitation: ts(A) < ts(B) doesn't tell you whether A happened-before B or they were concurrent. To detect concurrency, you need vector clocks (post 12).

Not useful for conflict detection. If two writes to the same key have different Lamport timestamps, you can order them — but you can't tell if they were concurrent (which would indicate a conflict). Last-write-wins based on Lamport timestamps silently discards one concurrent write without flagging it as a conflict.

The one thing to remember

Lamport timestamps give a simple rule: if event A causally preceded event B, A's timestamp is strictly less than B's. They create a total order consistent with causality, useful for ordering events across services without a global clock. The limitation: they can't detect whether two events with different timestamps are causally related or merely concurrent. For conflict detection, you need vector clocks.

← Previous: Logical Clocks — in distributed systems, physical clocks can't be trusted; logical clocks capture causal relationships between events instead.

→ Next: Vector Clocks — extending Lamport timestamps to detect concurrency: when two events are causally independent, vector clocks make that explicit.

Logical Clocks: When Physical Time Isn't Enough

Cloud Tuned — Mon, 03 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design


The problem

Your URL shortener's Analytics Service runs on three servers. Each server has a real-time clock. Server A says it's 14:00:00.000. Server B says it's 14:00:00.003. Server C says it's 13:59:59.997.

These three clocks disagree by a few milliseconds. That seems fine — until you need to order events:

Server A records a click at 14:00:00.001
Server C records the same link's metadata update at 13:59:59.998

Based on physical timestamps, the metadata update (C, 13:59:59.998) happened before the click (A, 14:00:00.001). But in reality, the metadata update was triggered by a notification that included information from the click — the click caused the update, not the other way around.

Physical clocks in distributed systems drift. NTP can synchronise them to within a few milliseconds, but "a few milliseconds" matters when events happen microseconds apart on different machines. You cannot rely on physical timestamps to establish causal order between events on different machines.

This is the problem logical clocks solve.
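The mis-ordering is easy to reproduce. In this sketch the calendar date and the `causal_seq` field are assumptions added for illustration; `causal_seq` stands in for what actually happened (the click triggered the update).

```python
from datetime import datetime

# (description, server, local wall-clock reading, causal_seq)
events = [
    ("click", "A", datetime(2025, 3, 3, 14, 0, 0, 1000), 1),
    ("metadata update", "C", datetime(2025, 3, 3, 13, 59, 59, 998000), 2),
]

by_wall_clock = [e[0] for e in sorted(events, key=lambda e: e[2])]
by_causality = [e[0] for e in sorted(events, key=lambda e: e[3])]

print(by_wall_clock)   # ['metadata update', 'click']
print(by_causality)    # ['click', 'metadata update']
```

Sorting on wall-clock readings puts the effect before its cause, because Server C's clock runs a few milliseconds behind Server A's.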

The core idea

A logical clock doesn't measure real time — it measures causal order: whether one event happened-before another. The fundamental insight from Leslie Lamport's 1978 paper: if event A caused event B (A's message was received by the process that executed B), then A happened-before B. If A and B are on different machines with no causal connection, they are concurrent — neither happened before the other, regardless of their physical timestamps.

The analogy: chapter numbers in a book exchange

Two authors are collaborating on a book, each writing different chapters and emailing drafts to each other. The emails take random amounts of time to arrive.

Author A writes Chapter 3, then emails it to Author B. Author B reads it and responds with Chapter 4, which incorporates Chapter 3's ideas. A simple fact: Chapter 4 happened-after Chapter 3 — not because the clock says so, but because Chapter 4 causally depends on Chapter 3.

If Author C (who wasn't in the exchange) writes Chapter 5 independently at the same time, Chapter 5 is concurrent with Chapter 3 and Chapter 4 — neither happened-before the other in the causal sense.

Logical clocks are chapter numbers: they track the causal chain of revisions, not wall-clock time.

The happened-before relation

Lamport defined the happened-before relation (→) with three rules:

Within one process: if event A occurs before event B in the same process, then A → B
Message passing: if A is the sending of a message and B is the receipt of that message, then A → B
Transitivity: if A → B and B → C, then A → C

Events that are neither A → B nor B → A are concurrent (written A ‖ B).

This relation is a partial order — not every pair of events is ordered. Concurrent events have no meaningful "before/after" relationship.
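The three rules can be turned into a small reachability computation. This is a sketch over a hand-written set of events: the function names, the event labels, and the single message from a1 to b1 are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def happened_before_relation(
    process_logs: Dict[str, List[str]],
    messages: List[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Map each event to the set of events it happened-before."""
    edges: Dict[str, Set[str]] = {
        event: set() for log in process_logs.values() for event in log
    }
    for log in process_logs.values():        # rule 1: order within a process
        for earlier, later in zip(log, log[1:]):
            edges[earlier].add(later)
    for send, recv in messages:              # rule 2: send precedes receipt
        edges[send].add(recv)

    def reachable(start: str) -> Set[str]:   # rule 3: transitivity
        seen: Set[str] = set()
        stack = list(edges[start])
        while stack:
            event = stack.pop()
            if event not in seen:
                seen.add(event)
                stack.extend(edges[event])
        return seen

    return {event: reachable(event) for event in edges}


def concurrent(hb: Dict[str, Set[str]], x: str, y: str) -> bool:
    return x != y and y not in hb[x] and x not in hb[y]


hb = happened_before_relation(
    {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]},
    messages=[("a1", "b1")],
)
print(sorted(hb["a1"]))              # ['a2', 'b1', 'b2']
print(concurrent(hb, "a2", "b2"))    # True
```

Here a1 happened-before b2 only through transitivity, while a2 and b2 have no path between them in either direction, so they are concurrent.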

Why this matters

Conflict resolution: if two writes to the same key are concurrent (neither happened-before the other), they conflict. Cassandra resolves this with last-write-wins on timestamps, which may silently discard one write. Systems that track vector clocks can instead detect the conflict and surface it for application-level resolution.

Debugging distributed systems: reconstructing the causal chain of events in a distributed incident requires happened-before order, not just physical timestamps. "This cache miss happened before that database query" is a causal statement, not a temporal one.

Consistency in replicated systems: when a client reads its own writes, the system must guarantee it sees events that happened-after its writes — not just events that have a later physical timestamp.

Partial vs total order

Physical clocks give a total order: every event has a wall-clock time, and times can always be compared. The problem: the comparison is unreliable (clocks drift) and doesn't reflect causality.

Happened-before gives a partial order: some events are ordered (causally related), others are concurrent (no causal relationship). This is more accurate, but it doesn't order everything.

Logical clock implementations (Lamport timestamps, vector clocks) bridge the gap:

Lamport timestamps: extend partial order to total order by breaking ties arbitrarily. Simple, but can't detect concurrency.
Vector clocks: preserve partial order, detect concurrency explicitly. More complex, but more informative.

The one thing to remember

Physical clocks can't reliably establish causal order in distributed systems — they drift, and milliseconds of disagreement matter. Logical clocks track causal relationships instead: if A caused B (via message passing or within a process), A happened-before B. This partial order is what distributed systems actually need for correctness: detecting conflicts, establishing read-your-writes guarantees, and reconstructing event causality in debugging. Lamport timestamps and vector clocks (the next two posts) are the concrete implementations of this idea.

← Previous: Gossip Protocol — decentralised cluster membership and state propagation without a leader or consensus.

→ Next: Lamport Timestamps — the simplest logical clock: assigning monotonic integers to events to capture causal order.

Gossip Protocol: Decentralised Cluster Communication

Cloud Tuned — Sun, 02 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design

#	Post	What it covers
00	Distributed Systems: What Happens When Machines Disagree	Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar.
01	Network Partitions: The Failure Mode You Can't Design Away	Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice.
02	Split-Brain: When Two Nodes Both Think They're the Leader	Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it.
03	Heartbeats: How Nodes Know Their Peers Are Alive	Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work.
04	Leader Election: Agreeing on Who's in Charge	Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve.
05	Consensus Algorithms: Agreeing on a Value Across Failures	Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used.
06	Quorum: How Many Nodes Must Agree?	Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB.
07	Paxos: The Algorithm That Started It All	Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production.
08	Raft: Consensus Made Understandable	Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV.
09	Gossip Protocol: Decentralised Cluster Communication ← you are here	Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production.
10	Logical Clocks: When Physical Time Isn't Enough	Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works.
11	Lamport Timestamps: Ordering Events Without a Global Clock	Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you.
12	Vector Clocks: Knowing When Events Are Truly Concurrent	Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations.
13	Distributed Transactions: When One Machine Isn't Enough	Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative.
14	Two-Phase Commit: Coordinating a Distributed Decision	2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems.
15	Three-Phase Commit: Solving 2PC's Blocking Problem	3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production.
16	Delivery Semantics: What Does "Delivered" Actually Mean?	Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate.
17	Change Data Capture: Streaming Your Database in Real Time	CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it.
18	Erasure Coding: Fault Tolerance Without Full Replication	Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication.
19	Merkle Trees: Efficiently Finding What's Different	Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy.
20	Observability: Understanding Your System at Runtime	Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together.
21	Distributed Systems: Wrap-Up	A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series.

The problem

Your Cassandra cluster has 100 nodes. Each node needs to know the health status of every other node — which are alive, which are dead, which are handling which token ranges. With 100 nodes, that's each node needing updates from 99 others.

A centralised approach: one coordinator node receives status from all nodes and broadcasts updates. The coordinator is a single point of failure and a bottleneck for a 100-node cluster.

A direct broadcast: each node broadcasts its status to all 99 others. With 100 nodes sending 99 messages each per second, that's 9,900 messages per second — fine at 100 nodes, but quadratically worse as the cluster grows. A 1,000-node cluster would generate nearly a million messages per second just for status updates.
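The arithmetic behind those figures can be checked with a few lines of Python. The function names and the fanout of 3 are assumptions chosen for illustration.

```python
def broadcast_messages_per_round(n: int) -> int:
    """Every node sends its status to every other node."""
    return n * (n - 1)


def gossip_messages_per_round(n: int, fanout: int = 3) -> int:
    """Every node sends its status to a fixed number of random peers."""
    return n * fanout


for n in (100, 1_000, 10_000):
    print(n, broadcast_messages_per_round(n), gossip_messages_per_round(n))
```

Broadcast traffic grows with the square of the cluster size, while gossip traffic grows in proportion to it.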

Gossip protocols solve this with a different model: each node only talks to a few random peers, but information spreads exponentially fast across the cluster through the chain of random communications.

The core idea

A gossip protocol (also called an epidemic protocol) propagates information through a cluster by having each node periodically select a small number of random peers and exchange information with them. Like a rumour spreading through a social network, information reaches every node in O(log N) rounds, that is, in a number of rounds logarithmic in the cluster size, without any central coordinator and without each node communicating with all others directly.

The analogy: rumour spreading through a school

A rumour starts with one student. Every few minutes, each student who knows the rumour tells a randomly selected other student. After one round, 2 students know. After two rounds, ~4 know. After three rounds, ~8. In a school of 100 students, after about 7 rounds (log₂(100) ≈ 6.6), nearly everyone knows.

No one is in charge. No one coordinates who tells whom. The rumour spreads naturally through random pairwise contact. This is gossip: decentralised, scalable, resilient. If a few students are absent, the rumour still spreads; it may just take an extra round or two to route around them.

How gossip works

The basic mechanism

Every T seconds (typically 1 second), each node:

Selects 1–3 random peers from its known cluster membership
Sends its current state vector to those peers
Receives their state in return
Updates its local view of cluster state with anything newer than what it already has

Node A gossips with Node B:
  A sends: {A: (alive, gen=5, ver=42), C: (alive, gen=3, ver=19)}
  B sends: {B: (alive, gen=4, ver=31), C: (alive, gen=3, ver=20), D: (suspected, gen=2)}

After exchange:
  A now knows: B is at version 31, C is at version 20 (newer than 19), D is suspected
  B now knows: A is at version 42

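The version number (or generation + version pair) indicates how current each piece of information is. In Cassandra's scheme the generation changes when a node restarts and the version increases as its state changes, so entries are compared by generation first and then by version. The higher pair is the more recent.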
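The merge step in the exchange above can be sketched in Python as follows. The NodeState class and merge function are hypothetical names, and because the example gives no version for node D, the 0 used here is only a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class NodeState:
    status: str
    generation: int
    version: int

    def freshness(self) -> Tuple[int, int]:
        # Compare generation first, then version.
        return (self.generation, self.version)


def merge(local: Dict[str, NodeState], remote: Dict[str, NodeState]) -> Dict[str, NodeState]:
    """Return the local view updated with any fresher entries from the peer."""
    merged = dict(local)
    for node, incoming in remote.items():
        current = merged.get(node)
        if current is None or incoming.freshness() > current.freshness():
            merged[node] = incoming
    return merged


a_view = {"A": NodeState("alive", 5, 42), "C": NodeState("alive", 3, 19)}
b_view = {
    "B": NodeState("alive", 4, 31),
    "C": NodeState("alive", 3, 20),
    "D": NodeState("suspected", 2, 0),  # version not given in the example: placeholder
}

a_after = merge(a_view, b_view)
b_after = merge(b_view, a_view)
print(a_after["C"], a_after["D"].status, b_after["A"].version)
```

Because the merge only ever keeps the fresher entry, applying it in either direction, or more than once, gives the same result. That is what lets gossip tolerate duplicated and reordered messages.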

Convergence rate

In a cluster of N nodes with each node gossiping with k peers per round, new information reaches all nodes within O(log_k(N)) rounds. For k=3 and N=100: log₃(100) ≈ 4.2 rounds. With 1-second gossip intervals, the entire cluster learns new state within ~5 seconds.

This is remarkably efficient: 100 nodes, each sending 3 messages per second, adds up to 300 messages per second for cluster-wide state propagation. The per-node cost stays constant as the cluster grows, and the total grows linearly with N rather than quadratically as it does with direct broadcast.
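A small simulation shows how quickly a push-style gossip spreads. This is a sketch under simplifying assumptions (synchronous rounds, no failures, peers chosen uniformly at random, and a node may pick itself or an already-informed peer), and rounds_to_full_spread is a hypothetical helper.

```python
import random


def rounds_to_full_spread(n: int, fanout: int, seed: int = 0) -> int:
    """Count gossip rounds until all n nodes have heard a rumour started by node 0."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        newly_informed = set()
        for _ in informed:
            # Each informed node pushes the rumour to `fanout` random peers.
            newly_informed.update(rng.sample(range(n), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


for size in (100, 1_000, 10_000):
    print(size, rounds_to_full_spread(size, fanout=3))
```

In runs like this the round count grows slowly as the cluster gets larger. It tends to land somewhat above the logarithmic estimate because the last few uninformed nodes are only reached by chance.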

Failure detection via gossip

Cassandra uses gossip for failure detection in combination with heartbeats:

Each node gossips its own heartbeat counter (incrementing every second) to its peers
If a peer's heartbeat counter hasn't increased in the expected time, its "phi" suspicion level rises (the phi accrual detector from post 03)
When φ exceeds a threshold, the node is marked as suspected, then down

Marking a node down propagates through gossip: node A marks B as down, gossips to C and D, who update their views and gossip to E and F. Within seconds, the whole cluster considers B down.
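The suspicion level can be sketched as follows. This is a simplified illustration that assumes heartbeat gaps follow an exponential distribution. The class name, window size, and threshold value are assumptions for the example, not Cassandra's actual code.

```python
import math
from collections import deque
from typing import Deque, Optional


class PhiAccrualDetector:
    def __init__(self, window: int = 1000, threshold: float = 8.0) -> None:
        self.intervals: Deque[float] = deque(maxlen=window)  # recent heartbeat gaps
        self.last_heartbeat: Optional[float] = None
        self.threshold = threshold

    def heartbeat(self, now: float) -> None:
        """Record that the peer's heartbeat counter advanced at time `now`."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """phi = -log10(P(gap > elapsed)), with P = exp(-elapsed / mean)."""
        if not self.intervals or self.last_heartbeat is None:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        return elapsed / (mean * math.log(10))

    def is_suspected(self, now: float) -> bool:
        return self.phi(now) > self.threshold


detector = PhiAccrualDetector()
for t in range(10):          # heartbeats arrive once a second, the last at t=9
    detector.heartbeat(float(t))

print(detector.phi(10.0), detector.is_suspected(10.0))
print(detector.phi(30.0), detector.is_suspected(30.0))
```

Rather than a fixed timeout, phi rises continuously with the silence, so the threshold trades detection speed against false positives.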

What gossip is used for

Cluster membership: which nodes are in the cluster, which are healthy, which are joining or leaving. Cassandra, Redis Cluster, Consul.

Ring topology in consistent hashing: when a node joins or leaves the Cassandra cluster, its token assignments propagate via gossip. Within seconds, all nodes know the updated ring.

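Data anti-entropy (Cassandra): Cassandra's repair process compares Merkle tree hashes of replicas' data to identify and repair inconsistencies between them (covered in post 19). It applies the same anti-entropy idea to the data itself, although repair runs as its own operation rather than inside the membership gossip messages.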

Service health in Consul: Consul agents gossip health check results. A failing service's unhealthiness propagates to all agents within seconds without any central coordinator.

Tradeoffs

Eventual consistency of the cluster view. Gossip doesn't provide strong consistency — there's always a brief window where different nodes have different views of cluster state. A node that was just marked down may still be in some nodes' "alive" view for a few seconds. This is acceptable for membership and health propagation; it's not suitable for consensus or coordination decisions (use Raft/Paxos for those).

Message size grows with the amount of state. Each gossip message carries the sender's view of the cluster. If each node stores metadata about N other nodes, a full-state message grows with N, so per-message bandwidth rises with cluster size even though the number of messages per node stays constant. For very large clusters (thousands of nodes), this can become expensive. Solutions: cap the amount of state included per gossip round, or use hierarchical gossip (gossip within a zone, then between zones).

No total ordering. Gossip delivers information eventually but not in any guaranteed order. If two events happen simultaneously on different nodes, different nodes may learn about them in different orders. This is fine for membership changes; it's not acceptable for ordered log replication.

The one thing to remember

Gossip protocols spread information across a cluster in O(log N) rounds by having each node exchange state with a few random peers — no central coordinator, no single point of failure. Each individual exchange is cheap; the collective effect is rapid, resilient propagation. Cassandra, Redis Cluster, and Consul all use gossip for cluster membership because it scales naturally, tolerates node failures, and requires no dedicated coordination infrastructure. The cost: eventual consistency of the cluster view, not instantaneous agreement.

← Previous: Raft — the consensus algorithm designed for understandability; the most important one to know in practice.

→ Next: Logical Clocks — in distributed systems, physical clocks can't be trusted; logical clocks capture causal relationships between events instead.

Raft: Consensus Made Understandable

Cloud Tuned — Sat, 01 Mar 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

The problem

Paxos is correct. It is also notoriously difficult to understand, implement correctly, and explain to engineers who need to reason about the systems built on it. In their 2014 paper, Diego Ongaro and John Ousterhout made this case plainly and backed it with a user study: students who learned both algorithms found Raft significantly easier to understand than Paxos. They designed Raft with one primary goal: understandability.

The result is an algorithm that provides the same safety guarantees as Paxos with a structure that engineers can explain, implement, and debug. Raft is now one of the most widely used consensus algorithms in production systems: etcd, CockroachDB, TiKV, and Consul all use Raft. Kubernetes depends on etcd. Many of the distributed databases shipped in the last decade depend on Raft.

The core idea

Raft decomposes consensus into three relatively independent subproblems: leader election (choosing one node to coordinate), log replication (the leader accepts entries and replicates them to followers), and safety (ensuring no two nodes ever commit different values at the same log index). Each subproblem is tractable independently; together they provide a complete consensus protocol.

The analogy: a parliamentary legislature with a speaker

A parliament has a Speaker (the leader) who controls the floor. Only the Speaker can introduce legislation (log entries). When the Speaker proposes a bill, members vote on whether to pass it. If a majority passes it, it becomes law (committed). If the Speaker is absent, parliament elects a new Speaker before proceeding.

The legislature's rule: all laws are passed in the order the Speaker introduced them. No law is rescinded once passed. If the Speaker changes, the new Speaker inherits all previously passed laws and introduces new ones from where the old Speaker left off.

This is Raft's model, exactly.

How Raft works

Terms

Raft uses terms — monotonically increasing logical time units — to distinguish legitimate leaders from stale ones. Each election starts a new term. A node that receives a message with a higher term immediately updates its term and reverts to follower. This prevents an old leader from reasserting authority after recovering from a partition.

Leader election

All nodes start as followers. A follower that doesn't receive a heartbeat from the leader within electionTimeout becomes a candidate and starts an election:

Increments its term
Votes for itself
Sends RequestVote RPCs to all other nodes

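A node grants its vote if the candidate's term is not behind its own, it hasn't already voted for a different candidate in this term, AND the candidate's log is at least as up-to-date as its own (the candidate's last entry has a higher term, or the same term and an index at least as high).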

The candidate wins if it receives votes from a majority (⌊N/2⌋ + 1 of N nodes, including itself). It immediately begins sending heartbeats to prevent new elections.

If no candidate wins (split vote), the election times out and a new election starts with a higher term. Randomised election timeouts (each node picks a random delay before starting a candidacy) break ties and ensure elections converge quickly.
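The vote-granting rule can be sketched in Python as follows. The Node class and its field names are hypothetical, and the sketch leaves out networking, persistence, and timers.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)

    def last_log(self) -> Tuple[int, int]:
        """(term, index) of the last log entry; (0, 0) for an empty log."""
        if not self.log:
            return (0, 0)
        return (self.log[-1][0], len(self.log))

    def handle_request_vote(self, term: int, candidate_id: str,
                            last_log_term: int, last_log_index: int) -> bool:
        if term < self.current_term:
            return False                 # stale candidate
        if term > self.current_term:
            self.current_term = term     # newer term: forget any earlier vote
            self.voted_for = None
        # Tuple comparison: higher last term wins, then the higher index.
        up_to_date = (last_log_term, last_log_index) >= self.last_log()
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id
            return True
        return False


voter = Node("B", current_term=2, log=[(1, "w1"), (2, "w2")])
print(voter.handle_request_vote(1, "A", 2, 5))  # candidate's term is behind
print(voter.handle_request_vote(3, "A", 1, 5))  # longer log, but older last term
print(voter.handle_request_vote(3, "C", 2, 2))  # log as up-to-date as the voter's
print(voter.handle_request_vote(3, "D", 3, 9))  # already voted in term 3
```

Note that a longer log does not win on its own: the term of the last entry is compared first.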

Log replication

The leader accepts client commands and appends them to its log as new entries. Each entry has an index, term number, and the command.

The leader sends AppendEntries RPCs to all followers with the new entry. Followers append the entry if it's consistent with their log (the entry before it matches). The leader waits for a majority of nodes to acknowledge the entry.

Once a majority acknowledge, the entry is committed — it will survive any future leader changes. The leader applies the entry to its state machine and returns the result to the client.

Log:
Index: 1    2    3    4    5
Term:  1    1    2    2    2
Cmd:   w1   w2   w3   w4   w5 (uncommitted)
                      ↑
               commit index = 4 (majority acknowledged through here)

Followers learn the commit index via subsequent AppendEntries RPCs and apply committed entries to their state machines.
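One way a leader can derive the commit index is to look at the highest log index each node is known to have stored and take the highest value that a majority has reached. The sketch below assumes a list of such indices (one per node, leader included). The function name and example values are illustrative.

```python
from typing import List


def commit_index(match_index: List[int]) -> int:
    """Highest log index stored on a majority of nodes."""
    ranked = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    # The majority-th largest value is held by at least `majority` nodes.
    return ranked[majority - 1]


# Five nodes: the leader and one follower hold 5 entries, two hold 4, one holds 3.
print(commit_index([5, 5, 4, 4, 3]))
```

Raft adds one more condition that this sketch leaves out: a leader only advances the commit index this way for entries from its own current term.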

Log consistency guarantee

Raft maintains a critical invariant: if two logs have the same term and index for an entry, all preceding entries are identical. This is enforced by the consistency check in AppendEntries: before accepting a new entry, the follower verifies that the entry immediately before it matches (same term, same index). If not, it rejects the append — the leader re-sends earlier entries until consistency is established.
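A follower-side sketch of that consistency check, in Python. The function signature is an assumption for illustration: indices are 1-based, prev_index 0 means "append from the start", and terms and the commit index are left out.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Apply the leader's entries if the preceding entry matches; else reject."""
    if prev_index > len(log):
        return False                      # follower is missing earlier entries
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False                      # preceding entry conflicts
    for offset, entry in enumerate(entries):
        pos = prev_index + offset         # 0-based slot for this entry
        if pos < len(log):
            if log[pos][0] != entry[0]:
                del log[pos:]             # drop the divergent tail
                log.append(entry)
        else:
            log.append(entry)
    return True


follower = [(1, "w1"), (1, "w2"), (1, "x3")]

# The leader first assumes the follower matches through index 3 (term 2): rejected.
print(append_entries(follower, 3, 2, [(2, "w4")]))

# The leader backs up one entry and re-sends from index 3: accepted.
print(append_entries(follower, 2, 1, [(2, "w3"), (2, "w4")]))
print(follower)
```

After the second call the follower's uncommitted entry x3 has been replaced and its log matches the leader's.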

Safety: the state machine safety property

A key safety property: Raft never allows two different committed entries at the same log index. This follows from:

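At most one leader per term (from the election guarantee: each node votes once per term, and any two majorities overlap)
A leader never overwrites or deletes entries in its own log; it only appends
The vote restriction: a candidate can only win if its log is at least as up-to-date as the log of every node that votes for it, and because those voters form a majority, the winner has all committed entries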

These three properties together guarantee no two nodes ever commit different values at the same index.

Leader changes and log reconciliation

When a new leader is elected, its log may be behind followers' logs (it wasn't the most recent leader). Raft's approach: the new leader's log is authoritative. Followers that have entries beyond the leader's log have those entries overwritten (they were never committed — committed entries survive leader changes; uncommitted ones may not).

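For each follower, the leader steps back through its log until it finds the last entry on which the two logs agree. The follower then discards everything after that point and appends the leader's entries, replacing its divergent tail.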

Raft in etcd: the Kubernetes brain

Kubernetes stores all cluster state in etcd — pod specs, service definitions, configuration, secrets. etcd uses Raft for its replicated log. Every etcd write goes through Raft's commit process: proposed to the leader, replicated to followers, committed when a majority acknowledge.

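An etcd cluster of 3 nodes tolerates 1 failure; a cluster of 5 tolerates 2. Only the API server talks to etcd directly; the scheduler and controller manager read and write cluster state through the API server. If etcd loses quorum (fewer than a majority of its nodes remain available), Kubernetes cannot make decisions: scheduling stops and deployments freeze, although pods that are already running keep running.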
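The failure-tolerance figures follow directly from the majority rule. A short Python check, with function names chosen for illustration:

```python
def quorum(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can fail while a majority still remains."""
    return n - quorum(n)


for n in (3, 4, 5, 6, 7):
    print(n, quorum(n), tolerated_failures(n))
```

An even-sized cluster tolerates no more failures than the odd size below it, which is why odd cluster sizes are the usual choice.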

This is the price of strong consistency: etcd will never return incorrect data, but it may be unavailable if enough nodes fail.

Tradeoffs

Leader is the bottleneck. All writes route through the leader. For high-write systems, a single Raft group's throughput is limited by one node's capacity. Systems like CockroachDB and TiKV shard data across many independent Raft groups (one per key range) to parallelise write throughput.

Quorum requirement limits availability. A Raft group of 3 requires 2 to be reachable. If one node is cut off in a degraded availability zone and another crashes, the cluster loses quorum and stops accepting writes. Running 3 nodes across 3 availability zones is the typical pattern: it survives the loss of any one zone while keeping strong consistency.

Linearisable reads require care. A stale read from a follower is possible, because the follower may not have applied the latest committed entries. Linearisable reads route through the leader or use a read lease mechanism.

The one thing to remember

Raft achieves consensus through an elected leader that replicates log entries to a majority of followers before committing. Safety comes from three rules: only one leader per term, leaders never overwrite committed entries, and candidates must have an up-to-date log to be elected. Raft is the most important consensus algorithm to understand in practice — it directly powers Kubernetes (via etcd), CockroachDB, TiKV, and Consul. When you interact with any of these systems, you're interacting with Raft's guarantees.

← Previous: Paxos — the foundational consensus algorithm, notoriously difficult to understand but the basis of many production systems.

→ Next: Gossip Protocol — decentralised cluster membership and state propagation without a leader or consensus.

Paxos: The Algorithm That Started It All

Cloud Tuned — Fri, 28 Feb 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

The problem

You need five distributed nodes to agree on a single value — which server should be the leader, what the value of a configuration key should be, or whether a transaction should commit. Any node can propose a value. Any node can fail. Messages can be delayed, lost, or reordered. No node can know for certain what other nodes have decided.

This is the consensus problem. Paxos, devised by Leslie Lamport in 1989 (published 1998), was one of the first algorithms proven to solve it safely in an asynchronous network with crash failures: it never lets two different values be chosen, and it makes progress whenever a majority of nodes can communicate for long enough. Much of what followed, including Multi-Paxos, Raft, and Zab, builds on its ideas.

The core idea

Paxos runs in two phases. In Phase 1 (Prepare), a proposer asks a majority of acceptors to promise not to accept lower-numbered proposals. In Phase 2 (Accept), the proposer sends its chosen value; acceptors accept it if they haven't since promised a higher-numbered proposal. Once a majority of acceptors accept the same value, it is chosen, irrevocably.

The analogy: reserving a meeting room

You want to book a conference room. The booking system has three administrators (acceptors). To book, you:

Phase 1 (Prepare/Promise): Call all three admins: "I'm proposing booking ID #42. Will you promise not to accept any booking with an ID lower than 42?"

If they've already promised a higher number, they say "No (and by the way, the highest I've seen is #48)"
If #42 is the highest they've seen, they say "Yes, I promise"

If a majority promise: proceed to Phase 2. If not: your booking ID is too low — try again with a higher ID.

Phase 2 (Accept/Commit): Tell the same admins: "Please record this booking for ID #42: Meeting room B, 3pm Tuesday"

If they haven't promised anything higher than #42 since Phase 1: they accept
If they have (another proposer snuck in): they reject

If a majority accept: the booking is made. It can never be unmade. If not: start over with a higher proposal number.

How Paxos works in detail

Roles

Proposer: a node that initiates a consensus round by proposing a value.

Acceptor: a node that receives proposals and accepts or rejects them. Maintains the highest proposal number it has promised and the last proposal (number and value) it accepted.

Learner: a node that learns the decided value (the chosen value) to act on it.

In practice, nodes often play all three roles.

Phase 1: Prepare

The proposer selects a globally unique, monotonically increasing proposal number n. It sends Prepare(n) to a majority of acceptors.
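One common way to get proposal numbers that are both unique and increasing is to pair a round counter with the proposer's node id. The sketch below is one possible scheme, not something the algorithm prescribes; ProposalNumber and next_proposal are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ProposalNumber:
    # Compared field by field: round first, node id as the tie-breaker.
    round: int
    node_id: int


def next_proposal(highest_seen: ProposalNumber, node_id: int) -> ProposalNumber:
    """Pick a number higher than anything this proposer has seen so far."""
    return ProposalNumber(highest_seen.round + 1, node_id)


seen = ProposalNumber(round=4, node_id=3)
mine = next_proposal(seen, node_id=1)
theirs = next_proposal(seen, node_id=2)
print(mine > seen, mine != theirs, mine < theirs)
```

Two proposers that pick the same round still produce different numbers, because their node ids differ.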

Each acceptor responds with a promise if n is higher than any proposal number it has already promised: it will not accept any proposal numbered less than n. If the acceptor has previously accepted a value, it includes that value and its proposal number in its response (so the proposer knows about any previously accepted value it must preserve).

Proposer → Prepare(n=5) → Acceptors A, B, C
A: hasn't accepted anything → Promise(n=5, accepted=null)
B: previously accepted (n=3, v="foo") → Promise(n=5, accepted=(3, "foo"))
C: doesn't respond (failed or delayed)

Proposer receives promises from A and B (majority of 3) → can proceed.

Phase 2: Accept

The proposer picks a value:

If any acceptor reported a previously accepted value, the proposer must use the value from the highest-numbered accepted proposal (v="foo" from n=3 in the example above)
If no acceptor reported a prior accepted value, the proposer uses its own proposed value

The proposer sends Accept(n, v) to a majority.

Each acceptor accepts if it hasn't promised a higher proposal number since Phase 1.

Proposer → Accept(n=5, v="foo") → Acceptors A, B, C
A: promised n=5, no higher → Accept(5, "foo") ✓
B: promised n=5, no higher → Accept(5, "foo") ✓

When a majority accept the same (n, v), the value v is chosen (committed). It will be the decision regardless of future rounds.
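The two phases can be put together in a compact single-decree sketch. It runs in one process with direct method calls, so message loss, persistence, and learners are not modelled. The class and function names are assumptions for illustration, and an unresponsive acceptor is represented simply by leaving it out of the reachable list.

```python
from typing import List, Optional, Tuple


class Acceptor:
    def __init__(self) -> None:
        self.promised = 0                                # highest number promised
        self.accepted: Optional[Tuple[int, str]] = None  # (number, value)

    def on_prepare(self, n: int) -> Tuple[bool, Optional[Tuple[int, str]]]:
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def on_accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher number has been promised since."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(reachable: List[Acceptor], cluster_size: int,
            n: int, value: str) -> Optional[str]:
    """Run both phases; return the chosen value, or None if the round failed."""
    majority = cluster_size // 2 + 1
    promises = [prior for ok, prior in (a.on_prepare(n) for a in reachable) if ok]
    if len(promises) < majority:
        return None
    prior_accepted = [p for p in promises if p is not None]
    if prior_accepted:
        # Must adopt the value of the highest-numbered accepted proposal.
        value = max(prior_accepted, key=lambda p: p[0])[1]
    accepts = sum(a.on_accept(n, value) for a in reachable)
    return value if accepts >= majority else None


# Mirrors the example: A has accepted nothing, B accepted (3, "foo"), C is unreachable.
a, b = Acceptor(), Acceptor()
b.promised, b.accepted = 3, (3, "foo")
print(propose([a, b], cluster_size=3, n=5, value="bar"))
print(propose([a, b], cluster_size=3, n=4, value="baz"))
```

The proposer wanted "bar" but ends up proposing "foo", because a previously accepted value takes precedence. The second call uses a lower number than the acceptors have already promised, so it fails and would need to be retried with a higher one.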

Why Paxos is safe

No two values can be chosen: only one proposal can achieve majority acceptance at any given proposal number (acceptors only accept one proposal per number). Any subsequent majority must include at least one node that accepted the current value, forcing the proposer to propagate it.

A chosen value is preserved: if value v was chosen in round n, every higher-numbered round's Phase 1 will discover the previously accepted v and must propose it, never a different value.

Why Paxos is hard to implement correctly

Liveness is not guaranteed. If two proposers compete with increasing proposal numbers, they can livelock indefinitely: proposer A interrupts proposer B's Phase 1 with a higher number, B does the same to A, neither completes. Randomised backoff or a distinguished proposer (Multi-Paxos) solves this.
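One common way to apply randomised backoff is exponential backoff with full jitter. The sketch below is only illustrative: the function name and the `base` and `cap` defaults are assumptions, not values taken from any particular Paxos implementation.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 2.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retrying after `attempt` failed rounds (0-based)."""
    if rng is None:
        rng = random.Random()
    # The ceiling doubles with each failed attempt, up to `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Picking uniformly below the ceiling makes it unlikely that two
    # competing proposers retry at the same moment.
    return rng.uniform(0.0, ceiling)


rng = random.Random(42)  # seeded only to make the demo repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(8)]
```

A proposer whose Prepare or Accept is rejected would sleep for `backoff_delay(attempt)` before trying again with a higher proposal number, giving its rival a window to finish.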

The algorithm doesn't specify how to handle leader crashes. What happens to an in-progress proposal when the proposer fails? Paxos proves the value is safe — no conflicting decision will be made — but recovering the current state requires a new proposal round.

Multi-Paxos for log replication. Single-Decree Paxos agrees on one value. For a replicated log, you need Multi-Paxos: a stable leader that skips Phase 1 for log entries (the leader's authority was established once during its election), running only Phase 2 for each entry. This is how Google Chubby and Spanner replicate their logs; ZooKeeper takes a similar leader-based approach with its own protocol, Zab.

Implementation requires careful handling of: network message deduplication, acceptor state persistence (must survive crashes), leader leases, read linearizability, and cluster membership changes. Google's account of building the Paxos layer underneath Chubby, "Paxos Made Live" (Chandra, Griesemer, and Redstone, 2007), describes many non-obvious correctness considerations.

Where Paxos is used in production

Google Chubby: distributed lock service. Uses Multi-Paxos for its replicated log. Chubby powers leader election across most of Google's infrastructure.

Google Spanner: uses Paxos groups (one Paxos instance per data shard) for globally-consistent distributed transactions.

Apache ZooKeeper: uses Zab (ZooKeeper Atomic Broadcast), a leader-based protocol closely related to Paxos, for its replicated log. ZooKeeper coordinates Kafka, HBase, and many other distributed systems.

CockroachDB: not Paxos itself, but Multi-Raft (many independent Raft groups; Raft is a more understandable reformulation of Multi-Paxos) for its replicated key-value store.

The one thing to remember

Paxos guarantees that a distributed system agrees on exactly one value: any two majorities of acceptors overlap, so once a majority has accepted a value, every later proposer learns of it during the prepare phase and must re-propose it. The two-phase structure, prepare (gather promises) and accept (commit the value), underlies most practical consensus algorithms. Paxos is correct but notoriously difficult to implement; this is why Raft was created to express the same guarantees more clearly.

← Previous: Quorum — the minimum vote count that makes consensus safe, and how choosing the right quorum size balances consistency against availability.

→ Next: Raft — the consensus algorithm designed for understandability; the most important one to know in practice.

Quorum: How Many Nodes Must Agree?

Cloud Tuned — Thu, 27 Feb 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design

#	Post	What it covers
00	Distributed Systems: What Happens When Machines Disagree	Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar.
01	Network Partitions: The Failure Mode You Can't Design Away	Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice.
02	Split-Brain: When Two Nodes Both Think They're the Leader	Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it.
03	Heartbeats: How Nodes Know Their Peers Are Alive	Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work.
04	Leader Election: Agreeing on Who's in Charge	Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve.
05	Consensus Algorithms: Agreeing on a Value Across Failures	Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used.
06	Quorum: How Many Nodes Must Agree? ← you are here	Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB.
07	Paxos: The Algorithm That Started It All	Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production.
08	Raft: Consensus Made Understandable	Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV.
09	Gossip Protocol: Decentralised Cluster Communication	Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production.
10	Logical Clocks: When Physical Time Isn't Enough	Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works.
11	Lamport Timestamps: Ordering Events Without a Global Clock	Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you.
12	Vector Clocks: Knowing When Events Are Truly Concurrent	Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations.
13	Distributed Transactions: When One Machine Isn't Enough	Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative.
14	Two-Phase Commit: Coordinating a Distributed Decision	2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems.
15	Three-Phase Commit: Solving 2PC's Blocking Problem	3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production.
16	Delivery Semantics: What Does "Delivered" Actually Mean?	Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate.
17	Change Data Capture: Streaming Your Database in Real Time	CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it.
18	Erasure Coding: Fault Tolerance Without Full Replication	Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication.
19	Merkle Trees: Efficiently Finding What's Different	Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy.
20	Observability: Understanding Your System at Runtime	Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together.
21	Distributed Systems: Wrap-Up	A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series.

The problem

Your Cassandra cluster has five nodes. You write a record with replication factor 3 (the record is stored on 3 of the 5 nodes). A network partition isolates one of those three replica nodes. Another replica is temporarily overloaded and slow.

How many replicas must acknowledge the write before you return success to the client? And how many replicas must agree on a read before you return the result?

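If you wait for all three replicas: high consistency, but the write fails if any replica is unavailable. If each replica is independently available 99.9% of the time (three nines, about 8.8 hours of downtime per year), all three are up together only about 99.7% of the time, so writes fail for roughly 26 hours per year.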

If you wait for just one replica: high availability, but you might read from a replica that hasn't received the latest write. Data is inconsistent.

If you wait for two of three: a quorum. A majority. The write succeeds if any two replicas respond. Reads can check two replicas and return the latest value. And, critically, the read quorum and write quorum overlap: of three replicas, the two that acknowledged a write and the two that answer a read must have at least one node in common.

This overlap is the key insight behind quorum-based consistency.
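To put rough numbers on the three options, the sketch below computes how often at least k of n replicas are up. It assumes each replica is available 99.9% of the time and that replicas fail independently; both are simplifications, since real outages are often correlated, and the function name is hypothetical.

```python
from math import comb


def availability(n: int, k: int, p: float) -> float:
    """Probability that at least k of n replicas are up, each up with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


HOURS_PER_YEAR = 24 * 365

for k in (3, 2, 1):
    a = availability(3, k, 0.999)
    unavailable_hours = (1 - a) * HOURS_PER_YEAR
    print(f"wait for {k} of 3: availability {a:.7f}, "
          f"about {unavailable_hours:.3f} hours/year unavailable")
```

Under these assumptions, waiting for all three replicas costs about 26 hours of failed writes per year, while waiting for two of three costs under two minutes. That gap is why a majority quorum is the usual default.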

The core idea

A quorum is the minimum number of nodes that must participate in an operation for it to be considered valid. By ensuring that the read quorum and write quorum overlap, a distributed system can guarantee that a read will always include at least one node that has the latest write — without requiring all replicas to be available.

The analogy: a jury that requires more than half

Picture a 12-person jury under a simple-majority rule: a conviction needs at least 7 votes (more than half), and so does an acquittal. A valid conviction and a valid acquittal can never exist at the same time, because two groups of 7 drawn from 12 jurors must share at least two members, and no juror votes both ways.

In distributed systems: the write quorum and read quorum must overlap, ensuring every valid read includes at least one node that participated in the last valid write.

How quorum works

The R + W > N condition

Given:

N = replication factor (number of replicas storing the data)
W = write quorum (number of replicas that must acknowledge a write)
R = read quorum (number of replicas that must respond to a read)

Consistency is guaranteed when R + W > N.

This ensures the read set and write set overlap by at least one node. That overlapping node has the latest written value.

N=3, W=2, R=2: R+W = 4 > 3 ✓ — consistent
N=3, W=1, R=3: R+W = 4 > 3 ✓ — consistent (but R=3 means all replicas must be available)
N=3, W=1, R=1: R+W = 2 ≤ 3 ✗ — consistency not guaranteed
N=5, W=3, R=3: R+W = 6 > 5 ✓ — consistent
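The overlap claim can be checked by brute force for small clusters. The following sketch enumerates every possible write set and read set and confirms that they always intersect exactly when R + W > N. The function name is hypothetical, and the check covers only set membership, not timing or versions.

```python
from itertools import combinations


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every write set of size w shares a node with every read set of size r."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


# The four configurations listed above.
for n, w, r in [(3, 2, 2), (3, 1, 3), (3, 1, 1), (5, 3, 3)]:
    assert always_overlap(n, w, r) == (r + w > n)
```

When R + W ≤ N, some write set and some read set are disjoint, and a read served by that read set cannot see the write.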

Common quorum configurations

Quorum reads and writes (W = ⌊N/2⌋ + 1, R = ⌊N/2⌋ + 1): The most common choice. For N=3: W=2, R=2. R+W=4 > 3. Consistent. Tolerates 1 failure.

Write-heavy workload (W=1, R=N): Writes are fast (one ack). Reads require all replicas, so read availability is poor: any single replica failure makes reads fail.

Read-heavy workload (W=N, R=1): Reads are fast (one response). Writes require all replicas (synchronous and durable), so they fail if any replica is down. A fit when reads dominate and write durability is paramount.

Eventual consistency (W=1, R=1): Maximum availability. R+W=2 ≤ N=3. No consistency guarantee — reads may return stale data. Used when availability is paramount and staleness is acceptable (DNS, shopping carts, real-time counters).
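The four configurations can be compared side by side with a small helper. This is an illustrative sketch: `majority` and `describe` are hypothetical names, and "failures survived" simply counts how many replicas can be down while W (or R) replicas still respond.

```python
from typing import Dict, Tuple


def majority(n: int) -> int:
    return n // 2 + 1


def describe(n: int, w: int, r: int) -> Dict[str, object]:
    return {
        "consistent": r + w > n,            # read and write sets must overlap
        "write_failures_survived": n - w,   # replicas that may be down for writes
        "read_failures_survived": n - r,    # replicas that may be down for reads
    }


configs: Dict[str, Tuple[int, int, int]] = {
    "quorum": (3, majority(3), majority(3)),
    "write-heavy": (3, 1, 3),
    "read-heavy": (3, 3, 1),
    "eventual": (3, 1, 1),
}
summary = {name: describe(*cfg) for name, cfg in configs.items()}

for name, row in summary.items():
    print(name, row)
```

Only the majority configuration is both consistent and able to lose a replica on either path. The write-heavy and read-heavy options buy speed on one path by giving up fault tolerance on the other.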

Cassandra's consistency levels

Cassandra expresses quorum in named consistency levels per operation:

Level	W or R count	Use case
`ONE`	1	Maximum availability, eventual consistency
`QUORUM`	⌊N/2⌋ + 1	Strong consistency, tolerates minority failures
`ALL`	N	Maximum durability, lowest availability
`LOCAL_QUORUM`	Majority in local DC	Multi-datacenter: consistency within DC
`EACH_QUORUM`	Majority in each DC	Multi-datacenter: consistency across DCs

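The URL shortener's click events use consistency level ONE for writes: a write should succeed as long as a single replica is reachable, and durability comes from Cassandra replicating the write to the remaining replicas in the background, not from waiting on their acknowledgements. Reads that drive analytics dashboards use LOCAL_QUORUM. With a replication factor of 3, for example, this gives R + W = 3, which does not exceed N, so a dashboard can briefly lag behind the newest clicks; that staleness is acceptable for analytics.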
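The replica counts behind each named level can be computed from the replication factor per datacenter. This is a plain-Python sketch of the arithmetic only; it does not call the Cassandra driver, and the function name and the two-datacenter layout are assumptions for illustration.

```python
from typing import Dict


def required_replicas(level: str, rf_by_dc: Dict[str, int], local_dc: str) -> int:
    """Replica responses needed for a consistency level.

    rf_by_dc maps each datacenter name to its replication factor.
    """
    total_rf = sum(rf_by_dc.values())
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return total_rf // 2 + 1
    if level == "ALL":
        return total_rf
    if level == "LOCAL_QUORUM":
        return rf_by_dc[local_dc] // 2 + 1
    if level == "EACH_QUORUM":
        # A majority is required in every datacenter; this is the combined count.
        return sum(rf // 2 + 1 for rf in rf_by_dc.values())
    raise ValueError(f"unknown consistency level: {level}")


rf = {"dc1": 3, "dc2": 3}  # hypothetical keyspace replicated to two datacenters
counts = {
    level: required_replicas(level, rf, local_dc="dc1")
    for level in ("ONE", "QUORUM", "ALL", "LOCAL_QUORUM", "EACH_QUORUM")
}
```

With three replicas in each of two datacenters, QUORUM and EACH_QUORUM both need four responses, but QUORUM accepts any four of the six while EACH_QUORUM needs two from each side. LOCAL_QUORUM needs only two and never waits on the remote datacenter.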

Quorum in consensus vs quorum in databases

Consensus quorum (Raft, Paxos): a majority of the total cluster must acknowledge each log entry before it's committed. This is a strict quorum that cannot be adjusted. With 5 nodes, 3 must respond for every write.

Database quorum (Cassandra, DynamoDB): tunable per operation. The application can choose weak consistency for writes and strong consistency for reads, or vice versa. This flexibility is the leaderless database's key advantage over consensus-based systems.

Sloppy quorum

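Dynamo-style stores use sloppy quorum during failures. If the nodes responsible for a key are unavailable, the write is temporarily accepted by other available nodes, which keep it as a hint (hinted handoff). The write satisfies quorum numerically but not strictly on the designated replicas. Cassandra also performs hinted handoff, but there a stored hint does not count toward ONE, QUORUM, or ALL; only the ANY write level can be satisfied by a hint alone.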

When the failed nodes recover, the hints are replayed — the temporarily stored data is sent to the correct nodes. This improves availability at the cost of brief inconsistency until hints are delivered.
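A toy in-memory simulation makes the hand-off and replay steps concrete. This sketch follows the Dynamo-style behaviour described above, where a hint held by a stand-in node counts toward the write quorum. The `Cluster` class and its methods are invented for this example and ignore versions, timeouts, and hint expiry.

```python
from typing import Dict, List, Set, Tuple


class Cluster:
    def __init__(self, nodes: List[str]):
        self.data: Dict[str, Dict[str, str]] = {n: {} for n in nodes}
        self.down: Set[str] = set()
        # hints[holder] = list of (intended replica, key, value)
        self.hints: Dict[str, List[Tuple[str, str, str]]] = {n: [] for n in nodes}

    def write(self, key: str, value: str, replicas: List[str], w: int) -> bool:
        acks = 0
        spares = [n for n in self.data if n not in replicas and n not in self.down]
        for target in replicas:
            if target not in self.down:
                self.data[target][key] = value
                acks += 1
            elif spares:
                # Sloppy quorum: a healthy stand-in stores the write as a hint.
                self.hints[spares.pop(0)].append((target, key, value))
                acks += 1
        return acks >= w

    def recover(self, node: str) -> None:
        self.down.discard(node)
        for holder, pending in self.hints.items():
            still_pending = []
            for target, key, value in pending:
                if target == node:
                    self.data[node][key] = value  # replay the hint
                else:
                    still_pending.append((target, key, value))
            self.hints[holder] = still_pending


cluster = Cluster(["A", "B", "C", "D", "E"])
cluster.down = {"B", "C"}
ok = cluster.write("k", "v1", replicas=["A", "B", "C"], w=2)
cluster.recover("B")
```

Between the write and the recovery, a strict read from B and C would miss the value even though the write reported success. That is the brief inconsistency that sloppy quorum accepts.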

Tradeoffs

Availability vs consistency. Increasing W or R improves consistency but reduces availability (more nodes must respond). Decreasing W or R improves availability but allows stale reads. The R+W>N formula is the mathematical boundary: cross it for consistency, fall below it for availability.

Latency. A quorum read or write completes when the W-th (or R-th) fastest replica responds, so it is as slow as the slowest member of the quorum that answered. ALL is bounded by the slowest of the N replicas. ONE returns as soon as the first replica responds. The availability/consistency tradeoff is also a latency tradeoff.
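The latency effect follows from simple ordering: an operation that needs k responses finishes when the k-th fastest replica answers. The sketch below uses made-up response times to show the gap; the function name and the numbers are illustrative only.

```python
from typing import List


def operation_latency(replica_latencies_ms: List[float], needed: int) -> float:
    """Latency of an operation that waits for `needed` replica responses."""
    return sorted(replica_latencies_ms)[needed - 1]


# Hypothetical response times: two healthy replicas and one overloaded one.
latencies = [4.0, 9.0, 120.0]

one = operation_latency(latencies, 1)      # first response wins
quorum = operation_latency(latencies, 2)   # second-fastest response
all_ = operation_latency(latencies, 3)     # held back by the slowest replica
```

In this example a quorum operation is unaffected by the overloaded replica, while ALL inherits its full delay.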

Stale reads at quorum. Even at QUORUM consistency, a read that overlaps an in-flight write (one that has not yet reached its write quorum) may return the older value. Read repair (Cassandra updates stale replicas after a quorum read detects version differences) mitigates but doesn't eliminate this.
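Read repair can be sketched as "return the newest version seen, then push it to any contacted replica that was behind". The example below is a simplified model: it assumes each value carries an integer version, and the function name and data layout are hypothetical rather than Cassandra's actual internals.

```python
from typing import Dict, List, Tuple

Versioned = Tuple[int, str]  # (version or timestamp, value)


def quorum_read(replicas: Dict[str, Versioned], contacted: List[str]) -> str:
    """Return the newest value among the contacted replicas and repair stale ones."""
    newest = max((replicas[name] for name in contacted), key=lambda v: v[0])
    for name in contacted:
        if replicas[name][0] < newest[0]:
            replicas[name] = newest  # read repair: bring the stale replica up to date
    return newest[1]


# A holds the latest write; B and C have not received it yet.
replicas = {"A": (2, "new"), "B": (1, "old"), "C": (1, "old")}
result = quorum_read(replicas, contacted=["A", "B"])
```

Only the replicas that took part in the read are repaired here, so C stays stale until a later read or a background process reaches it.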

The one thing to remember

Quorum consistency is guaranteed when R + W > N: the read quorum and write quorum overlap, ensuring every read includes at least one node that participated in the latest write. This is the mathematical basis for tunable consistency in distributed databases: increase W and R for stronger consistency (at the cost of availability); decrease them for higher availability (at the cost of possible stale reads). The right quorum settings depend entirely on the consistency requirements and availability targets of the specific operation.

← Previous: Consensus Algorithms — the broader category of algorithms (Paxos, Raft, Zab) that allow distributed nodes to agree on a value despite failures.

→ Next: Paxos — the foundational consensus algorithm, notoriously difficult to understand but the basis of many production systems.

Consensus Algorithms: Agreeing on a Value Across Failures

Cloud Tuned — Wed, 26 Feb 2025 12:00:00 GMT

Series: System Design · Distributed Systems — Pillar 8 of 8

The problem

Distributed systems need to agree on things. Which value should a configuration key have? Which node is the current leader? In what order should a sequence of writes be applied? What is the committed state of a transaction?

In a single-machine system, "agree" means "whatever the program says." In a distributed system with multiple nodes that can fail, agreeing becomes a formal problem: how do N nodes reach agreement on a single value, given that some nodes may crash, messages may be delayed, and no node can know for certain what the others have decided?

This is the consensus problem, and it has a family of solutions — Paxos, Raft, Zab, Multi-Paxos — each trading clarity, performance, and operational complexity differently.

The core idea

A consensus algorithm allows a group of nodes to agree on a value such that: no two nodes decide on different values (agreement), the decided value was proposed by some node (validity), no node decides more than once (integrity), and every non-faulty node eventually decides (termination). This must hold even if up to ⌊(N-1)/2⌋ nodes fail.

The analogy: a committee voting on a resolution

A committee must pass a resolution. Rules: a resolution passes only if a majority votes for it; no one can change their vote once cast; any committee member who doesn't respond is assumed absent (failed). The chairperson (proposer) proposes a resolution, collects votes, and declares it passed if a majority agree.

The difficulty: some members may receive the proposal late, vote on a different proposal, or fail mid-vote. The consensus algorithm is the set of rules that guarantees the committee eventually reaches a decision that a majority agreed to, despite these complications.

What consensus algorithms provide

The replicated state machine

The most important application of consensus: building a replicated state machine (RSM). An RSM is a service that:

Maintains state (a key-value store, a configuration registry, a transaction log)
Receives commands from clients
Applies commands in a consistent order on all replicas
Provides strongly consistent reads and writes despite node failures

The consensus algorithm is how all replicas agree on the same order of commands. If replica A applies writes as [w1, w2, w3] and replica B applies them as [w2, w1, w3], they end up with different states. Consensus ensures all replicas apply writes in the same order.
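The divergence is easy to reproduce with a toy key-value state machine. In this sketch the commands are simple "set key to value" operations, and the names are chosen only for illustration.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, str]  # (key, value) for a "set" operation


def apply_log(log: List[Command]) -> Dict[str, str]:
    """Apply commands in order to an empty key-value store and return the result."""
    state: Dict[str, str] = {}
    for key, value in log:
        state[key] = value
    return state


w1, w2, w3 = ("x", "1"), ("x", "2"), ("y", "3")

replica_a = apply_log([w1, w2, w3])  # x ends as "2"
replica_b = apply_log([w2, w1, w3])  # x ends as "1"
diverged = replica_a != replica_b
```

Both replicas saw exactly the same three writes, yet they disagree on `x`. Agreeing on the log order, not just the log contents, is what keeps replicas identical.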

Used by: etcd (Kubernetes configuration), ZooKeeper (Kafka coordination), CockroachDB, TiKV, Google Spanner's Paxos groups.

Properties

Safety: all nodes decide the same value. A consensus algorithm that allows different nodes to decide different values is broken. Safety is unconditional — it must hold even under network partitions.

Liveness: all non-faulty nodes eventually decide. A consensus algorithm that blocks forever is useless. Liveness requires a functioning majority.

The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proves that in a purely asynchronous system, no deterministic algorithm can guarantee that consensus terminates if even one node may crash. This is not a limitation of algorithm design; it's a mathematical proof. In practice, consensus algorithms work around it by assuming partial synchrony (message delays are eventually bounded, so timeouts eventually work) and by using randomisation or a stable leader to break deadlocks between competing proposers.

The consensus algorithm family

Single-Decree Paxos

The original Paxos (Lamport, 1989) solves consensus for a single value: how do nodes agree on one value? It runs in two phases: prepare (a proposer asks acceptors to promise not to accept older proposals) and accept (the proposer sends the value; acceptors accept if they haven't promised a newer proposal).

Single-Decree Paxos is correct but impractical for logs — running a new Paxos round for every log entry is expensive.

Multi-Paxos

Multi-Paxos extends single-decree Paxos to agree on a sequence of values (a log). A stable leader eliminates the prepare phase for subsequent entries: the leader appends log entries directly, running only the accept phase. When the leader fails, a new leader runs the prepare phase again to establish its authority, then continues appending.

Multi-Paxos is used by Google Chubby and Google Spanner, and it is the theoretical basis for most production consensus implementations.

Raft

Raft (Ongaro and Ousterhout, 2014) was explicitly designed to be understandable — Paxos is notorious for being difficult to reason about. Raft decomposes consensus into leader election, log replication, and safety. Covered in depth in post 08.

Zab (ZooKeeper Atomic Broadcast)

Zab is ZooKeeper's consensus protocol. Like Raft, it is leader-based. It provides total order broadcast: all processes deliver the same messages in the same order. It is used exclusively by ZooKeeper.

When is consensus needed vs not needed?

Consensus is needed for:

Leader election (exactly one leader must be agreed upon)
Distributed configuration (all nodes must agree on the current config)
Ordered log replication (replicated databases, Kafka controller)
Distributed transactions (commit or abort must be agreed upon)

Consensus is NOT needed for:

Eventual consistency workloads (Cassandra, DynamoDB) — these accept different nodes seeing different values temporarily
Best-effort coordination (Gossip protocol membership)
Read-heavy workloads with relaxed consistency requirements

Consensus algorithms are expensive — they require multiple round trips between nodes and at least a majority of nodes to be available and reachable. They should be applied precisely where their guarantees are required, not everywhere.

Tradeoffs

Throughput vs latency vs fault tolerance. A consensus algorithm requires at least one round trip (often two) for each committed value. With a 5ms cross-region RTT, each consensus round takes at least 5ms; with a 1ms intra-datacenter RTT, at least 1ms. This is the fundamental latency floor of consensus-based systems.

Fault tolerance requires more nodes. Tolerating f failures requires 2f+1 nodes. Tolerating 2 failures requires 5 nodes. More nodes means more messages per consensus round and higher coordination overhead.
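The 2f+1 rule and its inverse are one-liners, and tabulating them shows why clusters usually have an odd number of nodes. The function names below are chosen for this example.

```python
def nodes_needed(f: int) -> int:
    """Smallest cluster that still has a majority after f crash failures."""
    return 2 * f + 1


def failures_tolerated(n: int) -> int:
    """Crash failures an n-node cluster can survive while keeping a majority."""
    return (n - 1) // 2


def quorum_size(n: int) -> int:
    return n // 2 + 1


for n in range(1, 8):
    print(f"nodes={n} quorum={quorum_size(n)} tolerates={failures_tolerated(n)}")
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 without tolerating any extra failure, so the fourth node adds coordination cost with no gain in fault tolerance.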

Leader bottleneck. Leader-based consensus (Raft, Multi-Paxos) routes all writes through one node. Under high write load, the leader becomes the bottleneck. Systems like CockroachDB mitigate this by running multiple independent Raft groups (one per range), each with its own leader.

The one thing to remember

Consensus algorithms allow distributed nodes to agree on a value, or a sequence of values, despite a minority of failures. The key guarantee is that all non-faulty nodes decide the same thing (safety), and that a decision is eventually reached (liveness) as long as a majority is available. Raft is the most important consensus algorithm to understand in practice: it powers etcd, CockroachDB, TiKV, and many other modern distributed systems. Consensus is expensive, so use it where strong consistency is genuinely required, and accept eventual consistency everywhere else.

← Previous: Leader Election — when a leader fails, the cluster must agree on a new one; this post covers how that agreement is reached safely.

→ Next: Quorum — the minimum vote count that makes consensus safe, and how choosing the right quorum size balances consistency against availability.