Skip to main content

Command Palette

Search for a command to run...

Distributed Systems: Wrap-Up

Updated
13 min readView as Markdown
Distributed Systems: Wrap-Up

Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design

# Post What it covers
00 Distributed Systems: What Happens When Machines Disagree Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar.
01 Network Partitions: The Failure Mode You Can't Design Away Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice.
02 Split-Brain: When Two Nodes Both Think They're the Leader Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it.
03 Heartbeats: How Nodes Know Their Peers Are Alive Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work.
04 Leader Election: Agreeing on Who's in Charge Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve.
05 Consensus Algorithms: Agreeing on a Value Across Failures Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used.
06 Quorum: How Many Nodes Must Agree? Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB.
07 Paxos: The Algorithm That Started It All Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production.
08 Raft: Consensus Made Understandable Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV.
09 Gossip Protocol: Decentralised Cluster Communication Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production.
10 Logical Clocks: When Physical Time Isn't Enough Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works.
11 Lamport Timestamps: Ordering Events Without a Global Clock Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you.
12 Vector Clocks: Knowing When Events Are Truly Concurrent Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations.
13 Distributed Transactions: When One Machine Isn't Enough Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative.
14 Two-Phase Commit: Coordinating a Distributed Decision 2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems.
15 Three-Phase Commit: Solving 2PC's Blocking Problem 3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production.
16 Delivery Semantics: What Does "Delivered" Actually Mean? Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate.
17 Change Data Capture: Streaming Your Database in Real Time CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it.
18 Erasure Coding: Fault Tolerance Without Full Replication Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication.
19 Merkle Trees: Efficiently Finding What's Different Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy.
20 Observability: Understanding Your System at Runtime Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together.
21 Distributed Systems: Wrap-Up ← you are here A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series.

Distributed Systems: Wrap-Up

Twenty concepts in the final pillar. This post ties them together and closes the series with a complete picture of the URL shortener's architecture across all eight pillars.


The one thing to remember from each post

Network Partitions — partitions are inevitable. Every distributed system must have an explicit policy: stay available and accept potential inconsistency (AP), or refuse requests until the partition heals (CP). CAP doesn't tell you which to choose — it tells you the choice is unavoidable.

Split-Brain — two nodes simultaneously acting as primary cause data corruption. Prevention requires fencing (STONITH): ensure the old primary cannot write before promoting a new one. Never run critical databases in a two-node configuration without a tiebreaker.

Heartbeats — nodes detect peer failures through absence of periodic signals. Too short a timeout causes false positives; too long delays recovery. Phi accrual failure detectors adapt to observed network behaviour, reducing false positives under load.

Leader Election — Raft's term-based election: increment term, campaign for votes, win with a majority, prevent old leaders from reasserting via term comparison. The election gap (old leader dead to new leader established) is the fundamental cost of leader-based coordination.

Consensus Algorithms — Paxos, Raft, Zab allow distributed nodes to agree on a value despite a minority of failures. Safety (all decide the same thing) is unconditional; liveness (a decision is eventually made) requires a functioning majority.

Quorum — R + W > N guarantees consistency: the read quorum and write quorum overlap. Increase W and R for consistency; decrease for availability. Cassandra's tunable consistency levels let you choose per operation.

Paxos — prepare (gather promises not to accept older proposals) then accept (commit the value if promises hold). The foundational consensus algorithm. Correct but hard to implement; the inspiration for Raft.

Raft — leader election + log replication + safety. A candidate must win majority votes with an up-to-date log. The leader replicates entries to followers; commits when a majority acknowledge. Term numbers prevent old leaders from reclaiming authority.

Gossip Protocol — each node randomly selects a few peers and exchanges state every second. Information propagates exponentially fast (O(log N) rounds) with no central coordinator. Used by Cassandra, Redis Cluster, and Consul for cluster membership.

Logical Clocks — physical clocks drift and can't establish causal order between events on different machines. Logical clocks track the happened-before relation instead.

Lamport Timestamps — monotonically increasing counters: increment on events, take max+1 on message receive. If A → B then ts(A) < ts(B). Can't detect concurrency.

Vector Clocks — one counter per process. Comparing two vector clocks reveals whether one happened-before the other or they're concurrent (each has a higher counter in some dimension). The basis of conflict detection in Dynamo and Riak.

Distributed Transactions — atomicity across multiple nodes is expensive and fragile. Most cross-service coordination that appears to require distributed transactions can be handled with Saga patterns instead. Use CockroachDB or Spanner when you genuinely need distributed ACID.

Two-Phase Commit — prepare (all vote, hold locks) then commit/abort (coordinator sends decision). Guarantees atomicity but blocks if coordinator fails after votes but before commit message. The foundation of XA transactions.

Three-Phase Commit — adds a pre-commit phase to allow independent commit decisions if coordinator fails. Works under crash failures; unsafe under network partitions. Rarely used in practice.

Delivery Semantics — at-most-once (fast, lossy), at-least-once (reliable, possible duplicates), exactly-once (expensive, achievable within Kafka-to-Kafka). At-least-once with idempotent consumers is the practical choice for almost everything.

Change Data Capture — read the database's WAL and stream every change as an event. The application writes once (to the database); CDC handles fan-out to Elasticsearch, data warehouses, cache invalidation, and the Outbox relay. Debezium is the standard tool.

Erasure Coding — split data into k data chunks and n−k parity chunks; any k of n reconstruct the data. 10+4 Reed-Solomon tolerates 4 node failures with 40% storage overhead versus 200% for 3× replication. Best for cold, large, infrequently accessed data.

Merkle Trees — a tree of hashes where each parent is a hash of its children. Comparing root hashes proves identical content; different roots identify diverged subtrees in O(k log n) comparisons. Used by Cassandra for anti-entropy repair, Git for change detection, and Bitcoin for lightweight transaction verification.

Observability — structured logs (what happened), metrics (how is the system behaving), and distributed traces (where did the time go). Together they make a distributed black box understandable at runtime. Not optional in microservices.


How the pillars connect

Pillar 1 (Foundations) established the theoretical limits: CAP theorem, consistency models, ACID vs BASE. Every decision in Pillars 4–8 is a concrete application of those tradeoffs.

Pillar 2 (Networking) established the transport: DNS, CDN, HTTP/TLS. The infrastructure that every distributed system rides on.

Pillar 3 (APIs) established communication patterns: REST, gRPC, WebSockets, webhooks. The interface between services and clients.

Pillar 4 (Data & Storage) established what to store and where: PostgreSQL, Cassandra, Redis, S3, Elasticsearch, vector databases. Every data decision in Pillar 8 references a system introduced in Pillar 4.

Pillar 5 (Caching) established the performance layer: cache-aside, eviction, distributed cache, stampede prevention. The reason most systems don't feel slow despite touching many data stores.

Pillar 6 (Scalability) established the infrastructure layer: load balancing, rate limiting, compression, checksums, probabilistic data structures. The plumbing between the internet and application code.

Pillar 7 (Architecture) established the system structure: how services are decomposed, how they communicate asynchronously, how resilience is built in, how data flows through pipelines. The blueprint that Pillar 8 fills in with correctness guarantees.

Pillar 8 (Distributed Systems) established what happens under failure: network partitions, split-brain, consensus, causal ordering, distributed transactions, data integrity, and observability. The foundation that makes all the other pillars work correctly under adversarial conditions.


The complete URL shortener architecture

═══════════════════════════════════════════════════════════════════
PILLAR 2: Networking
  DNS: Route 53 with geo-routing → nearest edge
  CDN: Cloudflare (Anycast, TLS termination, popular redirect caching)
═══════════════════════════════════════════════════════════════════

PILLAR 6: Scalability
  Rate limiting: Token bucket per API key (Redis counters, global)
  Load balancer: AWS ALB, least-connections algorithm
  Compression: Brotli for all text responses

PILLAR 7: Architecture (entry layer)
  BFF: Mobile BFF + Web BFF + Public API BFF
  Reverse proxy: Nginx (TLS termination, routing, compression)

═══════════════════════════════════════════════════════════════════
SERVICES
═══════════════════════════════════════════════════════════════════

Redirect Service (Go, latency-critical):
  PILLAR 5: Redis cache (consistent hashing, LFU eviction, SWR)
  PILLAR 4: PostgreSQL on cache miss (index scan on short_code)
  PILLAR 6: Bloom filter for unique visitor dedup (RedisBloom)
  PILLAR 6: HyperLogLog for unique visitor counts (Redis PFADD)
  PILLAR 8: Circuit breaker on PostgreSQL calls
  PILLAR 8: Distributed traces (OpenTelemetry → Honeycomb)

Link Service (Rails, CRUD + workflows):
  PILLAR 4: PostgreSQL (primary + 2 read replicas)
  PILLAR 7: Saga pattern for subscription upgrade workflow
  PILLAR 7: Outbox pattern (PostgreSQL outbox → Debezium CDC → Kafka)
  PILLAR 8: CDC via Debezium → link.created events to Kafka
  PILLAR 8: Circuit breakers on all external calls

Analytics Service (Python, stream + batch):
  PILLAR 4: Cassandra (click events, wide-column, TWCS compaction)
  PILLAR 4: TimescaleDB (metrics aggregates)
  PILLAR 7: CQRS: write to Cassandra, read from pre-aggregated views
  PILLAR 7: Stream processing: Kafka → Flink (real-time counters)
  PILLAR 7: Batch processing: nightly Spark job (historical aggregates)
  PILLAR 8: Gossip-based cluster membership (Cassandra ring)
  PILLAR 8: Quorum reads (LOCAL_QUORUM for dashboard queries)

User / Billing / Notification Services:
  PILLAR 4: PostgreSQL (each service owns its schema)
  PILLAR 7: Event-driven reactions to Kafka events
  PILLAR 7: Serverless (Lambda) for scheduled jobs

═══════════════════════════════════════════════════════════════════
DATA INFRASTRUCTURE
═══════════════════════════════════════════════════════════════════

PILLAR 4: Data stores
  PostgreSQL primary (EBS block storage, db.r5.8xlarge)
  PgBouncer (connection pool, transaction mode, 40 connections)
  Redis Cluster (3 primaries + 3 replicas, consistent hashing)
  Cassandra Cluster (6 nodes, gossip membership, phi accrual detection)
  InfluxDB (infrastructure metrics, time-window compaction)
  Elasticsearch (full-text search, Debezium CDC sync)
  Pinecone (semantic embeddings, HNSW ANN search)
  S3 (user uploads, exports; erasure-coded durability internally)

PILLAR 8: Distributed systems layer
  Raft consensus: etcd (Kubernetes state), CockroachDB (billing transactions)
  Gossip: Cassandra ring membership, Redis Cluster slot state
  Anti-entropy: Cassandra repair via Merkle tree comparison
  CDC: Debezium watching PostgreSQL WAL → Kafka
  Observability:
    Logs: structured JSON → Loki → Grafana
    Metrics: Prometheus → Grafana (p99 latency, error rate, saturation)
    Traces: OpenTelemetry → Honeycomb (distributed trace per request)

PILLAR 8: Failure handling
  Split-brain prevention: Patroni with EC2 instance termination fencing
  Heartbeats: 10s interval, phi accrual failure detection (Cassandra)
  Leader election: Raft (etcd), Patroni (PostgreSQL HA)
  Delivery semantics: at-least-once + idempotent consumers everywhere
  Partition policy:
    PostgreSQL: CP (Patroni requires quorum for promotion)
    Cassandra: AP tunable (LOCAL_QUORUM for reads, ONE for click writes)
    Redis: AP (last-write-wins on partition heal)

PILLAR 7 → 8 bridges:
  Saga + Outbox + CDC: subscription upgrades are saga-coordinated;
    each step uses outbox pattern for reliable event publishing;
    Debezium picks up outbox rows without polling
  Circuit breaker + Observability: every tripped circuit breaker
    emits a metric and a trace span, surfaced immediately in Grafana

What this series covered

Eight pillars. Roughly 120 concepts. One URL shortener that started as a single Rails app on a single server and ended as a distributed, multi-region, observable, fault-tolerant platform.

The series deliberately follows the same thread: every concept was introduced when the running example created a problem that concept solves. Consistent hashing appeared because the Redis cluster needed to add nodes without invalidating the cache. The Outbox pattern appeared because the Saga needed reliable event publishing. Merkle trees appeared because Cassandra needed to repair replicas that diverged during a partition.

Systems design isn't a collection of unrelated tools. It's a set of problems that naturally occur as systems grow, and a set of patterns that address each problem with known tradeoffs. The goal of this series was to build that mental model — not a checklist, but a way of reasoning about what a system needs and why.


Thank you for reading

This is the final post in the series. If you found it useful, the best thing you can do is share the specific posts that helped you — the ones that clicked something into place, answered a question you'd had for a while, or gave you language for something you'd been doing instinctively.

The series starts here: Pillar 1 — Foundations


*← Previous: Observability — structured logs, metrics, and distributed tracing: the tools that make distributed systems understandable at runtime.*This is the final post in the System Design series. Start from the beginning: Pillar 1 — Foundations

Systems Design

Part 1 of 50

Understanding these system design concepts is essential for architects, developers, and engineers to create scalable, reliable, and maintainable software systems that meet the needs of businesses.