Series: System Design · Scalability & Infrastructure — Pillar 6 of 8

Systems Design

#	Post	What it covers
00	Scalability & Infrastructure: The Layer Between Your Code and the Internet	Nine concepts covering load balancing, rate limiting, proxies, compression, and probabilistic data structures that keep large systems fast and reliable.
01	Client-Server Architecture: The Model Everything Else Builds On	Client-server is the foundational model for distributed systems. Learn what clients and servers know, where state lives, and how the model scales.
02	Load Balancing: Distributing Traffic Across Servers	Load balancers distribute traffic across servers for scale and availability. Learn how they work, what types exist, and what they require of backend servers.
03	Load Balancing Algorithms: How Traffic Is Distributed	Round robin, least connections, IP hash, weighted — each algorithm makes different tradeoffs. Learn how to choose the right one for your workload.
04	Rate Limiting: Protecting Services from Overload	Rate limiting protects services from overload and abuse. Learn how token bucket, leaky bucket, and sliding window algorithms work and when to use each.
05	Proxy vs Reverse Proxy: Which Way Does It Face?	Forward proxies protect clients; reverse proxies protect servers. Learn how each works, what Nginx and Cloudflare do, and when you need which.
06	Data Compression: Smaller, Faster, Cheaper	Compression reduces bandwidth and storage costs. Learn how Gzip, Brotli, LZ4, and zstd work, where to apply them, and the CPU tradeoffs involved.
07	Checksums: Detecting Corruption Before It Becomes a Catastrophe ← you are here	Checksums detect silent data corruption in transit and storage. Learn how CRC32, MD5, and SHA-256 work and where to apply them in distributed systems.
08	Bloom Filters: Answering "Have I Seen This?" Without Storing Everything	A Bloom filter answers "have I seen this?" in constant memory. Learn how they work, why false positives are acceptable, and where they're used in production.
09	HyperLogLog: Counting Distinct Items Without Storing Them	HyperLogLog counts distinct values in ~1.5 KB of memory with <2% error. Learn how it works and why Redis, BigQuery, and Postgres use it.
10	Scalability & Infrastructure: Wrap-Up	A recap of all 9 scalability concepts: load balancing, rate limiting, proxies, compression, checksums, Bloom filters, and HyperLogLog. How they fit together.

Checksums: Detecting Corruption Before It Becomes a Catastrophe

The problem

Your URL shortener stores user-uploaded images in S3 and logs every click event to Cassandra. Both systems handle enormous volumes of data reliably. But storage hardware has a defect rate. Bits in network packets get flipped by electrical interference. Memory has occasional single-bit errors. Software has bugs.

Most of these errors are silent. A bit flips in a stored image — the file is still readable, it just shows a corrupted pixel. A Cassandra node writes an event log entry with a flipped byte — the entry looks valid, it's just wrong. An S3 object gets corrupted between the time it was uploaded and the time it's downloaded three months later — neither the uploader nor the downloader knows.

Silent data corruption is real. It's rare on modern hardware — but at scale, rare means regularly. A system handling billions of operations per day will encounter these errors. Without a mechanism to detect them, you serve corrupted data silently, store corrupted results, and lose trust when the corruption eventually surfaces.

Checksums are the mechanism that makes corruption visible.

The core idea

A checksum is a short fixed-length value derived from a block of data using a mathematical function. Computing the same function on the original and received data and comparing the results reveals whether the data has changed — whether through corruption, transmission errors, or tampering.

The analogy: a bank routing number check digit

The routing number on a US bank cheque includes a check digit — the last digit is calculated from the other eight digits using a weighted sum formula. If any single digit is entered incorrectly — a transposition error, a misread digit — the check digit will be wrong. The receiving system catches the error before processing a transaction to the wrong account.

The check digit is a one-digit "checksum." It can detect many common errors (any single-digit change, most transposition errors). It can't detect all errors — but it detects enough of the most common ones to be worth including.
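The weighted sum can be sketched in a few lines. This is a minimal illustration, assuming the commonly documented 3, 7, 1 weighting for nine-digit US routing numbers; the function name and the sample numbers are chosen here for demonstration, and the first sample is simply a number that satisfies the formula.

```python
def routing_number_is_valid(routing_number: str) -> bool:
    """Return True if a nine-digit number passes the weighted-sum check."""
    if len(routing_number) != 9:
        return False
    if not (routing_number.isascii() and routing_number.isdigit()):
        return False
    # Assumed weighting: 3, 7, 1 repeated across the nine digits.
    weights = (3, 7, 1, 3, 7, 1, 3, 7, 1)
    total = sum(w * int(d) for w, d in zip(weights, routing_number))
    # The ninth digit is chosen so the weighted total is a multiple of 10.
    return total % 10 == 0


print(routing_number_is_valid("021000021"))  # True: weighted sum is 30
print(routing_number_is_valid("021000022"))  # False: last digit mistyped
print(routing_number_is_valid("012000021"))  # False: two adjacent digits swapped
```

The different weights on neighbouring positions are what let the check catch most adjacent transpositions, which a plain unweighted digit sum would miss.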

Database checksums, network protocol checksums, and file integrity checksums are the same idea at larger scale: compute a short "fingerprint" of the data, store it alongside the data, and verify it on retrieval.

How checksums work

The basic flow

Sender:
  data = "sho.rt/x7Kp2 → https://example.com/..."
  checksum = CRC32(data)  # e.g., 0x4A3B9C12
  Send: (data, checksum)

Receiver:
  Compute: CRC32(received_data)
  Compare: computed_checksum == received_checksum
  Match → data intact
  Mismatch → data corrupted, reject / request retransmit
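
The flow above can be tried out with Python's standard zlib module. This is a sketch rather than a real protocol: the send and receive function names are hypothetical, and the payload is a shortened stand-in for the mapping shown above.

```python
import zlib


def send(payload: bytes) -> tuple[bytes, int]:
    """Sender side: attach a CRC32 of the payload."""
    return payload, zlib.crc32(payload)


def receive(payload: bytes, checksum: int) -> bool:
    """Receiver side: recompute the CRC32 and compare with the one received."""
    return zlib.crc32(payload) == checksum


payload, checksum = send(b"sho.rt/x7Kp2 -> https://example.com/")
print(receive(payload, checksum))  # True: data intact

# Simulate corruption in transit by flipping a single bit.
corrupted = bytearray(payload)
corrupted[0] ^= 0x01
print(receive(bytes(corrupted), checksum))  # False: reject or request retransmit
```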

CRC32

CRC (Cyclic Redundancy Check) is designed for fast computation in hardware and software. It produces a 32-bit value (4 bytes) for any input length.

Properties:

Detects all single-bit errors
Detects all burst errors up to 32 bits
Very fast — hardware acceleration on modern CPUs
Not cryptographically secure (an attacker can forge a CRC)

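Use cases: Ethernet frames, file format integrity (ZIP and PNG embed CRC32), database page checksums, and storage (drives protect each sector with error-detecting and error-correcting codes that are checked on every read). TCP and IP headers use a simpler 16-bit ones' complement sum rather than a CRC.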
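
To make the "cyclic redundancy" part less abstract, here is a deliberately slow bit-by-bit sketch of the CRC-32 variant used by zlib, PNG, and Ethernet. The function name is hypothetical; production code uses table-driven or hardware implementations, and the result is checked against zlib.crc32 only to show that the two agree.

```python
import zlib

# Reflected form of the standard CRC-32 polynomial.
POLY = 0xEDB88320


def crc32_bitwise(data: bytes) -> int:
    """Compute CRC-32 one bit at a time (slow, for illustration only)."""
    crc = 0xFFFFFFFF  # initial value: all ones
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; if the bit shifted out was 1, fold in the polynomial.
            if crc & 1:
                crc = (crc >> 1) ^ POLY
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF  # final inversion


message = b"123456789"
print(hex(crc32_bitwise(message)))  # 0xcbf43926, the standard check value
print(crc32_bitwise(message) == zlib.crc32(message))  # True
```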

MD5

MD5 produces a 128-bit (16-byte) hash. Widely used for file integrity verification and data deduplication.

Properties:

Extremely low collision probability for random data (two different files rarely produce the same MD5)
Fast — a few hundred MB/s on modern hardware
Not cryptographically secure for security purposes — collisions can be engineered, so MD5 should not be used for authentication or digital signatures
Fine for integrity detection when adversarial forgery is not a concern

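Use cases: S3 exposes an ETag for every object. For objects uploaded in a single PUT without SSE-KMS or SSE-C encryption, that ETag is the MD5 of the object bytes, so a download can be verified by comparing the MD5 of the downloaded bytes with the ETag. Multipart uploads produce an ETag that is not a plain MD5 of the object. Database replication tools such as pt-table-checksum hash table chunks (CRC32 by default, with MD5 available as an option) to verify that replica tables match the primary.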

import hashlib

# Compute MD5 checksum of a file
md5 = hashlib.md5()
with open("file.bin", "rb") as f:
    for chunk in iter(lambda: f.read(65536), b""):
        md5.update(chunk)
checksum = md5.hexdigest()
# A 32-character hex string; for an empty file it is
# "d41d8cd98f00b204e9800998ecf8427e"

SHA-256

SHA-256 produces a 256-bit (32-byte) hash. More expensive than MD5 but cryptographically secure — it's computationally infeasible to find two different inputs that produce the same SHA-256 output.

Properties:

Collision-resistant (for security purposes)
Slower than MD5 in pure software (very roughly ~200MB/s vs ~400MB/s; real figures vary widely by CPU, and hardware acceleration narrows the gap)
Large enough output to treat collisions as impossible for practical purposes

Use cases: Git identifies commits and objects by hash: historically SHA-1, with SHA-256 supported as a newer repository object format. TLS certificates are commonly signed with SHA-256-based signature algorithms. File distribution systems publish SHA-256 digests for tamper detection where adversarial content manipulation is possible.
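
A typical file-distribution check looks roughly like the sketch below: hash the downloaded file in chunks and compare it with the digest the publisher lists. The function names and the 64 KiB chunk size are illustrative choices, and the published digest is assumed to have been obtained over a channel you trust.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA-256 with a published hex digest."""
    actual = sha256_of_file(path)
    # compare_digest runs in constant time; harmless here, and a good habit.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

If the digest is downloaded from the same server as the file, an attacker who can replace one can replace the other; that gap is what digital signatures close.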

Adler-32

A simpler alternative to CRC32 that is faster to compute in software. Produces a 32-bit value. Less reliable than CRC32 at detecting errors, particularly in short inputs (more false negatives). Used by the zlib stream format; gzip, which shares the same DEFLATE compression, uses CRC32 in its trailer instead.
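
Adler-32 is simple enough to write out by hand, which also shows where its weakness comes from. The sketch below uses a hypothetical function name and is compared against zlib.adler32 only as a sanity check.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def adler32(data: bytes) -> int:
    """Adler-32: two running sums packed into one 32-bit value."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER  # sum of all bytes, plus one
        b = (b + a) % MOD_ADLER     # sum of the successive values of a
    return (b << 16) | a


print(hex(adler32(b"Wikipedia")))  # 0x11e60398
print(adler32(b"Wikipedia") == zlib.adler32(b"Wikipedia"))  # True
```

For short inputs the two sums stay small and never wrap around the modulus, so only a fraction of the 32-bit output space is reachable. That is one reason Adler-32 has more false negatives on short data.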

Where checksums live in practice

TCP/IP networking: every TCP segment has a 16-bit checksum, and every Ethernet frame has a CRC32. These catch bit errors in transit. They're computed and verified by the operating system's network stack and the network hardware, so they're invisible to applications.

Storage systems: modern hard drives and SSDs protect every sector with error-correcting codes that are checked on every read. If the stored code doesn't match the sector's contents, the drive attempts recovery or reports a read error. PostgreSQL verifies page checksums when reading from disk if data checksums are enabled (initdb --data-checksums at cluster initialisation; the read-only data_checksums setting reports whether they are on).

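S3 / object storage: S3 returns an ETag for every object; for single-part uploads without SSE-KMS or SSE-C encryption, it is the MD5 of the object. When uploading, clients can send Content-MD5 so that S3 can verify the upload arrived intact. When downloading, clients can compare the response ETag with a locally computed MD5 in the cases where the ETag is an MD5.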

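import base64
import hashlib

import boto3

s3_client = boto3.client("s3")
data = b"id,clicks\n1,42\n"  # placeholder payload for illustration

# Upload with integrity verification
s3_client.put_object(
    Bucket="my-bucket",
    Key="exports/2025-06-01.csv",
    Body=data,
    # Content-MD5 is the base64-encoded raw digest, not the hex string
    ContentMD5=base64.b64encode(hashlib.md5(data).digest()).decode(),
)
# S3 rejects the upload if the bytes it received don't match the MD5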

Kafka: record batches carry a CRC checksum (CRC32C in the current message format). Brokers validate it before appending messages to their log, and consumers can verify it on receipt.

Database replication: Percona Toolkit's pt-table-checksum computes checksums of database tables in chunks and compares primary vs replica checksums to detect replication divergence, where a replica silently holds different data from the primary.

Distributed systems: Merkle trees (used in Cassandra, Dynamo, Bitcoin) build a tree of checksums where each parent is a hash of its children. Comparing the root hash between two nodes reveals whether their data is identical without comparing every row.
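
A toy Merkle root makes the comparison concrete. This is a sketch, not how Cassandra or Bitcoin actually lay out their trees: the function names are hypothetical, SHA-256 is an arbitrary choice of hash, and duplicating the last node on odd-sized levels is just one common convention.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(rows: list[bytes]) -> bytes:
    """Hash each row, then hash pairs upward until one root remains."""
    if not rows:
        return _h(b"")
    level = [_h(row) for row in rows]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the odd node out
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


replica_a = [b"x7Kp2,click,1", b"x7Kp2,click,2", b"q9Zt1,click,1"]
replica_b = [b"x7Kp2,click,1", b"x7Kp2,click,2", b"q9Zt1,click,1"]
replica_c = [b"x7Kp2,click,1", b"x7Kp2,click,9", b"q9Zt1,click,1"]

print(merkle_root(replica_a) == merkle_root(replica_b))  # True: in sync
print(merkle_root(replica_a) == merkle_root(replica_c))  # False: diverged
```

When the roots differ, the two nodes can compare child hashes level by level and descend only into the subtrees that disagree, so the amount of data exchanged scales with the size of the difference rather than the size of the dataset.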

Checksums vs hashes vs signatures

	Checksum	Cryptographic hash	Digital signature
Purpose	Error detection	Integrity + identity	Integrity + authentication
Examples	CRC32, Adler-32	MD5, SHA-256	RSA-SHA256, ECDSA
Collision resistance	Low	High for SHA-256; broken for MD5	High
Adversarial forgery	Possible	Hard (SHA-256)	Infeasible
Speed	Very fast	Fast	Slow
Use when	Hardware/network error detection	File integrity, deduplication	Tamper detection by adversaries

Tradeoffs

False negatives: no checksum guarantees perfect detection. A CRC32 misses ~1 in 4 billion random corruptions. SHA-256 misses essentially none — but costs more CPU. Choose the algorithm based on the threat model.
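
The "1 in 4 billion" figure is just 2^-32. As a back-of-the-envelope sketch, assuming corruption is random enough that a corrupted block is equally likely to land on any checksum value, the expected number of undetected corruptions is the number of corrupted blocks divided by 2 raised to the checksum width. The function name is hypothetical.

```python
def expected_undetected(corrupted_blocks: float, checksum_bits: int) -> float:
    """Expected corrupted blocks that still pass a checksum of the given width."""
    return corrupted_blocks / 2 ** checksum_bits


# One billion corrupted blocks checked with a 32-bit checksum:
print(expected_undetected(1e9, 32))   # roughly 0.23
# The same corruption checked with a 256-bit hash:
print(expected_undetected(1e9, 256))  # vanishingly small
```

Real corruption is not always random (CRCs are specifically stronger than this for short bursts), so treat the number as an order-of-magnitude guide.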

Storage overhead: CRC32 adds 4 bytes per block; SHA-256 adds 32 bytes. For individual large files, this is negligible. For databases storing millions of rows, per-row checksums add up.

Computation cost: for high-throughput systems (millions of messages per second), CRC computation at every step has measurable overhead. Hardware acceleration (the SSE4.2 crc32 instruction on x86, which computes the CRC32C variant) makes CRC32C cheap enough that it is rarely the bottleneck.

Detect, don't correct: checksums detect corruption but don't fix it. Correction requires either redundant data (erasure coding, covered in Pillar 8) or retransmission (NACK the corrupted packet and request resend).
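
The retransmission path can be sketched as a retry loop around a checksum check. Everything here is illustrative: the framing (payload followed by a 4-byte CRC32 trailer), the function names, and the channel callable that stands in for a network hop are all assumptions made for the example.

```python
import zlib
from typing import Callable, Optional


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 trailer to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(data: bytes) -> Optional[bytes]:
    """Return the payload if the trailer matches, otherwise None."""
    if len(data) < 4:
        return None
    payload, trailer = data[:-4], data[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None
    return payload


def send_with_retry(
    payload: bytes,
    channel: Callable[[bytes], bytes],
    max_attempts: int = 3,
) -> bytes:
    """Resend until the receiver's checksum check passes or attempts run out."""
    for _ in range(max_attempts):
        received = unframe(channel(frame(payload)))
        if received is not None:
            return received
    raise IOError("payload still corrupted after retries")
```

The checksum only tells the receiver that something is wrong; the recovery comes from the sender still having a good copy to send again.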

The one thing to remember

A checksum is a compact fingerprint of data that reveals if the data has changed. Corruption doesn't announce itself — it silently produces wrong results until something downstream fails in a confusing way. Checksums make corruption loud and immediate: compute the fingerprint before storage or transmission, recompute on retrieval, compare. Use CRC32 for speed-critical paths (network packets, disk sectors); use MD5 for file integrity where adversarial forgery isn't a concern; use SHA-256 when you need both integrity and tamper detection.

← Previous: Data Compression — reducing payload size in transit and at rest; the compression algorithm choice involves tradeoffs between speed, ratio, and CPU cost.

→ Next: Bloom Filter — a probabilistic data structure that answers "have I seen this?" in constant memory, with no false negatives.
