Checksums: Detecting Corruption Before It Becomes a Catastrophe

Series: System Design · Scalability & Infrastructure — Pillar 6 of 8
| # | Post | What it covers |
|---|---|---|
| 00 | Scalability & Infrastructure: The Layer Between Your Code and the Internet | Nine concepts covering load balancing, rate limiting, proxies, compression, and probabilistic data structures that keep large systems fast and reliable. |
| 01 | Client-Server Architecture: The Model Everything Else Builds On | Client-server is the foundational model for distributed systems. Learn what clients and servers know, where state lives, and how the model scales. |
| 02 | Load Balancing: Distributing Traffic Across Servers | Load balancers distribute traffic across servers for scale and availability. Learn how they work, what types exist, and what they require of backend servers. |
| 03 | Load Balancing Algorithms: How Traffic Is Distributed | Round robin, least connections, IP hash, weighted — each algorithm makes different tradeoffs. Learn how to choose the right one for your workload. |
| 04 | Rate Limiting: Protecting Services from Overload | Rate limiting protects services from overload and abuse. Learn how token bucket, leaky bucket, and sliding window algorithms work and when to use each. |
| 05 | Proxy vs Reverse Proxy: Which Way Does It Face? | Forward proxies protect clients; reverse proxies protect servers. Learn how each works, what Nginx and Cloudflare do, and when you need which. |
| 06 | Data Compression: Smaller, Faster, Cheaper | Compression reduces bandwidth and storage costs. Learn how Gzip, Brotli, LZ4, and zstd work, where to apply them, and the CPU tradeoffs involved. |
| 07 | Checksums: Detecting Corruption Before It Becomes a Catastrophe ← you are here | Checksums detect silent data corruption in transit and storage. Learn how CRC32, MD5, and SHA-256 work and where to apply them in distributed systems. |
| 08 | Bloom Filters: Answering "Have I Seen This?" Without Storing Everything | A Bloom filter answers "have I seen this?" in constant memory. Learn how they work, why false positives are acceptable, and where they're used in production. |
| 09 | HyperLogLog: Counting Distinct Items Without Storing Them | HyperLogLog counts distinct values in ~1.5 KB of memory with <2% error. Learn how it works and why Redis, BigQuery, and Postgres use it. |
| 10 | Scalability & Infrastructure: Wrap-Up | A recap of all 9 scalability concepts: load balancing, rate limiting, proxies, compression, checksums, Bloom filters, and HyperLogLog. How they fit together. |
The problem
Your URL shortener stores user-uploaded images in S3 and logs every click event to Cassandra. Both systems handle enormous volumes of data reliably. But storage hardware has a defect rate. Bits in network packets get flipped by electrical interference. Memory has occasional single-bit errors. Software has bugs.
Most of these errors are silent. A bit flips in a stored image — the file is still readable, it just shows a corrupted pixel. A Cassandra node writes an event log entry with a flipped byte — the entry looks valid, it's just wrong. An S3 object gets corrupted between the time it was uploaded and the time it's downloaded three months later — neither the uploader nor the downloader knows.
Silent data corruption is real. It's rare on modern hardware — but at scale, rare means regularly. A system handling billions of operations per day will encounter these errors. Without a mechanism to detect them, you serve corrupted data silently, store corrupted results, and lose trust when the corruption eventually surfaces.
Checksums are the mechanism that makes corruption visible.
The core idea
A checksum is a short fixed-length value derived from a block of data using a mathematical function. Computing the same function on the original and received data and comparing the results reveals whether the data has changed — whether through corruption, transmission errors, or tampering.
The analogy: a bank routing number check digit
The routing number on a US bank cheque is nine digits long, and the ninth is a check digit calculated from the other eight using a weighted-sum formula. If any single digit is misread or mistyped, the check fails, and the receiving system catches the error before routing a transaction to the wrong bank.
The check digit is a one-digit "checksum." It can detect many common errors (any single-digit change, most transposition errors). It can't detect all errors — but it detects enough of the most common ones to be worth including.
Database checksums, network protocol checksums, and file integrity checksums are the same idea at larger scale: compute a short "fingerprint" of the data, store it alongside the data, and verify it on retrieval.
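The routing-number check described above can be sketched in a few lines. This is a minimal illustration, assuming the standard 3-7-1 weighting used for nine-digit US routing numbers; the function name `aba_checksum_ok` is made up for this example.

```python
def aba_checksum_ok(routing_number: str) -> bool:
    """Return True if a 9-digit routing number passes the 3-7-1 weighted check."""
    if len(routing_number) != 9:
        return False
    if not (routing_number.isascii() and routing_number.isdigit()):
        return False
    # Weights repeat 3, 7, 1 across the nine positions.
    weights = (3, 7, 1) * 3
    total = sum(w * int(d) for w, d in zip(weights, routing_number))
    # A well-formed number makes the weighted sum a multiple of 10.
    return total % 10 == 0


print(aba_checksum_ok("021000021"))  # weighted sum is 30, so this passes
print(aba_checksum_ok("021000031"))  # one digit changed, sum is 37, so this fails
```

Because each weight is coprime to 10, changing any one digit always changes the sum modulo 10. Swapping two adjacent digits is caught unless the digits differ by exactly 5, which is why the check catches most transpositions rather than all of them.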
How checksums work
The basic flow
Sender:
    data = "sho.rt/x7Kp2 → https://example.com/..."
    checksum = CRC32(data) # e.g., 0x4A3B9C12
    Send: (data, checksum)
Receiver:
    Compute: CRC32(received_data)
    Compare: computed_checksum == received_checksum
    Match → data intact
    Mismatch → data corrupted, reject / request retransmit
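The flow above maps directly onto Python's standard library. The sketch below is one possible framing, not a real wire protocol: the helper names `frame` and `unframe` and the choice to append the checksum as four big-endian bytes are assumptions made for illustration.

```python
import struct
import zlib


def frame(data: bytes) -> bytes:
    """Sender side: append a 4-byte big-endian CRC32 to the payload."""
    return data + struct.pack(">I", zlib.crc32(data))


def unframe(message: bytes) -> bytes:
    """Receiver side: return the payload if its CRC32 matches, else raise."""
    if len(message) < 4:
        raise ValueError("message too short to contain a checksum")
    data = message[:-4]
    received = struct.unpack(">I", message[-4:])[0]
    if zlib.crc32(data) != received:
        raise ValueError("checksum mismatch: reject or request retransmit")
    return data


sent = frame(b"sho.rt/x7Kp2 -> https://example.com/")
print(unframe(sent))  # payload comes back unchanged

# Flip one bit "in transit"; the receiver notices.
corrupted = bytes([sent[0] ^ 0x01]) + sent[1:]
try:
    unframe(corrupted)
except ValueError as err:
    print(err)
```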
CRC32
A CRC (cyclic redundancy check) is designed for fast computation in hardware and software. CRC32 produces a 32-bit (4-byte) value for input of any length.
Properties:
Detects all single-bit errors
Detects all burst errors up to 32 bits
Very fast — hardware acceleration on modern CPUs
Not cryptographically secure (an attacker can forge a CRC)
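Use cases: Ethernet frames (the frame check sequence is a CRC32), file formats (ZIP, PNG, and gzip embed a CRC32), and storage engines that checksum pages or blocks, often with the CRC32C variant. TCP and IP headers carry a simpler 16-bit ones'-complement checksum rather than a CRC, and disk drives protect sectors with error-correcting codes rather than a plain CRC32.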
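Two of the properties listed above can be checked directly with `zlib.crc32`. The sketch below uses an arbitrary example message: it first flips every bit in turn to confirm that no single-bit error goes unnoticed, then shows why a CRC offers no defence against forgery. The helper `xor_bytes` is defined here only for the demonstration.

```python
import zlib

message = b"click:x7Kp2"
original = zlib.crc32(message)

# Flip every bit in turn; each single-bit change must alter the CRC.
undetected = 0
for i in range(len(message) * 8):
    flipped = bytearray(message)
    flipped[i // 8] ^= 1 << (i % 8)
    if zlib.crc32(bytes(flipped)) == original:
        undetected += 1
print(undetected)  # 0: every single-bit error was caught


def xor_bytes(x: bytes, y: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))


# CRC32 is affine over XOR for equal-length inputs: the CRC of a XOR b XOR c
# equals the XOR of the three CRCs. An attacker can therefore predict how any
# chosen bit flips change the CRC and patch the checksum to match.
a, b, c = b"AAAA", b"BBBB", b"CCCC"
combined = xor_bytes(xor_bytes(a, b), c)
assert zlib.crc32(combined) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
```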
MD5
MD5 produces a 128-bit (16-byte) hash. Widely used for file integrity verification and data deduplication.
Properties:
Extremely low collision probability for random data (two different files rarely produce the same MD5)
Fast — a few hundred MB/s on modern hardware
Not collision-resistant: collisions can be engineered, so MD5 should not be used for authentication or digital signatures
Fine for integrity detection when adversarial forgery is not a concern
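Use cases: for objects uploaded to S3 in a single PUT without SSE-KMS or SSE-C encryption, the ETag is the MD5 of the object, so a download can be verified by comparing a locally computed MD5 with the ETag. Multipart uploads produce an ETag that is not a plain MD5 of the whole object. Database replication tools such as pt-table-checksum can use MD5 to verify that replica tables match the primary (CRC32 is its default function).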
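import hashlib

# Compute the MD5 checksum of a file, reading 64 KiB at a time
# so that large files never need to fit in memory
md5 = hashlib.md5()
with open("file.bin", "rb") as f:
    for chunk in iter(lambda: f.read(65536), b""):
        md5.update(chunk)
checksum = md5.hexdigest()
# A 32-character hex string. An empty file gives
# "d41d8cd98f00b204e9800998ecf8427e"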
SHA-256
SHA-256 produces a 256-bit (32-byte) hash. More expensive than MD5 but cryptographically secure — it's computationally infeasible to find two different inputs that produce the same SHA-256 output.
Properties:
Collision-resistant (for security purposes)
Usually slower than MD5 in software (very roughly 200 MB/s vs 400 MB/s per core, depending on the CPU), though hardware SHA extensions can close or even reverse the gap
Large enough output to treat collisions as impossible for practical purposes
Use cases: Git identifies commits and objects by hash; it has historically used SHA-1 and is migrating to SHA-256. TLS certificates are typically signed using SHA-256 as the digest. File distribution systems publish SHA-256 digests for tamper detection where adversarial content manipulation is possible.
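Checking a download against a published digest is a one-liner with `hashlib`. The sketch below is a minimal illustration: the function name `matches_sha256` is made up, and the input is the well-known test string "abc" rather than a real release artifact. The published digest is only useful if it comes from a channel the attacker cannot also modify.

```python
import hashlib
import hmac


def matches_sha256(data: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-256 of data equals the published hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking where the strings first differ.
    return hmac.compare_digest(actual, expected_hex.strip().lower())


# SHA-256 of b"abc", a standard test vector.
published = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

print(matches_sha256(b"abc", published))  # True
print(matches_sha256(b"abd", published))  # False: one byte differs
```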
Adler-32
A simpler alternative to CRC32 that also produces a 32-bit value. It is less reliable than CRC32 at detecting small errors (more false negatives for short data) but cheaper to compute in software. It is used in the zlib stream format; gzip, by contrast, uses CRC32.
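Adler-32 is simple enough to write out by hand: two running sums modulo 65521, packed into one 32-bit value. The sketch below is a straightforward, unoptimised version checked against `zlib.adler32`; the sample input is arbitrary.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def adler32(data: bytes) -> int:
    """Plain-Python Adler-32: a sums the bytes, b sums the running values of a."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    return (b << 16) | a


sample = b"Wikipedia"
assert adler32(sample) == zlib.adler32(sample)
print(hex(adler32(sample)))  # 0x11e60398
```

The weakness on short inputs is visible in the arithmetic: for a 9-byte message the low half can never exceed 1 + 9 × 255, so most of its 16 bits go unused.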
Where checksums live in practice
TCP/IP networking: every TCP segment carries a 16-bit checksum, and every Ethernet frame carries a CRC32. These catch bit errors in transit. They're computed and verified by the operating system's network stack and the NIC hardware, so applications never see them.
Storage systems: hard drives and SSDs store error-correcting codes alongside each sector and check them on every read. If the contents can't be verified or repaired, the drive reports a read error. PostgreSQL verifies page checksums when reading from disk, provided data checksums were enabled (typically at cluster initialisation, with initdb --data-checksums).
S3 / object storage: when uploading, clients can send Content-MD5 so that S3 rejects the upload if the bytes it received don't match. For objects uploaded in a single PUT without SSE-KMS or SSE-C encryption, the ETag is the object's MD5, so a client can compare it with a locally computed MD5 after download. ETags of multipart uploads are not plain MD5 digests.
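import base64
import hashlib
import boto3

s3_client = boto3.client("s3")
data = b"short_code,clicks\nx7Kp2,42\n"  # placeholder payload; must be bytes

# Upload with integrity verification
s3_client.put_object(
    Bucket="my-bucket",
    Key="exports/2025-06-01.csv",
    Body=data,
    ContentMD5=base64.b64encode(hashlib.md5(data).digest()).decode(),
)
# S3 rejects the request if the received bytes don't match the MD5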
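The download side can be sketched without any AWS calls. The helper below, `etag_matches_md5`, is a hypothetical name; it assumes the object was uploaded in a single PUT without SSE-KMS or SSE-C, which is the case where the ETag is a plain MD5. ETags from multipart uploads have a "hex-partcount" shape and are reported as unverifiable rather than as a mismatch.

```python
import hashlib
from typing import Optional

HEX_DIGITS = set("0123456789abcdef")


def etag_matches_md5(body: bytes, etag: str) -> Optional[bool]:
    """Compare downloaded bytes against an S3 ETag.

    Returns True or False for a plain-MD5 ETag, and None when the ETag does
    not look like an MD5 (for example a multipart "<hex>-<parts>" ETag).
    """
    value = etag.strip('"').lower()  # S3 returns the ETag wrapped in quotes
    if len(value) != 32 or not set(value) <= HEX_DIGITS:
        return None
    return hashlib.md5(body).hexdigest() == value


# Typical use (not executed here): pass the bytes and the ETag header
# returned by a GET request for the object.
print(etag_matches_md5(b"", '"d41d8cd98f00b204e9800998ecf8427e"'))  # True
print(etag_matches_md5(b"x", '"d41d8cd98f00b204e9800998ecf8427e"'))  # False
print(etag_matches_md5(b"x", '"9b2cf535f27731c974343645a3985328-3"'))  # None
```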
Kafka: records are written in batches, and each batch carries a CRC (CRC32C in the current message format; older formats used a per-message CRC32). Brokers validate the checksum before appending to their log, and consumers verify it on receipt.
Database replication: Percona Toolkit's pt-table-checksum computes checksums of database tables in chunks and compares primary and replica checksums to detect replication divergence, where a replica silently holds different data from the primary.
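The chunked comparison can be illustrated in memory. This is not how pt-table-checksum is implemented (it runs checksum queries on the database servers); it is a toy sketch assuming both sides return rows in the same primary-key order, with made-up names `chunk_checksums` and `diverged_chunks`.

```python
import zlib
from itertools import zip_longest
from typing import List, Sequence, Tuple

Row = Tuple[int, str]


def chunk_checksums(rows: Sequence[Row], chunk_size: int) -> List[int]:
    """Return one running CRC32 per chunk of rows."""
    sums = []
    for start in range(0, len(rows), chunk_size):
        crc = 0
        for row in rows[start:start + chunk_size]:
            # Feed each row into the running CRC for this chunk.
            crc = zlib.crc32(repr(row).encode("utf-8"), crc)
        sums.append(crc)
    return sums


def diverged_chunks(primary: Sequence[Row], replica: Sequence[Row],
                    chunk_size: int = 2) -> List[int]:
    """Return the indexes of chunks whose checksums differ."""
    pairs = zip_longest(chunk_checksums(primary, chunk_size),
                        chunk_checksums(replica, chunk_size))
    return [i for i, (p, r) in enumerate(pairs) if p != r]


primary = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
replica = [(1, "a"), (2, "b"), (3, "X"), (4, "d")]
print(diverged_chunks(primary, replica))  # [1]: only the second chunk differs
```

Comparing per-chunk checksums narrows the search: only the rows in a mismatched chunk need a row-by-row comparison.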
Distributed systems: Merkle trees (used in Cassandra, Dynamo, Bitcoin) build a tree of checksums where each parent is a hash of its children. Comparing the root hash between two nodes reveals whether their data is identical without comparing every row.
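A Merkle root takes only a few lines to compute. The sketch below assumes SHA-256 for every node and duplicates the last hash when a level has an odd count, which is one common convention and not necessarily what any of the systems named above do internally.

```python
import hashlib
from typing import List


def merkle_root(leaves: List[bytes]) -> bytes:
    """Hash the leaves, then hash pairs upward until one root remains."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad odd levels by repeating the last hash
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


node_a = [b"row1", b"row2", b"row3", b"row4"]
node_b = [b"row1", b"row2", b"rowX", b"row4"]

print(merkle_root(node_a) == merkle_root(list(node_a)))  # True: identical data
print(merkle_root(node_a) == merkle_root(node_b))        # False: one row differs
```

When the roots differ, the two nodes can compare child hashes level by level and descend only into the subtree that mismatches, so the amount of data exchanged grows with the size of the difference rather than the size of the dataset.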
Checksums vs hashes vs signatures
| | Checksum | Cryptographic hash | Digital signature |
|---|---|---|---|
| Purpose | Error detection | Integrity + identity | Integrity + authentication |
| Examples | CRC32, Adler-32 | SHA-256; MD5 (collisions now practical) | RSA-SHA256, ECDSA |
| Collision resistance | Low | High (SHA-256) | High |
| Adversarial forgery | Easy | Hard (SHA-256) | Infeasible without the private key |
| Speed | Very fast | Fast | Slow |
| Use when | Hardware/network error detection | File integrity, deduplication | Detecting tampering by adversaries |
Tradeoffs
False negatives: no checksum guarantees perfect detection. A CRC32 lets through roughly 1 in 2^32 (about 1 in 4.3 billion) random corruptions. A 256-bit hash such as SHA-256 lets through essentially none, but costs more CPU. Choose the algorithm based on the threat model.
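To put the false-negative rate in perspective, the arithmetic can be run for a made-up workload. The sketch assumes random corruption and a checksum that behaves like a uniform n-bit function, so a corrupted block slips through with probability 2^-n; the figure of one million corrupted blocks per day is purely hypothetical.

```python
def undetected_fraction(checksum_bits: int) -> float:
    """Chance that a random corruption leaves an n-bit checksum unchanged."""
    return 2.0 ** -checksum_bits


corrupted_blocks_per_day = 1_000_000  # hypothetical, for illustration only

for name, bits in (("CRC32", 32), ("MD5", 128), ("SHA-256", 256)):
    expected = corrupted_blocks_per_day * undetected_fraction(bits)
    print(f"{name}: {expected:.3g} undetected corruptions per day")
```

Under these assumptions CRC32 lets through about 0.00023 corruptions per day, roughly one every 12 years, while the figures for the 128-bit and 256-bit hashes are too small to matter.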
Storage overhead: CRC32 adds 4 bytes per block; SHA-256 adds 32 bytes. For individual large files, this is negligible. For databases storing millions of rows, per-row checksums add up.
Computation cost: for high-throughput systems (millions of messages per second), CRC computation at every step has measurable overhead. Hardware acceleration (the SSE4.2 crc32 instruction on x86, which computes the CRC32C variant) brings CRC32C computation to near-zero cost.
Detect, don't correct: checksums detect corruption but don't fix it. Correction requires either redundant data (erasure coding, covered in Pillar 8) or retransmission (NACK the corrupted packet and request resend).
The one thing to remember
A checksum is a compact fingerprint of data that reveals if the data has changed. Corruption doesn't announce itself — it silently produces wrong results until something downstream fails in a confusing way. Checksums make corruption loud and immediate: compute the fingerprint before storage or transmission, recompute on retrieval, compare. Use CRC32 for speed-critical paths (network packets, disk sectors); use MD5 for file integrity where adversarial forgery isn't a concern; use SHA-256 when you need both integrity and tamper detection.
← Previous: Data Compression — reducing payload size in transit and at rest; the compression algorithm choice involves tradeoffs between speed, ratio, and CPU cost.
→ Next: Bloom Filter — a probabilistic data structure that answers "have I seen this?" in constant memory, with no false negatives.




