# Data Compression: Smaller, Faster, Cheaper

> **Series:** System Design · Scalability & Infrastructure — Pillar 6 of 8

## Systems Design

| # | Post | What it covers |
|---|------|----------------|
| 00 | [Scalability & Infrastructure: The Layer Between Your Code and the Internet](/scalability-infrastructure-the-layer-between-your-code-and-the-internet) | Nine concepts covering load balancing, rate limiting, proxies, compression, and probabilistic data structures that keep large systems fast and reliable. |
| 01 | [Client-Server Architecture: The Model Everything Else Builds On](/client-server-architecture-the-model-everything-else-builds-on) | Client-server is the foundational model for distributed systems. Learn what clients and servers know, where state lives, and how the model scales. |
| 02 | [Load Balancing: Distributing Traffic Across Servers](/load-balancing-distributing-traffic-across-servers) | Load balancers distribute traffic across servers for scale and availability. Learn how they work, what types exist, and what they require of backend servers. |
| 03 | [Load Balancing Algorithms: How Traffic Is Distributed](/load-balancing-algorithms-how-traffic-is-distributed) | Round robin, least connections, IP hash, weighted — each algorithm makes different tradeoffs. Learn how to choose the right one for your workload. |
| 04 | [Rate Limiting: Protecting Services from Overload](/rate-limiting-protecting-services-from-overload) | Rate limiting protects services from overload and abuse. Learn how token bucket, leaky bucket, and sliding window algorithms work and when to use each. |
| 05 | [Proxy vs Reverse Proxy: Which Way Does It Face?](/proxy-vs-reverse-proxy-which-way-does-it-face) | Forward proxies protect clients; reverse proxies protect servers. Learn how each works, what Nginx and Cloudflare do, and when you need which. |
| 06 | **Data Compression: Smaller, Faster, Cheaper** ← you are here | Compression reduces bandwidth and storage costs. Learn how Gzip, Brotli, LZ4, and zstd work, where to apply them, and the CPU tradeoffs involved. |
| 07 | [Checksums: Detecting Corruption Before It Becomes a Catastrophe](/checksums-detecting-corruption-before-it-becomes-a-catastrophe) | Checksums detect silent data corruption in transit and storage. Learn how CRC32, MD5, and SHA-256 work and where to apply them in distributed systems. |
| 08 | [Bloom Filters: Answering "Have I Seen This?" Without Storing Everything](/bloom-filters-answering-have-i-seen-this-without-storing-everything) | A Bloom filter answers "have I seen this?" in constant memory. Learn how they work, why false positives are acceptable, and where they're used in production. |
| 09 | [HyperLogLog: Counting Distinct Items Without Storing Them](/hyperloglog-counting-distinct-items-without-storing-them) | HyperLogLog counts distinct values in ~1.5 KB of memory with <2% error. Learn how it works and why Redis, BigQuery, and Postgres use it. |
| 10 | [Scalability & Infrastructure: Wrap-Up](/scalability-infrastructure-wrap-up) | A recap of all 9 scalability concepts: load balancing, rate limiting, proxies, compression, checksums, Bloom filters, and HyperLogLog. How they fit together. |

---

## The problem

Your URL shortener's API returns JSON. A typical response listing a user's links looks like:

```json
{
  "links": [
    {"id": "x7Kp2", "destination": "https://example.com/...", "click_count": 1234, "created_at": "..."},
    ...
  ],
  "total": 50,
  "page": 1
}
```

For a user with fifty links, that response is about 15KB. At one hundred thousand API calls per day, that's 1.5GB of outbound bandwidth per day, or roughly 45GB per month, just for this one endpoint. At an internet egress rate of around $0.09/GB (typical of AWS's first pricing tier), that's about $4/month for one endpoint at this modest volume, and the bill grows linearly with payload size and request count.

Now add the analytics dashboard (larger responses), the link detail pages, and the real-time click feed. Bandwidth costs scale with traffic. For a business growing at 20% per month, this cost doubles roughly every four months.

Compression is the fastest win available. JSON (structured text with repeated keys, predictable patterns, and limited character variety) compresses by 70–85% with Gzip. That 15KB response becomes roughly 2–4KB, and the egress bill for the endpoint shrinks by the same 70–85%. Users on slow connections also get the response faster.
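The bandwidth arithmetic is easy to reproduce. The sketch below is a back-of-the-envelope calculation rather than a billing model: the function and constant names are illustrative, and its inputs are the 15KB response, 100,000 calls per day, 70–85% Gzip reduction, and 20% monthly growth quoted above.

```python
import math

RESPONSE_BYTES = 15_000   # ~15KB per response (figure from the text)
CALLS_PER_DAY = 100_000   # API calls per day (figure from the text)


def daily_egress_gb(response_bytes: int, calls_per_day: int, reduction: float = 0.0) -> float:
    """Outbound GB per day when compression removes `reduction` of each response."""
    return response_bytes * (1.0 - reduction) * calls_per_day / 1e9


def months_to_double(monthly_growth: float) -> float:
    """Months for traffic to double at a compound monthly growth rate."""
    return math.log(2) / math.log(1.0 + monthly_growth)


print(f"uncompressed: {daily_egress_gb(RESPONSE_BYTES, CALLS_PER_DAY):.2f} GB/day")
for reduction in (0.70, 0.85):
    gb = daily_egress_gb(RESPONSE_BYTES, CALLS_PER_DAY, reduction)
    print(f"gzip at {reduction:.0%} reduction: {gb:.3f} GB/day")
print(f"doubling time at 20%/month: {months_to_double(0.20):.1f} months")
```

Multiplying the daily figure by a per-GB egress price for your provider turns this into a cost estimate; the price is deliberately left out because it varies by provider, region, and volume tier.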

---

## The core idea

Compression algorithms find patterns and redundancy in data and replace them with shorter representations. A compressor and a decompressor share the same algorithm — the compressor creates the compact form, the decompressor reconstructs the original. The tradeoff is CPU time for compression and decompression, paid in exchange for reduced byte size.

---

## The analogy: an efficient filing system

An inefficient clerk writes "the product with identification number 12345 has a price of $45.00" for every product record. An efficient one uses a template: "item 12345: $45.00" — same information, fewer bytes. The more predictable the structure and the more repetitive the patterns, the more efficiently the records compress.

Compression algorithms are automated template finders. They scan data, identify repeated sequences, and replace them with references to previous occurrences. The more repetitive the data, the better the compression ratio.

This is why JSON (highly repetitive: field names repeat on every object) compresses beautifully, binary images don't (already encoded efficiently with little redundancy), and already-compressed data (ZIP files, MP4 videos) gets slightly larger if you try to compress it again (adding compression headers with no corresponding savings).
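A quick experiment makes the contrast concrete. The sketch below uses Python's standard `gzip` module on a synthetic payload; the record shape mimics the API response shown earlier but the values are invented for illustration, and seeded pseudo-random bytes stand in for data with no redundancy.

```python
import gzip
import json
import random


def reduction(original: bytes, compressed: bytes) -> float:
    """Fraction of bytes saved (negative when the output grew)."""
    return 1.0 - len(compressed) / len(original)


# Repetitive JSON: the same four keys appear in every record.
records = [
    {
        "id": f"link{i:04d}",
        "destination": f"https://example.com/page/{i}",
        "click_count": i * 7,
        "created_at": "2024-01-01T00:00:00Z",
    }
    for i in range(50)
]
json_bytes = json.dumps({"links": records}).encode("utf-8")

# Pseudo-random bytes of the same length: no patterns to exploit.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(json_bytes)))

json_gz = gzip.compress(json_bytes)
noise_gz = gzip.compress(noise)
json_gz_twice = gzip.compress(json_gz)  # compressing already-compressed data

print(f"JSON:            {len(json_bytes)} -> {len(json_gz)} bytes ({reduction(json_bytes, json_gz):.0%} saved)")
print(f"random bytes:    {len(noise)} -> {len(noise_gz)} bytes ({reduction(noise, noise_gz):.0%} saved)")
print(f"gzip of gzip:    {len(json_gz)} -> {len(json_gz_twice)} bytes")
```

The exact numbers depend on the payload, so treat the printed figures as indicative only.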

---

## How compression works

### LZ77 and the dictionary model

Most general-purpose compression algorithms (including Gzip, Brotli, and zstd) are descendants of LZ77. The core idea: as the compressor scans data, it maintains a sliding window of recently seen bytes. When it encounters a sequence it has seen before within the window, it emits a reference (offset, length) instead of repeating the bytes.

```
Input: "the cat sat on the mat"
        ^^^                    First "the": emit literally
                       ^^^     Second "the": emit (offset=15, length=3), a reference 15 bytes back to the first "the"
```
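The sliding-window matching described above can be sketched in a few lines. This is a toy model, not how Gzip or zstd are implemented: it searches the window by brute force (real compressors use hash tables), and the token format (a literal byte or an `(offset, length)` tuple) and the function names are assumptions made for illustration.

```python
from typing import List, Tuple, Union

Token = Union[bytes, Tuple[int, int]]  # a literal byte, or (offset, length)


def lz77_compress(data: bytes, window: int = 4096, min_match: int = 3) -> List[Token]:
    """Greedy LZ77: emit a back-reference when a long enough match is in the window."""
    tokens: List[Token] = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Brute-force search over every start position inside the window.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            tokens.append((best_off, best_len))
            i += best_len
        else:
            tokens.append(data[i:i + 1])
            i += 1
    return tokens


def lz77_decompress(tokens: List[Token]) -> bytes:
    """Rebuild the original bytes by replaying literals and back-references."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                out.append(out[-offset])  # byte by byte, so overlapping matches work
        else:
            out += token
    return bytes(out)


text = b"the cat sat on the mat"
tokens = lz77_compress(text)
print(tokens)
assert lz77_decompress(tokens) == text
```

Because the matcher is greedy, it extends the match for the second "the" over the trailing space, so the reference it reports is one byte longer than the three-letter word. It also finds the repeated "at " inside "cat sat".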

After LZ77 finds these references, a second pass applies Huffman coding — assigning shorter bit sequences to more frequent symbols — further reducing size.
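The Huffman step can be illustrated on its own. The sketch below computes only the code lengths (how many bits each symbol would get), which is enough to see frequent symbols receiving shorter codes. It is a minimal textbook construction with illustrative names; DEFLATE additionally caps code lengths and uses a canonical code assignment, which is omitted here.

```python
import heapq
from collections import Counter
from typing import Dict


def huffman_code_lengths(data: bytes) -> Dict[int, int]:
    """Return {symbol: code length in bits} for a Huffman code built from `data`."""
    freq = Counter(data)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries are (weight, tie_breaker, {symbol: depth_so_far}).
    heap = [(weight, sym, {sym: 0}) for sym, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, t1, left = heapq.heappop(heap)
        w2, t2, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, min(t1, t2), merged))
    return heap[0][2]


sample = b"aaaaaaaabbbc"  # 'a' is frequent, 'c' is rare
lengths = huffman_code_lengths(sample)
total_bits = sum(lengths[sym] * count for sym, count in Counter(sample).items())
print({chr(sym): bits for sym, bits in lengths.items()})
print(f"{total_bits} bits with Huffman codes vs {len(sample) * 8} bits at 8 bits per byte")
```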

### Gzip

Gzip (DEFLATE algorithm — LZ77 + Huffman coding) is the universal default for HTTP response compression. Every browser supports it. Every web server can emit it.

```
Browser sends (request):
  Accept-Encoding: gzip, deflate, br

Server responds:
  Content-Encoding: gzip

Nginx (or app server) compresses the response body before sending
Browser decompresses on receipt
```

Compression ratios for typical web payloads:
- JSON API responses: 70–85% reduction (15KB → 2–4KB)
- HTML pages: 65–80% reduction
- CSS: 75–85% reduction
- JavaScript: 60–75% reduction
- Images (JPEG, PNG): 0–5% reduction (already compressed)
- Video (MP4): 0% or slightly larger (already compressed)

**Compression levels:** Gzip offers levels 1–9. Level 1 is fast, low compression ratio. Level 9 is slow, maximum ratio. Level 6 (the default) provides most of the ratio benefit at moderate CPU cost. In practice, levels 5–6 are the sweet spot for HTTP responses.
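One way to check the level tradeoff for your own payloads is to measure it. The sketch below uses Python's standard `gzip` module on a synthetic JSON payload (the record values are invented, and `compare_gzip_levels` is an illustrative helper, not a library function). Timings depend heavily on hardware and payload, so read the output as a relative comparison only.

```python
import gzip
import json
import time
from typing import Dict, Tuple


def compare_gzip_levels(payload: bytes, levels=(1, 6, 9)) -> Dict[int, Tuple[int, float]]:
    """Return {level: (compressed size in bytes, seconds taken)}."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = gzip.compress(payload, compresslevel=level)
        elapsed = time.perf_counter() - start
        # Compression is lossless at every level; only size and time change.
        assert gzip.decompress(compressed) == payload
        results[level] = (len(compressed), elapsed)
    return results


# Synthetic stand-in for a large API response.
links = [
    {
        "id": f"link{i:04d}",
        "destination": f"https://example.com/page/{i}",
        "click_count": i * 7,
        "created_at": "2024-01-01T00:00:00Z",
    }
    for i in range(2000)
]
payload = json.dumps({"links": links, "total": 2000, "page": 1}).encode("utf-8")

for level, (size, seconds) in compare_gzip_levels(payload).items():
    saved = 1.0 - size / len(payload)
    print(f"level {level}: {size} bytes ({saved:.0%} saved) in {seconds * 1000:.2f} ms")
```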

**When to skip:** already-compressed binary data (images, audio, video, archives) is not worth compressing: you'll pay CPU cost for negligible or negative savings.

### Brotli

Brotli (Google, 2015) is a more modern algorithm designed specifically for web content. It uses a pre-built dictionary of common web content patterns (HTML tags, CSS keywords, common URL structures), enabling better initial compression before the sliding window fills.

Brotli typically achieves a 15–25% better compression ratio than Gzip for web content at comparable CPU cost. All major modern browsers support it and advertise it with `Accept-Encoding: br`, though only over HTTPS connections.

```
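# Nginx Brotli configuration (requires the third-party ngx_brotli module)
brotli on;
brotli_comp_level 6;
# text/html is always compressed, so it does not need to be listed
brotli_types text/css application/json application/javascript;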
```

Brotli's pre-built dictionary is only an advantage for text content. For arbitrary binary data, Brotli and Gzip perform similarly.

**Practical guidance:** enable both Gzip and Brotli, send Brotli to browsers that support it (`br` in `Accept-Encoding`), fall back to Gzip for others.
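The negotiation logic behind that guidance is small. The sketch below is a simplified illustration, not a full implementation of HTTP content negotiation: the function names are hypothetical, it honours an explicit `q=0` (meaning "not acceptable") but otherwise lets the server's preference order win, and it ignores the `*` wildcard.

```python
from typing import Dict, Sequence


def parse_accept_encoding(header: str) -> Dict[str, float]:
    """Map each advertised coding to its q-value (1.0 when none is given)."""
    codings = {}
    for part in header.split(","):
        name, _, params = part.strip().partition(";")
        name = name.strip().lower()
        if not name:
            continue
        quality = 1.0
        params = params.replace(" ", "").lower()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q-value as "not acceptable"
        codings[name] = quality
    return codings


def choose_encoding(header: str, server_preference: Sequence[str] = ("br", "gzip")) -> str:
    """Pick the first server-preferred coding the client accepts, else 'identity'."""
    codings = parse_accept_encoding(header)
    for name in server_preference:
        if codings.get(name, 0.0) > 0:
            return name
    return "identity"  # send the body uncompressed


print(choose_encoding("gzip, deflate, br"))
print(choose_encoding("gzip, deflate"))
print(choose_encoding(""))
```

In practice a reverse proxy such as Nginx performs this selection for you; the sketch only shows what "prefer Brotli, fall back to Gzip" means at the header level.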

### LZ4

LZ4 prioritises decompression speed above all else — it decompresses at several GB/second on modern hardware. Compression ratio is lower than Gzip (roughly 40–60% reduction for typical data), but it's extremely fast in both directions.

LZ4 is used for:
- In-memory caches where compressed data must be decompressed on every access
- Network transfers where latency matters more than bandwidth (local data centre transfers)
- Streaming compression where input arrives continuously

### zstd (Zstandard)

zstd (Facebook, 2016) offers a remarkable range of compression levels — from LZ4-like speed (level 1) to Brotli-like ratio (level 22), with a tunable speed/ratio tradeoff across the entire range.

zstd is the modern general-purpose choice for data at rest:
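- PostgreSQL supports zstd for WAL compression and, in recent versions, for `pg_dump` and base backups (TOAST itself uses pglz or lz4)
- S3 and other object stores can hold zstd-compressed objects like any other bytes
- Kafka supports zstd for message compression
- RocksDB and Cassandra can use zstd for SSTable compression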

At its default level (level 3), zstd is typically faster than Gzip while achieving better or equivalent compression ratios.

---

## Where to apply compression

**HTTP responses (reverse proxy or app server):** enable Gzip + Brotli for all text content (`application/json`, `text/html`, `text/css`, `application/javascript`). This is almost always the right default.

**Database storage:** many databases compress data transparently, per page, per block, or per large value (PostgreSQL uses pglz or lz4 for TOASTed values; Cassandra uses lz4 or zstd for SSTable chunks). Enable compression for text-heavy or JSON columns.

**Object storage:** compress text files (logs, CSV exports, JSON datasets) before uploading to S3. zstd achieves better ratios than Gzip at faster speeds for bulk data.

**Message queues:** Kafka supports producer-level compression (gzip, snappy, lz4, zstd). Enable compression for message payloads with high text content.

**Backups:** always compress backups. A PostgreSQL dump compressed with zstd is typically 80–90% smaller than uncompressed.

---

## Tradeoffs

**Compression ratio vs CPU cost.** Higher compression levels (Gzip 9, zstd 20) achieve better ratios but use significantly more CPU. For high-throughput services, the CPU cost of compression at levels above the default may outweigh the bandwidth savings.

**First-byte latency.** Synchronous compression adds latency before the first byte is sent. Streaming compression sends data in chunks as it's compressed — lower first-byte latency. Nginx and most web servers use streaming compression by default for dynamic content.
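Streaming compression can be sketched with Python's standard `zlib` module. The generator below is an illustration under stated assumptions, not how any particular web server implements it: `gzip_stream` is a hypothetical helper, and it flushes after every chunk so that each piece can be sent immediately, which costs a little compression ratio in exchange for lower latency.

```python
import gzip
import zlib
from typing import Iterable, Iterator


def gzip_stream(chunks: Iterable[bytes], level: int = 6) -> Iterator[bytes]:
    """Yield gzip-framed output piece by piece as input chunks arrive."""
    # wbits=31 (16 + 15) asks zlib for a gzip header/trailer and a 32KB window.
    compressor = zlib.compressobj(level=level, wbits=31)
    for chunk in chunks:
        # Z_SYNC_FLUSH forces out everything compressed so far, so the
        # receiver can start decompressing before the response is complete.
        piece = compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
        if piece:
            yield piece
    yield compressor.flush()  # final block plus the gzip trailer (CRC and length)


body_chunks = [b'{"links": [', b'{"id": "a"}, {"id": "b"}', b'], "total": 2}']
pieces = list(gzip_stream(body_chunks))
print([len(piece) for piece in pieces])
assert gzip.decompress(b"".join(pieces)) == b"".join(body_chunks)
```

The first piece is available as soon as the first chunk has been compressed, instead of after the whole body has been buffered and compressed.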

**Compressing already-compressed data wastes CPU.** JPEG, PNG, MP4, MP3, ZIP, and most other media and archive formats are already compressed. Adding Gzip on top adds CPU cost with no benefit. Configure compression to apply only to content types that benefit.
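In application code, that rule usually takes the form of an allow-list. The sketch below is a minimal illustration: the set contents mirror the text content types named in this post, while the names `COMPRESSIBLE_TYPES` and `should_compress` are hypothetical and the list is not exhaustive.

```python
COMPRESSIBLE_TYPES = {
    "text/html",
    "text/css",
    "application/json",
    "application/javascript",
}


def should_compress(content_type: str, already_encoded: bool = False) -> bool:
    """Compress only allow-listed text types that are not already encoded."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return not already_encoded and media_type in COMPRESSIBLE_TYPES


print(should_compress("application/json; charset=utf-8"))
print(should_compress("image/jpeg"))
print(should_compress("text/html", already_encoded=True))
```

An allow-list fails safe: a content type nobody thought about is sent uncompressed rather than compressed for no benefit.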

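**Proxy caches and CDNs.** A shared cache must not hand a Brotli- or Gzip-encoded body to a client that did not ask for it, so compressed responses should carry `Vary: Accept-Encoding` and the cache keeps a separate copy per encoding (typically `br`, `gzip`, and uncompressed). This multiplies storage at the edge for those responses but ensures every client gets an encoding it can decode.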
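Raw `Accept-Encoding` strings differ between clients even when they mean the same thing, so caches commonly normalise the header before using it in a cache key. The sketch below shows the idea with hypothetical helper names; it assumes the cache stores at most three variants (`br`, `gzip`, uncompressed) and treats only an explicit `q=0` as a rejection.

```python
from typing import Tuple


def _is_rejected(params: str) -> bool:
    """True when the parameter string is an explicit q=0 (or a malformed q)."""
    params = params.replace(" ", "").lower()
    if not params.startswith("q="):
        return False
    try:
        return float(params[2:]) == 0.0
    except ValueError:
        return True


def normalized_encoding(accept_encoding: str) -> str:
    """Collapse a raw Accept-Encoding header to 'br', 'gzip', or 'identity'."""
    advertised = set()
    for part in accept_encoding.split(","):
        name, _, params = part.strip().partition(";")
        name = name.strip().lower()
        if name and not _is_rejected(params):
            advertised.add(name)
    for encoding in ("br", "gzip"):
        if encoding in advertised:
            return encoding
    return "identity"


def cache_key(url: str, accept_encoding: str) -> Tuple[str, str]:
    """Key cached responses by URL plus the normalised encoding variant."""
    return (url, normalized_encoding(accept_encoding))


# Two differently worded headers map to the same cached variant.
print(cache_key("/api/links", "gzip, deflate, br"))
print(cache_key("/api/links", "br, gzip"))
```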

---

## The one thing to remember

> **Compression is almost always worth enabling for text content over HTTP.** A JSON response that compresses from 15KB to 3KB puts 5x fewer bytes on the wire, which cuts bandwidth cost by the same factor and shortens transfer time, especially for users on slow connections. The CPU cost is usually small compared to the bandwidth and cost savings. Enable Gzip and Brotli at your reverse proxy for all text content types, skip binary content, and use zstd for data at rest in databases, object storage, and backups.

---

*← Previous: **[Proxy vs Reverse Proxy](/proxy-vs-reverse-proxy-which-way-does-it-face)** — the intermediaries that sit between clients and servers, and why the direction matters.*

*→ Next: **[Checksums](/checksums-detecting-corruption-before-it-becomes-a-catastrophe)** — how do you know the data you received is the data that was sent? Checksums detect silent corruption without sending the data twice.*