# Rate Limiting: Protecting Services from Overload

> **Series:** System Design · Scalability & Infrastructure — Pillar 6 of 8

## Systems Design

| # | Post | What it covers |
|---|------|----------------|
| 00 | [Scalability & Infrastructure: The Layer Between Your Code and the Internet](/scalability-infrastructure-the-layer-between-your-code-and-the-internet) | Nine concepts covering load balancing, rate limiting, proxies, compression, and probabilistic data structures that keep large systems fast and reliable. |
| 01 | [Client-Server Architecture: The Model Everything Else Builds On](/client-server-architecture-the-model-everything-else-builds-on) | Client-server is the foundational model for distributed systems. Learn what clients and servers know, where state lives, and how the model scales. |
| 02 | [Load Balancing: Distributing Traffic Across Servers](/load-balancing-distributing-traffic-across-servers) | Load balancers distribute traffic across servers for scale and availability. Learn how they work, what types exist, and what they require of backend servers. |
| 03 | [Load Balancing Algorithms: How Traffic Is Distributed](/load-balancing-algorithms-how-traffic-is-distributed) | Round robin, least connections, IP hash, weighted — each algorithm makes different tradeoffs. Learn how to choose the right one for your workload. |
| 04 | **Rate Limiting: Protecting Services from Overload** ← you are here | Rate limiting protects services from overload and abuse. Learn how token bucket, leaky bucket, and sliding window algorithms work and when to use each. |
| 05 | [Proxy vs Reverse Proxy: Which Way Does It Face?](/proxy-vs-reverse-proxy-which-way-does-it-face) | Forward proxies protect clients; reverse proxies protect servers. Learn how each works, what Nginx and Cloudflare do, and when you need which. |
| 06 | [Data Compression: Smaller, Faster, Cheaper](/data-compression-smaller-faster-cheaper) | Compression reduces bandwidth and storage costs. Learn how Gzip, Brotli, LZ4, and zstd work, where to apply them, and the CPU tradeoffs involved. |
| 07 | [Checksums: Detecting Corruption Before It Becomes a Catastrophe](/checksums-detecting-corruption-before-it-becomes-a-catastrophe) | Checksums detect silent data corruption in transit and storage. Learn how CRC32, MD5, and SHA-256 work and where to apply them in distributed systems. |
| 08 | [Bloom Filters: Answering "Have I Seen This?" Without Storing Everything](/bloom-filters-answering-have-i-seen-this-without-storing-everything) | A Bloom filter answers "have I seen this?" in constant memory. Learn how they work, why false positives are acceptable, and where they're used in production. |
| 09 | [HyperLogLog: Counting Distinct Items Without Storing Them](/hyperloglog-counting-distinct-items-without-storing-them) | HyperLogLog counts distinct values in ~1.5 KB of memory with <2% error. Learn how it works and why Redis, BigQuery, and Postgres use it. |
| 10 | [Scalability & Infrastructure: Wrap-Up](/scalability-infrastructure-wrap-up) | A recap of all 9 scalability concepts: load balancing, rate limiting, proxies, compression, checksums, Bloom filters, and HyperLogLog. How they fit together. |

---

# Rate Limiting: Protecting Services from Overload

## The problem

Your URL shortener's redirect API has no limits. One morning, a competitor starts scraping — sending five thousand requests per second from a single IP, systematically cycling through short codes to harvest destination URLs. Your server handles it, but barely. Response time for legitimate users climbs.

A week later, a developer integrating with your API has a bug in their retry loop. Their service hammers your API at two thousand requests per second, non-stop, until someone notices. Your database connection pool exhausts. Requests start timing out. Legitimate users are affected.

Neither the scraper nor the buggy client intended to cause harm — but both did. In both cases, there was no mechanism to detect and stop the behaviour automatically. You needed someone to notice, diagnose, and manually block the source. That takes time. Damage accumulates.

Rate limiting is the mechanism that stops this automatically, before it becomes a crisis.

---

## The core idea

A rate limiter enforces a maximum request rate per client (or per IP, per API key, per endpoint). Requests within the limit are served normally. Requests that exceed the limit are rejected immediately — typically with HTTP 429 Too Many Requests — without consuming significant backend resources.

The client is responsible for slowing down, retrying with backoff, or queuing. The server's job is to protect itself, not to absorb unlimited load.

---

## The analogy: a nightclub with a capacity policy

The doorman at a nightclub enforces a policy: no more than 100 new people per hour, and no single group of more than 10 people at once. It doesn't matter if the people in line are great — the policy is the policy. Those turned away can come back in an hour. The doorman doesn't call the manager; they just enforce the rule consistently.

The rate limiter is the doorman. The limit is the policy. HTTP 429 is "come back later." The client queue is the line outside.

---

## Rate limiting algorithms

### Fixed Window

Count requests in a fixed time window (e.g., 1-minute slots). If a client exceeds the limit in a window, reject subsequent requests until the window resets.

```
Window: 60-second slots
Limit: 100 requests per window

Client requests at t=0:  60 requests (allowed)
Client requests at t=45: 40 more (allowed — total 100)
Client requests at t=50: 1 more (rejected — 429)
Window resets at t=60
Client requests at t=61: allowed again
```

**Strengths:** Simple to implement (one counter per client per window). Trivial in Redis: `INCR client:123:window:1735689600` with expiry at window end.

**Weaknesses:** Boundary burst — a client can make 100 requests just before the window resets and 100 more just after, sending 200 requests in a two-second span despite the "100 per minute" limit. The window boundary is a burst opportunity.

---

### Sliding Window Log

Record the timestamp of every request. For each new request, count how many requests in the log fall within the last N seconds. If the count exceeds the limit, reject.

```
Limit: 5 requests per 10 seconds
Current time: t=15

Request log: [t=6, t=9, t=11, t=13, t=14]  (5 entries in [5..15])
New request at t=15: count in [5,15] = 5 → reject (429)
```

**Strengths:** No boundary burst — the window is always the last N seconds relative to the current moment.

**Weaknesses:** Must store every request timestamp. At high request rates per client, the log is large. Memory-intensive.

---

### Sliding Window Counter

A hybrid: track counts in two fixed windows (current + previous) and interpolate.

```
Limit: 100 requests per minute
Previous window count: 80
Current window count: 30
Elapsed fraction of current window: 40%

Estimated requests in last 60 seconds:
  = (80 × (1 - 0.4)) + 30
  = (80 × 0.6) + 30
  = 48 + 30
  = 78 — within limit, allow
```

**Strengths:** Smooth approximation of sliding window without storing individual timestamps. O(1) space per client.

**Weaknesses:** Slight inaccuracy — assumes requests in the previous window were uniformly distributed.

**Implementation:** Two Redis hash fields per client: `window:prev` and `window:curr`, plus a timestamp for the current window start.

---

### Token Bucket

Each client has a bucket that holds up to N tokens. Tokens are added at a fixed rate (R tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected.

```
Bucket capacity: 100 tokens
Refill rate: 10 tokens/second

t=0: bucket=100, request uses 1 token → bucket=99
t=0: burst of 99 more requests → bucket=0
t=1: 10 tokens added → bucket=10
t=1: 10 requests → bucket=0
t=2: 10 tokens added → bucket=10
...
```

**Strengths:** Allows bursts up to bucket capacity. The sustained rate is limited to the refill rate. Burst tolerance is a feature — brief spikes (a user clicking a link multiple times quickly) are absorbed; sustained high volume is throttled.

**Weaknesses:** Two parameters to tune (bucket size and refill rate). Token accumulation during idle periods allows a burst equal to the full bucket capacity.

**Lazy evaluation in Redis:** don't maintain a timer to add tokens. Instead, on each request, compute how many tokens would have been added since the last request and add them then.

```python
def token_bucket_check(client_id, bucket_capacity, refill_rate):
    now = time.time()
    last_time, tokens = redis.hmget(f"bucket:{client_id}", "last_time", "tokens")

    if last_time is None:
        tokens = bucket_capacity
    else:
        elapsed = now - float(last_time)
        tokens = min(bucket_capacity, float(tokens) + elapsed * refill_rate)

    if tokens < 1:
        return False  # 429

    redis.hset(f"bucket:{client_id}", mapping={"last_time": now, "tokens": tokens - 1})
    redis.expire(f"bucket:{client_id}", 3600)
    return True
```

---

### Leaky Bucket

Requests fill a queue (the "bucket"). A fixed-rate processor drains the queue, forwarding one request per time unit. If the queue is full, new requests are dropped.

```
Queue capacity: 50
Processing rate: 10 requests/second

Burst of 100 requests arrives:
  50 enter the queue; 50 are dropped immediately
  Queue drains at 10/s → 5 seconds to drain
```

**Strengths:** Produces perfectly smooth output traffic regardless of input burst. Ideal for protecting backends that require steady input rate.

**Weaknesses:** Under burst, requests queue and experience latency before processing. High-burst workloads see the first 50 requests delayed, not the last 50 rejected. Token bucket is usually preferable for APIs — callers prefer immediate rejection over delayed processing.

---

## Where rate limiting lives

**At the API gateway / reverse proxy:** rate limiting before the request reaches application code. Nginx and Kong have rate limiting modules. This is the most efficient location — rejected requests don't consume app server CPU or database connections.

**In application code:** more flexible (custom logic per endpoint, per user tier), but adds overhead to every request and requires a shared counter store (Redis) for multi-server deployments.

**At the edge (Cloudflare, Fastly):** rate limiting before traffic reaches your data centre at all. Best for DDoS protection — the attack traffic is absorbed by the CDN edge, not your infrastructure.

### Distributed rate limiting

With multiple app servers, a per-server counter is not enough — a client can send 100 requests to Server A, 100 to Server B, and 100 to Server C, staying under the per-server limit while sending 300× its quota in aggregate.

The counter must be global. Redis is the standard solution — all servers increment the same key. A Lua script in Redis makes the check-and-increment atomic:

```lua
-- Atomic check-and-increment
local current = redis.call("INCR", KEYS[1])
if current == 1 then
  redis.call("EXPIRE", KEYS[1], ARGV[1])
end
if current > tonumber(ARGV[2]) then
  return 0  -- reject
end
return 1  -- allow
```

---

## Tradeoffs

**Precision vs cost.** Sliding window log is most accurate but expensive. Sliding window counter is approximate but O(1) space. Fixed window is simple but has boundary burst vulnerability. Token bucket is accurate and burst-tolerant but has two parameters.

**Granularity.** Rate limits can apply at many levels: per IP, per API key, per user, per endpoint, per user-endpoint pair. Finer granularity requires more counter keys but provides more precise protection. A malicious user using multiple IPs defeats per-IP limits.

**Client experience.** Always return HTTP 429 with a `Retry-After` header — tell the client when they can try again. Clients that don't see `Retry-After` often implement aggressive retries that make the problem worse.

**Tiered limits.** Free tier: 100 requests/minute. Pro tier: 1000 requests/minute. Enterprise: unlimited. Rate limits should be a product feature, not just an infrastructure protection — different plans get different limits, and clients should know their tier's limits.

---

## The one thing to remember

> **Rate limiting is the mechanism that ensures one client can't consume a disproportionate share of your service's capacity.** Token bucket is the most practical algorithm for most APIs — it allows short bursts (good UX) while enforcing a sustained rate (protection). Always use a shared counter store (Redis) for distributed deployments so the limit is global, not per-instance. And always return `Retry-After` — clients that know when to retry behave better than clients that keep hammering.

---

*← Previous: **[Load Balancing Algorithms](/load-balancing-algorithms-how-traffic-is-distributed)** — the load balancer selects a backend for every request; the algorithm it uses determines fairness, latency, stickiness, and how it responds to heterogeneous server pools.*

*→ Next: **[Proxy vs Reverse Proxy](/proxy-vs-reverse-proxy-which-way-does-it-face)** — the intermediaries that sit between clients and servers, and why the direction matters.*