Series: System Design · Caching — Pillar 5 of 8

Systems Design

#	Post	What it covers
00	Caching: The Fastest Database Query Is the One You Don't Make	Caching is one of the most impactful and error-prone tools in system design. Six concepts covering the full lifecycle of a production cache layer.
01	Caching: Storing Results Closer to Where They're Needed	Caching stores expensive results closer to the reader. Learn how it works, the main patterns, and when it hurts more than it helps.
02	Cache Invalidation: Knowing When the Copy Is Wrong ← you are here	Cache invalidation is notoriously difficult. Learn the main strategies, when each applies, and how to avoid serving stale data at scale.
03	Distributed Cache: Spreading Cache Across a Cluster	A single cache node is a bottleneck and a SPOF. Learn how distributed caches partition data, replicate for availability, and handle node failures.
04	Cache Eviction Policies: What Gets Thrown Out When the Cache Is Full	When a cache fills up, something must go. Learn how LRU, LFU, FIFO, and TTL-based eviction work and how to choose the right policy for your data.
05	Cache Stampede: When Expiry Triggers a Database Avalanche	When a hot cache entry expires, hundreds of servers query the database simultaneously. Learn how cache stampedes happen and how to prevent them.
06	Cache Warming: Starting Hot Instead of Cold	A cold cache causes database overload on startup. Learn how to warm caches proactively using predictive loading, lazy warming, and scheduled jobs.
07	Caching: Wrap-Up	A recap of all 6 caching concepts: what caching is, invalidation strategies, distributed caches, eviction policies, stampedes, and warming. How they connect.

Cache Invalidation: Knowing When the Copy Is Wrong

The problem

Phil Karlton's famous quip — "there are only two hard things in computer science: cache invalidation and naming things" — has survived decades because it's true.

Your URL shortener caches destination URLs in Redis with a one-hour TTL. A user creates a link to a press release. Twenty minutes later, the press release URL changes — the company restructured their site. The user updates the destination in your dashboard. The database is updated. But the cache still holds the old URL. For the next forty minutes, anyone who clicks that short link gets a 404.

TTL solved the database load problem. It created a correctness problem. The cache has a copy of data that no longer reflects reality, and it has no idea.

Invalidation is the set of strategies for keeping a cache consistent with its source. It's hard because the cache and the source are separate systems updated independently, and coordination between them is never free.

The core idea

Cache invalidation is the process of removing or updating cache entries when the underlying data changes, so that future reads get fresh data rather than stale copies. There is no single right strategy — the right approach depends on how often data changes, how harmful stale reads are, and how much complexity you can afford.

The analogy: a printed train timetable

A printed train timetable is a cache of the train schedule. It was accurate on the day it was printed. The moment the rail operator changes a service — adds a stop, adjusts a departure time — the timetable becomes stale. Every commuter relying on it has the wrong data.

Invalidation strategies map directly to how the timetable problem is solved:

TTL: "The timetable expires at the end of the quarter. Get a new one then." Stale for weeks, but low effort.
Purge on change: "Every time a service changes, we reprint and redistribute the relevant page." Immediate accuracy, high coordination cost.
Event-driven: "We publish a bulletin whenever a service changes. Commuters with the app get push notifications." Accurate and efficient — but requires infrastructure.
Versioning: "Timetables are stamped with a version. Always request the current version, not a cached copy."

No single strategy suits all commuters. Neither does any single strategy suit all data in a system.

Invalidation strategies

TTL (time-to-live)

Set an expiry on every cache entry. When it expires, the next read fetches fresh data from the source and re-populates the cache.

redis.setex("url:x7Kp2", 60, "https://example.com/press-release")  # key, TTL in seconds, value
# Entry expires after 60 seconds regardless of whether the data changed

Strengths: Simple. No coordination required. Self-healing — stale entries expire automatically.

Weaknesses: Blunt. Data is stale for up to TTL duration after the source changes. Short TTLs reduce staleness but increase cache misses (and origin load). Long TTLs improve hit ratio but increase potential stale window.

When to use: Data that changes infrequently; use cases where brief staleness is acceptable (destination URL redirects, configuration values, product catalogue pages).
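The read path that pairs with a TTL is usually cache-aside: check the cache, fall back to the source on a miss, then store the result with an expiry. The following is a minimal sketch of that loop, not a production client. `TTLCache` is a hypothetical in-memory stand-in for Redis with an injectable clock so the example runs on its own, and `resolve` and `load_from_db` are illustrative names.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory stand-in for a Redis-like cache with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)

    def setex(self, key: str, ttl_seconds: float, value: str) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop the expired entry
            return None
        return value


def resolve(cache: TTLCache, load_from_db: Callable[[str], str],
            short_code: str, ttl_seconds: float = 60) -> str:
    """Cache-aside read: serve from cache, else load and re-populate with a TTL."""
    key = f"url:{short_code}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(short_code)  # cache miss: go to the source
    cache.setex(key, ttl_seconds, value)
    return value
```

Until the entry expires, `resolve` keeps returning the cached value even if the database has changed. That is exactly the stale window described under Weaknesses.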

Active invalidation (cache busting / purge)

When data changes, explicitly delete or update the corresponding cache entry.

def update_destination(short_code, new_url):
    # 1. Update the database
    db.execute("UPDATE links SET destination = ? WHERE code = ?", new_url, short_code)

    # 2. Invalidate the cache entry immediately
    redis.delete(f"url:{short_code}")
    # Next read will be a cache miss and populate fresh data

Strengths: Near-immediate consistency. The stale window is the gap between the database write and the cache delete, typically milliseconds.

Weaknesses: The cache invalidation must be co-located with the write. If the write and the delete are in different services, coordination is required. If the delete fails (network error, crash between write and delete), the cache remains stale until TTL expires. The two-step nature introduces a brief inconsistency window.

When to use: Data with a clear owner and a single write path; when staleness is unacceptable (user profile changes, permission changes, price updates).
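One way to shrink the "delete fails" weakness is to retry the delete a few times before giving up and relying on the TTL. This is a rough sketch under stated assumptions: `invalidate_with_retry` and `delete_key` are hypothetical names, and the built-in `ConnectionError` stands in for whatever exception a real cache client raises on a network failure.

```python
import time
from typing import Callable


def invalidate_with_retry(delete_key: Callable[[str], None], key: str,
                          attempts: int = 3, base_delay: float = 0.05,
                          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to delete a cache key a few times; report whether it succeeded.

    delete_key is a hypothetical callable wrapping the cache client's delete.
    A False result means the entry may stay stale until its TTL expires.
    """
    for attempt in range(attempts):
        try:
            delete_key(key)
            return True
        except ConnectionError:
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # exponential backoff
    return False
```

A caller would typically log or alert on a `False` result. Retrying narrows the failure window but does not close it, because the process can still crash between the database write and the delete.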

Write-through invalidation

Every write updates both the database and the cache in the same code path, so a successful write leaves the cache holding the new value.

def update_destination(short_code, new_url):
    # Write to both, atomically where possible
    db.execute("UPDATE links SET destination = ? WHERE code = ?", new_url, short_code)
    redis.setex(f"url:{short_code}", 3600, new_url)  # key, TTL in seconds, value

Strengths: Cache is always fresh after a write. No separate invalidation step.

Weaknesses: Every write touches the cache, even for data that is never subsequently read. Cache can be populated with cold data. Failure atomicity: if the DB write succeeds but the cache write fails (or vice versa), they're inconsistent.

When to use: When writes and reads are on the same code path; when the cache and database can be updated together reliably.

Event-driven invalidation

The database or a service emits events when data changes. Cache consumers subscribe to these events and invalidate or refresh affected entries.

Link updated (short_code: x7Kp2) → Kafka topic: link.updated
                                  → Cache invalidation service subscribes
                                  → Deletes redis key: url:x7Kp2

This is powerful for distributed systems where the writer and the cache owner are separate services. Change Data Capture (CDC) tools (like Debezium) can watch database changes and publish events without application code changes.

Strengths: Decoupled. The writer doesn't need to know about all caches. Multiple caches can subscribe to the same event stream. Near-real-time invalidation without polling.

Weaknesses: Eventual consistency, because there is a propagation delay between the write and the invalidation. Event delivery guarantees matter: if an event is lost, the cache stays stale until the entry's TTL expires, or indefinitely if it has none. Infrastructure complexity (requires a message broker).

When to use: Microservices where the writer and cache owner are separate; when CDC is already in use; when you need to invalidate multiple downstream caches from a single write event.
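The subscriber side of this flow can be sketched without a broker. In the example below, a plain dict stands in for the cache and a list of dicts stands in for messages consumed from the `link.updated` topic. The event shape (`type`, `short_code`) and the function name `apply_invalidations` are assumptions for illustration, not a real Kafka or Debezium payload.

```python
from typing import Dict, Iterable, List


def apply_invalidations(cache: Dict[str, str], events: Iterable[dict]) -> List[str]:
    """Delete the cache key named by each link.updated event.

    Deleting is idempotent, so redelivered (duplicate) events are harmless.
    Returns the keys that were actually removed, for logging.
    """
    removed = []
    for event in events:
        if event.get("type") != "link.updated":
            continue  # ignore event types this consumer does not own
        key = f"url:{event['short_code']}"
        if cache.pop(key, None) is not None:
            removed.append(key)
    return removed
```

Because a delete can be applied any number of times with the same result, this consumer tolerates at-least-once delivery. It cannot recover from an event that never arrives.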

Cache versioning / cache keys with version

Instead of invalidating the old entry, make the key version-specific. On update, increment the version. Old entries become unreachable (and eventually evicted); new reads get the new version.

# Version 1 (stored with a TTL so it can expire once orphaned)
redis.set("url:x7Kp2:v1", "https://example.com/old-path", ex=3600)

# After update:
redis.set("url:x7Kp2:v2", "https://example.com/new-path", ex=3600)
# v1 entry is orphaned: nothing reads it, and it is removed when its TTL expires

# The application tracks the current version and reads it before each lookup
redis.set("url:x7Kp2:version", "v2")

A simpler variant uses a hash of the underlying data as part of the key — if the data changes, the hash changes, and the new key has no entry to serve.

Strengths: No race condition between invalidation and a concurrent read populating the old value. Old and new versions can coexist briefly — useful for rolling deployments.

Weaknesses: Orphaned old entries accumulate until eviction. Version tracking adds complexity. Not suitable when you need immediate expiry of old data.
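To make the version pointer concrete, here is a small sketch of the read and update paths. A plain dict stands in for the cache. `read_versioned`, `bump_version`, and `load_from_db` are hypothetical names. The key layout follows the `url:x7Kp2:v1` and `url:x7Kp2:version` pattern used above.

```python
from typing import Callable, Dict


def read_versioned(cache: Dict[str, str], load_from_db: Callable[[str], str],
                   short_code: str) -> str:
    """Look up the current version pointer, then read the versioned key."""
    version = cache.get(f"url:{short_code}:version", "v1")
    key = f"url:{short_code}:{version}"
    if key not in cache:
        cache[key] = load_from_db(short_code)  # miss: populate this version only
    return cache[key]


def bump_version(cache: Dict[str, str], short_code: str) -> str:
    """Advance the pointer; the old versioned entry simply stops being read."""
    pointer = f"url:{short_code}:version"
    current = int(cache.get(pointer, "v1")[1:])
    cache[pointer] = f"v{current + 1}"
    return cache[pointer]
```

The update path never touches the old entry. It only moves the pointer, which is why the orphaned `v1` key stays in memory until it expires or is evicted. With a shared cache, the increment would need to be atomic, for example with a counter command on the cache server rather than a read-modify-write in application code.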

Cache Invalidation: What Goes Wrong

The invalidation race condition

Active invalidation has a subtle race condition worth naming explicitly:

Thread A: Read cache miss for url:x7Kp2
Thread A: Fetch from DB (gets OLD value; B's write has not committed yet)
Thread B: Commit new destination, delete cache entry url:x7Kp2
Thread A: Write OLD value to cache with TTL=3600

Result: Cache holds stale value for another hour

Thread A read the database after Thread B's write was initiated but before it committed, so it fetched the pre-write value. It then wrote that value to the cache after B's delete had already run, which silently undoes the invalidation. Nothing removes the entry again until the next write or until the TTL expires.

Mitigations:

Use a short TTL as a backstop even with active invalidation
Use database transactions and only invalidate the cache after the transaction commits
Use compare-and-swap (CAS) or Lua scripts to atomically check-and-set cache entries
Accept a small stale window and document it
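The check-and-set mitigation can be illustrated with a toy cache in which every delete bumps a per-key token, and a fill is accepted only if the token has not changed since the reader saw the miss. This is a single-process sketch under stated assumptions: `CasCache` and its methods are hypothetical, and a real deployment would need the check and the write to happen atomically on the cache server (for example in a Lua script, as mentioned above).

```python
from typing import Dict, Optional, Tuple


class CasCache:
    """Hypothetical in-memory cache where every delete bumps a per-key token."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}
        self._tokens: Dict[str, int] = {}

    def get_with_token(self, key: str) -> Tuple[Optional[str], int]:
        return self._values.get(key), self._tokens.get(key, 0)

    def delete(self, key: str) -> None:
        self._values.pop(key, None)
        self._tokens[key] = self._tokens.get(key, 0) + 1

    def set_if_token(self, key: str, value: str, token: int) -> bool:
        """Store value only if no delete happened since the token was read."""
        if self._tokens.get(key, 0) != token:
            return False  # an invalidation won the race; drop the stale fill
        self._values[key] = value
        return True


# Replay the interleaving from the trace above, step by step.
db = {"x7Kp2": "https://example.com/old"}
cache = CasCache()

_, token = cache.get_with_token("url:x7Kp2")   # Thread A: cache miss
stale = db["x7Kp2"]                            # Thread A: reads OLD value
db["x7Kp2"] = "https://example.com/new"        # Thread B: commits the write
cache.delete("url:x7Kp2")                      # Thread B: invalidates
filled = cache.set_if_token("url:x7Kp2", stale, token)  # Thread A: rejected
```

Here Thread A's late write is refused, so the next reader misses and loads the new value. The cost is an extra token to carry on every fill.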

Tradeoffs

Consistency vs simplicity. TTL alone is simple but eventually consistent. Active invalidation is more consistent but adds coupling between writes and cache management. Event-driven is powerful but requires infrastructure.

Stale window vs cache miss rate. Aggressive invalidation (short TTL, frequent purges) keeps data fresh but increases cache misses, which increases origin load. Less aggressive invalidation improves hit ratio but risks serving stale data.
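A back-of-the-envelope model makes this tradeoff easier to reason about. Assume a single key, requests arriving independently at a steady average rate (a Poisson process), and a TTL that starts when a miss populates the entry. Each expiry cycle then contains one miss followed by, on average, rate × TTL hits, so the expected miss ratio is 1 / (1 + rate × TTL). The function name and the example rate below are illustrative, and real traffic is rarely this well behaved.

```python
def expected_miss_ratio(requests_per_second: float, ttl_seconds: float) -> float:
    """Miss ratio for one key under the assumptions above: 1 / (1 + rate * ttl)."""
    return 1.0 / (1.0 + requests_per_second * ttl_seconds)


# Compare a few TTLs for a key that receives one request every two seconds.
for ttl in (1, 60, 3600):
    ratio = expected_miss_ratio(requests_per_second=0.5, ttl_seconds=ttl)
    print(f"TTL {ttl:>4}s -> miss ratio {ratio:.4f}, worst-case staleness {ttl}s")
```

Under this model, lengthening the TTL cuts misses sharply at first and then flattens out, while the worst-case stale window keeps growing linearly.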

Coordination cost. Every strategy that improves consistency adds coordination — between the write path and the cache, between services, between the cache and an event broker. Coordination means more failure modes.

When to use it / when not to

TTL alone is sufficient when:

Staleness of minutes is acceptable (redirects, content pages, configuration)
Write frequency is low relative to read frequency

Active invalidation is needed when:

Stale reads cause real harm (permission changes, payment states, inventory)
Write patterns are predictable and the write path owns the cache key

Event-driven invalidation fits when:

Writes happen in a different service than cache ownership
Multiple caches must invalidate on the same event
CDC infrastructure is already in place

In the URL shortener: destination URL caches use TTL + active invalidation (on update, purge the key). User permission caches use active invalidation (permissions change rarely, but when they do it must be immediate). Analytics aggregates use TTL only (brief staleness in dashboards is acceptable).

The one thing to remember

Cache invalidation is fundamentally a distributed consistency problem. The cache and the source are two systems that can diverge. Every invalidation strategy is a tradeoff between how long they can diverge (stale window), how expensive it is to keep them aligned (coordination cost), and how complex the implementation becomes. Use TTL as a backstop always — even when you have active invalidation. It's the last line of defence against a bug in your invalidation logic.

← Previous: Caching — the fundamentals of storing results closer to the reader

→ Next: Distributed Cache — a single cache server is a single point of failure; here's how to spread cache across a cluster and what that changes.
