Skip to main content

Command Palette

Search for a command to run...

Materialized Views: Pre-Computing Expensive Queries

Updated
11 min readView as Markdown
Materialized Views: Pre-Computing Expensive Queries

Series: System Design · Data & Storage — Pillar 4 of 8

Systems Design

# Post What it covers
00 Data & Storage: Where Everything Lives Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars)
01 SQL vs NoSQL: Choosing the Right Database SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call.
02 Database Indexing: The Highest-Leverage Performance Tool Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one.
03 B-Trees & B+ Trees: The Data Structure Behind Database Indexes Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries.
04 LSM Trees: Why Some Databases Are Built for Writes LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it.
05 Denormalisation: Trading Storage for Speed Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely.
06 Database Sharding: Scaling Beyond a Single Node Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces.
07 Data Partitioning: Choosing How to Divide Your Data Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale.
08 Consistent Hashing: Minimising Resharding Pain Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems.
09 Replication & Read Replicas: Scaling Reads and Surviving Failures Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them.
10 Object Storage: Unlimited Scale for Large Binary Data Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it.
11 Block vs File vs Object Storage: Three Models, Three Use Cases Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them.
12 Distributed File Systems: File Storage Across Many Machines Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them.
13 Time Series Databases: Built for Metrics and Events Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB.
14 Vector Databases: Semantic Search and AI Memory Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one.
15 Full-Text Search Engines: Beyond SQL LIKE Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful.
16 Materialized Views: Pre-Computing Expensive Queries ← you are here Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches.
17 Query Optimisation: From Slow to Fast Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast.
18 Connection Pooling: Managing the Hidden Bottleneck Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly.
19 Data & Storage: Wrap-Up A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system.

Materialized Views: Pre-Computing Expensive Queries

The problem

Your URL shortener's analytics dashboard shows a "top links" table: the 50 links with the most clicks in the last 30 days, with per-day breakdowns. The query:

SELECT
  l.id, l.short_code, l.link_name,
  COUNT(c.id) AS total_clicks,
  COUNT(DISTINCT c.country) AS country_count,
  DATE(c.clicked_at) AS click_date,
  COUNT(c.id) AS daily_clicks
FROM links l
JOIN clicks c ON c.link_id = l.id
WHERE c.clicked_at >= NOW() - INTERVAL '30 days'
  AND l.user_id = 123
GROUP BY l.id, l.short_code, l.link_name, DATE(c.clicked_at)
ORDER BY total_clicks DESC
LIMIT 50;

On a clicks table with 10 billion rows, this query takes 45 seconds. That's unacceptable for a dashboard that loads on every page visit.

You've tried:

  • Indexes on (user_id, clicked_at) — helped, but still 8 seconds
  • Denormalisation (a click_count column on links) — covers the total count but not the per-day breakdown or the 30-day filter
  • Read replicas — the same slow query, just on a different node

The issue isn't how the query is executed — it's that the underlying computation is expensive, and you're running it thousands of times per day on data that changes by at most a few percentage points between dashboard loads.


The core idea

A materialized view stores the result of a query as a physical table. Instead of running the expensive query on demand, you pre-compute it and store the result. Dashboard reads become fast simple lookups on a pre-built result set. The materialized view is refreshed on a schedule or on demand, accepting that the data may be slightly stale in exchange for fast reads.


The analogy: the printed summary report

An organisation's finance team runs a complex analysis every morning: last night's transactions, reconciled against budgets, broken down by department and cost centre. The analysis takes 30 minutes to run.

If every executive who wants to see "how did we do yesterday?" triggers the 30-minute analysis themselves, 20 executives × 30 minutes = 10 hours of compute time. And they're all looking at the same data.

The solution: run the analysis once at 7am, print the report, distribute it. Everyone reads the printed report. It's from 7am — not real-time — but for "how did we do yesterday?" that's fine.

A materialized view is the printed report. The refresh process is the 7am analysis run. Trading freshness for scale.


How it works

Creating a materialized view in PostgreSQL

CREATE MATERIALIZED VIEW mv_user_link_analytics AS
SELECT
  l.id AS link_id,
  l.short_code,
  l.link_name,
  l.user_id,
  DATE(c.clicked_at) AS click_date,
  COUNT(c.id) AS daily_clicks,
  COUNT(DISTINCT c.country) AS unique_countries
FROM links l
JOIN clicks c ON c.link_id = l.id
WHERE c.clicked_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY l.id, l.short_code, l.link_name, l.user_id, DATE(c.clicked_at);

-- Add indexes to the materialized view for fast dashboard queries
CREATE INDEX idx_mv_user ON mv_user_link_analytics(user_id);
CREATE INDEX idx_mv_link_date ON mv_user_link_analytics(link_id, click_date DESC);

The dashboard query now becomes:

SELECT link_id, short_code, link_name, click_date, daily_clicks
FROM mv_user_link_analytics
WHERE user_id = 123
ORDER BY daily_clicks DESC
LIMIT 50;

This query runs in milliseconds — it reads from a pre-built, indexed table rather than aggregating billions of click rows on demand.

Refreshing the materialized view

The materialized view is a snapshot. It goes stale as new data arrives. Two refresh strategies:

Full refresh (REFRESH MATERIALIZED VIEW):

-- Refresh the entire materialized view
REFRESH MATERIALIZED VIEW mv_user_link_analytics;

Rebuilds the entire view from scratch by re-running the defining query. Simple, always produces a perfectly up-to-date result. Cost: the refresh runs the original expensive query — if that takes 45 seconds, the refresh takes 45 seconds.

Concurrent refresh (REFRESH MATERIALIZED VIEW CONCURRENTLY):

-- Refresh without blocking reads
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_user_link_analytics;

Builds a new snapshot and swaps it with the old one atomically. The old view remains readable during the refresh — no downtime for the dashboard. Requires a unique index on the materialized view.

Refresh scheduling

PostgreSQL doesn't have built-in scheduled refresh — it must be triggered externally:

# Run via pg_cron (PostgreSQL extension) or an external scheduler
# Refresh every 15 minutes
SELECT cron.schedule('*/15 * * * *', 'REFRESH MATERIALIZED VIEW CONCURRENTLY mv_user_link_analytics');

Common schedules:

  • Dashboard data that changes rapidly: every 5 minutes
  • Daily report data: once per day at low-traffic hours
  • Aggregate data over historical periods: weekly

Materialized views in other systems

BigQuery: materialized views are refreshed automatically and incrementally when the base tables change — no manual refresh needed. BigQuery tracks which rows changed and updates only the affected portions of the materialized view.

Snowflake: similar to BigQuery — automatic incremental refresh on source table changes.

Redshift: supports materialized views with automatic refresh for some query patterns.

dbt (data build tool): not a database feature, but a workflow tool. dbt models can be materialised as physical tables in your data warehouse on a schedule — effectively managing materialised views as code with version control, testing, and documentation.


When to use a materialized view vs alternatives

Approach Good for Not for
Materialized view Expensive aggregate queries run frequently on mostly-static data Queries needing real-time freshness
Denormalisation Frequently-accessed columns that join is needed for Complex multi-table aggregations
Index Speeding up WHERE, JOIN, ORDER BY on existing queries Pre-computing aggregates
Caching (Redis) Application-layer caching with TTL and custom invalidation SQL queries that must run against fresh data
OLAP database (ClickHouse) Very large scale aggregation analytics Transactional queries

Materialized views sit between denormalisation (schema-level redundancy) and application caching (external to the database). They're the right tool when:

  • The computation is SQL (a view is already the right abstraction)
  • The source data is in the same database
  • Slight staleness is acceptable (minutes to hours, not seconds)
  • You want the database to manage the result, not application code

The tradeoffs

Staleness. A materialized view is a snapshot. Users querying the dashboard between refreshes see data that's up to refresh_interval old. For most analytics use cases, 5–15 minute lag is acceptable. For "did my last click just register?" style queries, it isn't.

Storage cost. The materialized view stores the result set on disk — the same data that exists in the source tables, in pre-aggregated form. On large datasets, materialized views can be hundreds of GB.

Refresh cost. Full refresh re-runs the defining query. If the defining query is expensive, the refresh is expensive too — potentially as expensive as the original query, but amortised across all reads since the last refresh.

Schema coupling. Materialized views depend on their source tables. Changes to the source table schema (renaming columns, adding/removing tables from a join) require updating and refreshing the view.

Refresh contention. Without CONCURRENTLY, a refresh locks the view for reads. For dashboards with many concurrent users, this can cause visible latency spikes during refresh windows.


The one thing to remember

A materialized view is a pre-printed report: the expensive computation happens once, on a schedule, and fast reads serve the result. The tradeoff is freshness — you're serving data from the last refresh, not the latest write. When your most expensive, most frequent queries aggregate large datasets for analytics or reporting, and your users can accept data that's minutes (not seconds) old, a materialized view is often the cleanest, lowest-effort solution.


← Previous: Full-Text Search Engines: Beyond SQL LIKE — Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text...

→ Next: Query Optimisation: From Slow to Fast — Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and sys...

Systems Design

Part 1 of 50

Understanding these system design concepts is essential for architects, developers, and engineers to create scalable, reliable, and maintainable software systems that meet the needs of businesses.