# Service Mesh: A Programmable Network for Microservices

> **Series:** System Design · Architecture Patterns — Pillar 7 of 8

## Systems Design

| # | Post | What it covers |
| --- | --- | --- |
| 00 | [Architecture Patterns: How Systems Are Structured](/architecture-patterns-how-systems-are-structured) | Twenty patterns covering monoliths, microservices, events, resilience, deployment, and data processing. How to structure systems that survive growth. |
| 01 | [Monolithic Architecture: The Default That Gets Abandoned Too Early](/monolithic-architecture-the-default-that-gets-abandoned-too-early) | Monoliths are fast to build and easy to operate. Learn when they're the right choice, when they break down, and how to know the difference. |
| 02 | [Microservices: The Architecture You Earn, Not Choose](/microservices-the-architecture-you-earn-not-choose) | Microservices enable independent scaling and team autonomy — but at significant cost. Learn what you actually get, what you pay, and when it's worth it. |
| 03 | [Serverless: Pay for What You Use, Not What You Provision](/serverless-pay-for-what-you-use-not-what-you-provision) | Serverless scales to zero and charges per invocation. Learn where it shines, where it fails, and how to design around cold starts and vendor lock-in. |
| 04 | [Event-Driven Architecture: Decoupling Through Events](/event-driven-architecture-decoupling-through-events) | Event-driven systems communicate via events rather than direct calls. Learn how producers, consumers, and event brokers work — and the consistency tradeoffs involved. |
| 05 | [Message Queues: Decoupling Produce from Consume](/message-queues-decoupling-produce-from-consume) | Message queues decouple producers and consumers, enable load levelling, and provide durability. Learn how they work and when to use Kafka vs SQS vs RabbitMQ. |
| 06 | [Pub/Sub: Broadcasting Events to Multiple Consumers](/pubsub-broadcasting-events-to-multiple-consumers) | Pub/sub decouples publishers from subscribers through topics. Learn how it differs from message queues and when to use Kafka, SNS, or Google Pub/Sub. |
| 07 | [CQRS: When Reads and Writes Need Different Models](/cqrs-when-reads-and-writes-need-different-models) | CQRS separates writes from reads so each can be optimised independently. Learn how it works, when it's worth the complexity, and when it isn't. |
| 08 | [Event Sourcing: The Ledger, Not the Balance](/event-sourcing-the-ledger-not-the-balance) | Event sourcing stores state as a sequence of events. Learn how it works, what you get (audit log, time travel), and what it costs (complexity, schema evolution). |
| 09 | [The Saga Pattern: Distributed Transactions Without Locks](/the-saga-pattern-distributed-transactions-without-locks) | The Saga pattern manages distributed transactions across services using compensating transactions. Learn choreography vs orchestration and when to use each. |
| 10 | [The Outbox Pattern: Atomic Writes and Event Publishing](/the-outbox-pattern-atomic-writes-and-event-publishing) | The Outbox pattern solves the dual-write problem — publishing an event and writing to a database atomically. Learn how it works using CDC or polling. |
| 11 | [The Circuit Breaker: Stopping Cascading Failures](/the-circuit-breaker-stopping-cascading-failures) | Circuit breakers prevent cascading failures by fast-failing calls to unhealthy dependencies. Learn the three states, how to configure them, and where to apply them. |
| 12 | [The Bulkhead Pattern: Containing Failures Through Resource Isolation](/the-bulkhead-pattern-containing-failures-through-resource-isolation) | Bulkheads isolate thread pools and connections per dependency so one failure can't exhaust resources needed by others. Learn how to apply them in practice. |
| 13 | [The Sidecar Pattern: Cross-Cutting Concerns Without Code Changes](/the-sidecar-pattern-cross-cutting-concerns-without-code-changes) | The sidecar pattern deploys a helper process alongside each service for logging, metrics, TLS, and service discovery — without modifying the service itself. |
| 14 | **Service Mesh: A Programmable Network for Microservices** ← you are here | A service mesh handles service-to-service traffic, mTLS, circuit breaking, and observability via a fleet of sidecar proxies. Learn how it works and when to use it. |
| 15 | [Service Discovery: Finding Services in a Dynamic Environment](/service-discovery-finding-services-in-a-dynamic-environment) | Service discovery lets services find each other in dynamic environments. Learn client-side vs server-side discovery, health checks, and DNS vs registry approaches. |
| 16 | [The Strangler Fig: Replacing a Legacy System Without Burning It Down](/the-strangler-fig-replacing-a-legacy-system-without-burning-it-down) | The Strangler Fig replaces a legacy system incrementally by routing specific functionality to new implementations while the old system keeps running. |
| 17 | [Backend for Frontend: One API Per Client Type](/backend-for-frontend-one-api-per-client-type) | BFF creates dedicated API backends per client type. Learn why one general API struggles to serve mobile and web well, and how BFF solves it. |
| 18 | [ETL Pipelines: Moving Data from Operations to Analytics](/etl-pipelines-moving-data-from-operations-to-analytics) | ETL moves data from operational systems into analytical stores. Learn how pipelines work, what ELT is, and how to design reliable data movement at scale. |
| 19 | [Batch vs Stream Processing: How Fresh Do Your Answers Need to Be?](/batch-vs-stream-processing-how-fresh-do-your-answers-need-to-be) | Batch processing accumulates data and then processes it in bulk; streaming processes each event as it arrives. Learn the tradeoffs and when each is right. |
| 20 | [MapReduce: Processing Petabytes in Parallel](/mapreduce-processing-petabytes-in-parallel) | MapReduce processes massive datasets in parallel by splitting work into map and reduce phases. Learn how it works and why Spark has largely replaced it. |
| 21 | [Architecture Patterns: Wrap-Up](/architecture-patterns-wrap-up) | A recap of all 20 architecture patterns across decomposition, async communication, data patterns, resilience, and data processing. How they connect. |

* * *

## The problem

You have twenty microservices. Each needs consistent retry logic, circuit breaking, timeouts, load balancing, mTLS, distributed tracing, and traffic routing for canary deployments. You've solved this with sidecars — every service has an Envoy proxy co-deployed.

But now you have twenty Envoy instances. How do you configure them consistently? How do you push a new circuit breaker policy across all twenty services at once? How do you do a canary deployment for the Link Service — routing 5% of traffic to the new version — without touching application code? How do you get a unified view of service-to-service traffic across the entire cluster?

Each sidecar in isolation is a useful tool. A fleet of sidecars with a shared control plane is a service mesh.

* * *

## The core idea

A service mesh is a dedicated infrastructure layer for service-to-service communication. It consists of a data plane (the fleet of sidecar proxies that handle actual traffic) and a control plane (the central management component that configures the proxies, distributes certificates, and collects telemetry). Together, they provide traffic management, security, and observability across all services, with little or no change to application code.

* * *

## The analogy: a managed road network

Independent roads (sidecars alone) work — cars get from A to B — but each driver must know the routes, obey their own traffic rules, and manage their own navigation.

A managed road network (service mesh) adds: traffic signals that can be reconfigured centrally, tolls that enforce access rules, surveillance cameras that feed into a central dashboard, variable speed signs for flow control. Cars (services) still drive — but the network manages the conditions.

* * *

## How a service mesh works

### Data plane (Envoy sidecars)

Every service pod has an Envoy sidecar that intercepts all inbound and outbound network traffic via iptables redirection. Envoy handles:

*   **Load balancing** across destination service instances
    
*   **Retries** on transient failures
    
*   **Circuit breaking** when a destination service degrades
    
*   **mTLS** — all service-to-service traffic is encrypted and mutually authenticated
    
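*   **Request tracing**: Envoy generates a span for each hop, but the service must forward incoming trace headers (such as `traceparent` or the `x-b3-*` headers) on its outbound calls so the spans join into a single trace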
    

The service code makes a plain HTTP call to `analytics-service:8080`. Envoy intercepts it, adds mTLS, applies retry policy, emits a trace span, and forwards it. From the service's perspective: a normal HTTP call.
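As a rough illustration of how "mTLS everywhere" is switched on, here is a minimal sketch of a mesh-wide Istio policy. It assumes Istio is the mesh in use and that `istio-system` is its root namespace; neither is stated in this article.

```yaml
# Sketch only: require mTLS for every workload in the mesh.
# Assumption: Istio is installed and istio-system is the root namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # sidecars reject plaintext service-to-service traffic
```

Applying the policy in the root namespace makes it the mesh-wide default; the same resource in an ordinary namespace would scope it to that namespace only.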

### Control plane (Istio / Linkerd)

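The control plane manages the proxy fleet. In Istio that fleet is made up of Envoy sidecars; Linkerd pairs its control plane with its own Rust-based proxy instead. The responsibilities are the same: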

*   **Certificate authority:** issues and rotates mTLS certificates for every service identity automatically
    
*   **Configuration distribution:** pushes routing rules, retry policies, circuit breaker settings to each Envoy instance via the xDS API
    
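*   **Telemetry:** standardises the metrics every proxy emits (request rate, error rate, latency) so Prometheus/Grafana can present them mesh-wide; in current Istio releases, Prometheus scrapes each proxy directly rather than going through the control plane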
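To make "circuit breaker settings" concrete, the following is a minimal sketch of the kind of policy the control plane distributes to each proxy, written as an Istio `DestinationRule`. The thresholds are illustrative assumptions, not recommendations, and `analytics-service` is reused from the example above.

```yaml
# Sketch only: circuit breaking for calls to analytics-service.
# All numeric values are placeholders chosen for illustration.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: analytics-service
spec:
  host: analytics-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap on concurrent connections
      http:
        http1MaxPendingRequests: 50  # requests queued beyond this fail fast
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5        # eject an instance after 5 straight 5xx responses
      interval: 10s                  # how often instances are evaluated
      baseEjectionTime: 30s          # minimum time an ejected instance stays out
      maxEjectionPercent: 50         # never eject more than half the instances
```

Changing one of these values and re-applying the resource updates every sidecar that calls `analytics-service`, which is the "push a new policy at once" capability described in the problem statement.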

%%[service-mesh-architecture-widget] 

<!--
```plaintext
Control Plane (Istiod):
  → Pushes routing config to all Envoy sidecars
  → Issues mTLS certificates (rotated every 24h)
  → Aggregates telemetry

Envoy sidecar (link-service pod):
  ← Config from control plane
  → Handles all traffic to/from link-service
  → Reports metrics and traces

Envoy sidecar (analytics-service pod):
  ← Config from control plane
  → Handles all traffic to/from analytics-service
  → Reports metrics and traces
```
-->

### Traffic management

The control plane enables sophisticated traffic routing that would otherwise require application code or multiple load balancer rules:

**Canary deployment:**

```yaml
# Route 95% of traffic to stable, 5% to canary; no application changes
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: link-service
spec:
  hosts:            # required: the service name(s) these rules apply to
  - link-service
  http:
  - route:
    - destination:
        host: link-service
        subset: stable
      weight: 95
    - destination:
        host: link-service
        subset: canary
      weight: 5
```
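The `stable` and `canary` subsets referenced in the routing rule have to be defined somewhere. In Istio that is done with a `DestinationRule`; a minimal sketch follows. The `version: v1` and `version: v2` pod labels are assumptions for illustration, so substitute whatever labels distinguish your two deployments.

```yaml
# Sketch only: define the subsets the VirtualService routes to.
# Assumption: stable pods carry version=v1 and canary pods carry version=v2.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: link-service
spec:
  host: link-service
  subsets:
  - name: stable
    labels:
      version: v1
  - name: canary
    labels:
      version: v2
```

Promoting the canary then becomes a matter of editing the two `weight` values in the routing rule until the new version takes 100% of traffic.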

This is how the above setup looks in a diagram:

%%[service-mesh-canary-widget] 

**Fault injection (for chaos engineering):**

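```yaml
# Inject a 5-second delay on 10% of requests to analytics-service for testing
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: analytics-service
spec:
  hosts:
  - analytics-service
  http:
  - fault:
      delay:
        percentage:
          value: 10
        fixedDelay: 5s
    route:
    - destination:
        host: analytics-service
```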

**Header-based routing:** route requests with header `X-Beta-User: true` to a beta version of the service.
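A minimal sketch of that header rule as an Istio `VirtualService` might look like the following. The `beta` subset is hypothetical and would need its own entry in a `DestinationRule`; in practice this rule would also be merged with any other routes for `link-service` rather than kept in a separate resource.

```yaml
# Sketch only: send beta users to a beta subset, everyone else to stable.
# Assumption: a "beta" subset exists for link-service.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: link-service
spec:
  hosts:
  - link-service
  http:
  - match:
    - headers:
        x-beta-user:        # header names are matched case-insensitively; written lowercase here
          exact: "true"
    route:
    - destination:
        host: link-service
        subset: beta
  - route:                  # fallback for requests without the header
    - destination:
        host: link-service
        subset: stable
```

Rules are evaluated top to bottom, so the header match must come before the catch-all route.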

**Retry and timeout policies:** configure global defaults applied to all services without touching service code.
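As a hedged illustration, here is what a retry and timeout policy could look like for a single service in Istio. The values are placeholders, and this sketch is scoped to one host; mesh-wide defaults are set through the mesh's own configuration rather than through a per-service resource like this one.

```yaml
# Sketch only: timeout and retry policy for calls to analytics-service.
# All durations and counts are illustrative assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: analytics-service
spec:
  hosts:
  - analytics-service
  http:
  - timeout: 2s                       # overall deadline for the request
    retries:
      attempts: 3                     # retry up to 3 times
      perTryTimeout: 500ms            # deadline for each individual attempt
      retryOn: 5xx,connect-failure    # only retry on these failure classes
    route:
    - destination:
        host: analytics-service
```

Keeping the per-try timeout multiplied by the attempt count below the overall timeout (here 1.5s against 2s) gives the retries room to run before the outer deadline fires.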

* * *

## Tradeoffs

**Power vs complexity.** A service mesh provides capabilities (mTLS everywhere, canary deployments, fault injection, unified observability) that would otherwise require significant investment in application code, or would not be feasible at all. The cost is substantial: the control plane is a complex distributed system in its own right. Istiod has failure modes, upgrade procedures, and operational quirks that must be understood.

**Resource overhead.** Each Envoy sidecar consumes 50–200 MB of RAM and adds ~1–3 ms of latency per service call. In a cluster with hundreds of pods, this is a meaningful infrastructure cost: 300 pods, for example, means roughly 15–60 GB of RAM spent on proxies alone.
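One way to keep that overhead predictable is to set explicit resource requests and limits on the injected proxy. The fragment below is a sketch using Istio's per-pod sidecar annotations; the specific values are assumptions and should come from measuring your own workloads.

```yaml
# Sketch only: pod template fragment that sizes the injected Envoy sidecar.
# Values are placeholders, not recommendations.
template:
  metadata:
    annotations:
      sidecar.istio.io/proxyCPU: "100m"          # CPU request for the sidecar
      sidecar.istio.io/proxyMemory: "128Mi"      # memory request for the sidecar
      sidecar.istio.io/proxyMemoryLimit: "256Mi" # hard memory cap for the sidecar
```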

**The complexity cliff.** For small microservices deployments (under 5–10 services), a service mesh is almost certainly overkill: the operational overhead exceeds the benefit. The breakeven point differs for every organisation, but most teams don't need one until they have genuine at-scale problems with service-to-service security or traffic management.

**Linkerd vs Istio:** Linkerd is simpler, lighter (Rust-based proxy), and easier to operate. Istio is more powerful, more configurable, and more complex. Linkerd is generally the better starting point; Istio fits organisations with complex traffic management requirements.

%%[service-mesh-linkerd-vs-istio-widget] 

* * *

## The one thing to remember

> **A service mesh solves service-to-service communication at scale by moving networking concerns (security, routing, observability, resilience) out of application code and into a managed infrastructure layer.** The data plane (sidecar proxies) handles traffic; the control plane manages configuration and certificates. You pay for this in operational complexity and resource overhead — only justified when you have a genuine fleet of services with cross-cutting communication requirements that can't be handled by simpler means.

* * *

*← Previous:* [***Sidecar***](/the-sidecar-pattern-cross-cutting-concerns-without-code-changes) *— deploying a helper process alongside each service to handle cross-cutting concerns like logging, metrics, and service discovery.*

*→ Next:* [***Service Discovery***](/service-discovery-finding-services-in-a-dynamic-environment) *— in a dynamic environment where service instances start and stop constantly, how do services find each other?*