# Blameless Post-Mortems: How to Turn Outages Into the Best Learning Your Team Gets

> **Series: The Modern SDLC** · Post 15 of 17 *←* [*Post 14: Alerting and On-Call*](/alerting-that-doesn-t-burn-out-your-team) *·* [*Post 16: Platform Engineering and FinOps →*](/cloud-costs-and-platform-engineering-making-the-right-thing-the-default-thing)

* * *

Every engineering team has incidents. The difference between teams that improve over time and teams that have the same incidents repeatedly isn't whether they have outages — it's what they do after them.

A production incident is, in the right conditions, the best learning opportunity an engineering team gets. It's a real failure in a real system under real conditions, with real consequences that make the lesson stick. The conditions that turn an incident into learning are: honest analysis, shared understanding, and action that addresses the systemic cause rather than the proximate one.

The conditions that prevent learning are blame, defensiveness, and post-mortems that produce a list of twelve improvements nobody is assigned to complete.

Most teams are closer to the second version than they'd like to admit. The retrospective happens. The findings are documented. The document lives in Confluence and is never opened again. The same incident recurs six months later and the team discovers the previous post-mortem in their investigation.

This post covers how to build the first version: incident management that minimises impact while it's happening, and post-mortems that produce systemic improvement when it's over.

* * *

## The one thing to remember

**Incidents are inevitable. The measure of an engineering team is not whether they have outages — it's how fast they recover and how thoroughly they learn. Every incident that doesn't produce lasting improvement is an incident that happened twice.**

* * *

## The incident lifecycle

An incident moves through five phases. Each phase has a distinct goal, and confusing the goals of one phase with another is a reliable source of extended incidents and incomplete learning.

**Detect** — something is wrong and the team knows it. Alert fires, user reports, anomaly spotted by an engineer. The goal is speed: time to detection is the start of impact. Declare an incident early — the cost of a false declaration is five minutes of wasted time. The cost of a delayed declaration is extended user impact.

**Respond** — the team mobilises. Assign an Incident Commander, open a dedicated channel, triage severity, notify stakeholders. The goal is coordination: getting the right people working on the right things without duplication or gaps.

**Mitigate** — stop the bleeding. The goal is to end user impact as fast as possible, using whatever means are available. Mitigation is not root cause analysis. Roll back the deployment. Turn off the feature flag. Scale the service. Redirect traffic. Mitigating first and diagnosing second is the discipline that separates teams with five-minute MTTRs from teams with ninety-minute ones.

**Resolve** — confirm recovery, monitor stability, declare the all-clear. The goal is certainty: the system is healthy, the metrics confirm it, and the team has watched it for long enough to be confident.

**Learn** — post-mortem, root cause analysis, action items. The goal is systemic improvement: understanding not just what broke, but why the conditions existed for it to break, and changing those conditions.

* * *

## The Incident Commander: the role most teams skip

The single most impactful structural change a team can make to their incident response is designating an Incident Commander.

Without an IC, a major incident tends to look like this: everyone in the channel investigates, proposes solutions, and updates stakeholders at the same time. Nobody is coordinating. Critical steps get missed. Good ideas get lost in the noise. The person who shouts loudest drives decisions rather than the person with the clearest picture. Meanwhile the incident continues.

The IC doesn't investigate. The IC coordinates. They assign work to specific people, keep the team focused on the most likely resolution path, prevent rabbit holes, make the call when there's a choice between two approaches, and ensure stakeholders are updated on a cadence.

The IC role is not a technical role — it's a coordination role. The IC doesn't need to be the most senior engineer or the most familiar with the affected system. They need to be calm, organised, decisive, and authorised to make calls. These qualities are separable from technical depth, and confusing them is why teams default to having the most experienced engineer investigate, coordinate, and communicate simultaneously — which produces worse outcomes on all three dimensions.

**Roles in a well-run incident:**

**Technical Lead** — drives the investigation and remediation. Heads-down on the problem. Reports findings to the IC. Pulls in specialists when needed.

**Communications Lead** — updates the status page, posts stakeholder updates on cadence, manages external communication. Shields the technical team from inbound questions during the incident.

**Scribe** — documents the timeline in real time: what was observed, what was tried, what was decided, and when. The scribe's notes are the raw material for the post-mortem.

For smaller incidents, roles can be combined — the IC might also handle communications. The non-negotiable is that investigation and coordination are separated. One person leads the technical work; a different person (or the same person wearing a clearly different hat) keeps the response organised.

* * *

## Mitigation first: the discipline that halves your MTTR

The most common mistake during an active incident is trying to find root cause before stopping the bleeding. It feels responsible: you want to understand what happened before changing things. In practice, it can add thirty minutes or more of user impact while the team diagnoses a problem they could already have stopped.

The question the IC should ask at the start of every incident: what is the fastest path to restoring service, independent of understanding why this happened?

The options, roughly in order of speed and safety:

**Feature flag off** — sub-second, no deployment, no risk. If the impacted feature is behind a flag, this is always the first action. The investigation can continue after service is restored.

**Rollback** — minutes, requires the previous version to be safe. If a recent deployment is the suspected cause and rollback is straightforward, do it first and diagnose second. You can always re-deploy after understanding the problem.

**Traffic redirect** — route away from the affected region, availability zone, or service instance. If the problem is isolated, reducing blast radius while investigating is better than waiting for a fix.

**Scale up** — if resource exhaustion is the cause, scaling is often faster than identifying why resources are being exhausted.

**Hotfix** — slowest, highest risk, last resort during an active incident. Writing new code under pressure with reduced context and urgency-induced tunnel vision produces bugs. Hotfixes should be reserved for cases where no other mitigation exists.
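The ordering above can be written down as a small lookup so the IC is not reconstructing it under pressure. This is a minimal sketch, not a real tool: the `Mitigation` class, the `PLAYBOOK` list, and the situation keys such as `behind_feature_flag` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    name: str
    applies_when: str  # hypothetical key into the situation dict


# Ordered as in the list above: fastest and safest first.
PLAYBOOK = [
    Mitigation("feature flag off", "behind_feature_flag"),
    Mitigation("rollback", "recent_deploy_suspected"),
    Mitigation("traffic redirect", "problem_isolated"),
    Mitigation("scale up", "resource_exhaustion"),
]
# Last resort when nothing else applies.
HOTFIX = Mitigation("hotfix", "no_other_option")


def fastest_mitigation(situation: dict[str, bool]) -> Mitigation:
    """Return the first playbook entry whose condition holds, else hotfix."""
    for option in PLAYBOOK:
        if situation.get(option.applies_when, False):
            return option
    return HOTFIX


choice = fastest_mitigation(
    {"recent_deploy_suspected": True, "resource_exhaustion": True}
)
print(choice.name)  # rollback
```

When two conditions hold at once, the earlier entry wins, which is the point of keeping the list ordered.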

**Declare mitigation and resolution separately.** Mitigation is when the immediate user impact stops — rollback applied, flag off, traffic redirected. Resolution is when the underlying cause is fixed and the proper solution is deployed. Track both timestamps. MTTM (mean time to mitigate) is the metric users experience. MTTR (mean time to resolve) is the metric the team works toward. Conflating them makes both worse.
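Tracking both timestamps makes the two metrics a few lines of arithmetic. The sketch below assumes the clock starts at detection and uses made-up incident times; `IncidentTimes`, `mttm`, and `mttr` are illustrative names, not part of any incident tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentTimes:
    detected_at: datetime
    mitigated_at: datetime  # user impact stopped
    resolved_at: datetime   # underlying cause fixed


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttm(incidents: list[IncidentTimes]) -> float:
    """Mean time to mitigate, measured from detection (an assumption)."""
    return mean_minutes([i.mitigated_at - i.detected_at for i in incidents])


def mttr(incidents: list[IncidentTimes]) -> float:
    """Mean time to resolve, measured from detection (an assumption)."""
    return mean_minutes([i.resolved_at - i.detected_at for i in incidents])


# Hypothetical data: two incidents on different days.
incidents = [
    IncidentTimes(
        datetime(2024, 11, 15, 14, 30),
        datetime(2024, 11, 15, 14, 47),
        datetime(2024, 11, 15, 16, 30),
    ),
    IncidentTimes(
        datetime(2024, 11, 20, 9, 0),
        datetime(2024, 11, 20, 9, 5),
        datetime(2024, 11, 20, 10, 0),
    ),
]
print(mttm(incidents), mttr(incidents))  # 11.0 90.0
```

Keeping the two functions separate mirrors the advice above: a fast rollback improves MTTM immediately, even when the proper fix takes the rest of the afternoon.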

* * *

## The incident channel and timeline

Open a dedicated channel immediately on incident declaration: `#inc-2024-11-15-payment-degradation`. All incident communication goes here — no side channels, no DMs, no "quick calls" that produce decisions invisible to the rest of the team.

This serves two purposes. During the incident, it ensures everyone working on the problem has the same information. After the incident, it provides an automatic timeline of events that becomes the foundation of the post-mortem. Every hypothesis, every action, every decision, every command run — captured in the channel with timestamps.

The scribe's job is to make sure the important things land in the channel, even when the pace is fast. "14:38 — hypothesis: new Stripe SDK version changed error format. Testing rollback." "14:41 — IC decision: rolling back to v2.4.0." "14:47 — error rate dropping: 4.8% → 2.1% → 0.3%."
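If you are not using one of the tools mentioned below, a consistent channel name and timeline format can still be generated with a few lines. This is a sketch under assumptions: the function names are hypothetical, the naming scheme simply follows the example channel above, and the entry separator is a plain hyphen.

```python
import re
from datetime import date, datetime


def incident_channel(day: date, summary: str) -> str:
    """Build a channel name like #inc-2024-11-15-payment-degradation."""
    # Lowercase, then collapse anything that is not a-z or 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def timeline_entry(at: datetime, kind: str, text: str) -> str:
    """Format one scribe note as 'HH:MM - kind: text'."""
    return f"{at:%H:%M} - {kind}: {text}"


print(incident_channel(date(2024, 11, 15), "Payment degradation"))
# #inc-2024-11-15-payment-degradation
print(timeline_entry(datetime(2024, 11, 15, 14, 41), "IC decision", "rolling back to v2.4.0"))
# 14:41 - IC decision: rolling back to v2.4.0
```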

Tools like Rootly, FireHydrant, and PagerDuty automate channel creation, timeline capture, and status page updates when an incident is declared. Worth the investment for teams who run incidents frequently — they reduce the overhead of the coordination work and produce better documentation automatically.

* * *

## Stakeholder communication: silence is always worse than uncertainty

During a SEV1 or SEV2, stakeholders — product leadership, customer success, support teams — need to know what's happening. They don't need technical detail. They need status, impact, and when to expect the next update.

The cadence that works: update every fifteen to thirty minutes for SEV1, every thirty minutes for SEV2. Even if the update is "still investigating, no change in status." Silence communicates that nobody is working on it or that it's worse than previously stated. Regular updates — even null updates — communicate that the team is engaged and the situation is understood.

The template: current status, what we know about the impact, what the team is doing, when the next update will come. "We're experiencing elevated error rates on payment processing affecting approximately 30% of checkout attempts. The team has identified a likely cause and is working on a rollback. We expect to have an update within twenty minutes."

Update the status page with the same information, publicly, for any user-facing impact. This reduces support ticket volume from users who notice problems and assume the worst when they don't see acknowledgement.
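The cadence and the four-part template above fit naturally into one small structure, so that every update states when the next one is due. A minimal sketch follows; the `StatusUpdate` class and its field names are hypothetical, and the SEV1 interval uses the tight end of the fifteen-to-thirty-minute range as an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Minutes between updates. SEV1 uses 15, the short end of the range (assumption).
CADENCE_MINUTES = {"SEV1": 15, "SEV2": 30}


@dataclass
class StatusUpdate:
    severity: str
    status: str
    impact: str
    action: str
    sent_at: datetime

    def next_update_at(self) -> datetime:
        """When stakeholders should expect to hear from the team again."""
        return self.sent_at + timedelta(minutes=CADENCE_MINUTES[self.severity])

    def render(self) -> str:
        """Produce the four-part message: status, impact, action, next update."""
        return (
            f"[{self.severity}] Status: {self.status}\n"
            f"Impact: {self.impact}\n"
            f"What we're doing: {self.action}\n"
            f"Next update by: {self.next_update_at():%H:%M}"
        )


update = StatusUpdate(
    severity="SEV1",
    status="elevated error rates on payment processing",
    impact="approximately 30% of checkout attempts failing",
    action="likely cause identified, rollback in progress",
    sent_at=datetime(2024, 11, 15, 14, 40),
)
print(update.render())
```

A null update uses the same structure: only `status` changes to "still investigating, no change", and the next-update time still moves forward.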

* * *

## The blameless post-mortem

A blameless post-mortem starts from one premise: engineers are competent professionals who made reasonable decisions with the information available to them at the time. When something goes wrong, the question is not "who made the mistake" — it's "what conditions made this mistake possible, and how do we change those conditions."

This isn't naivety about accountability. It's an understanding of how to actually prevent recurrence. Fixing the person (training, reprimand, process enforcement) prevents that person from making that specific mistake again in that specific situation. Fixing the conditions prevents the next person from making the same class of mistake in any situation. The second intervention is far more effective.

The practical consequence: in a blameless post-mortem, there are no "the engineer should have" statements. Every finding is framed as a systemic condition. "The engineer didn't check the migration compatibility" becomes "no automated check verifies migration compatibility before deployment." The second framing points directly at the fix. The first creates defensiveness and points at a person.
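One lightweight way to hold that line is to scan draft findings for person-pointing phrases before the document is published. The sketch below is a rough aid, not a substitute for a facilitator: the phrase list is an illustrative assumption and `blame_phrases` is a hypothetical helper.

```python
import re

# Phrases that point at a person rather than a condition (illustrative list).
BLAME_PATTERNS = [
    r"\bshould have\b",
    r"\bfailed to\b",
    r"\bforgot to\b",
    r"\b(?:the )?engineer didn'?t\b",
]
_BLAME_RE = re.compile("|".join(BLAME_PATTERNS), re.IGNORECASE)


def blame_phrases(finding: str) -> list[str]:
    """Return every person-pointing phrase found in a draft finding."""
    return [m.group(0) for m in _BLAME_RE.finditer(finding)]


print(blame_phrases("The engineer didn't check the migration compatibility"))
# ["The engineer didn't"]
print(blame_phrases("No automated check verifies migration compatibility before deployment"))
# []
```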

**Psychological safety is not optional.** If engineers believe that honesty in a post-mortem will result in blame, career consequences, or management scrutiny, they will not be honest. They will present events in the most favourable light. The most important information — the near-misses, the shortcuts taken under pressure, the known risks that were accepted — will stay hidden. And the post-mortem will produce recommendations that address the sanitised version of events rather than what actually happened.

The engineering manager's role in a post-mortem is to ask questions, not to defend decisions or assign accountability. Their presence alone changes the dynamic if they hold authority over the engineers in the room. Consider a neutral facilitator — a senior engineer from another team, an SRE — for major incidents where the stakes of honesty feel high.

* * *

## Root cause analysis: the 5 Whys

The proximate cause is what broke. The root cause is why the conditions existed for it to break. Fixing only the proximate cause prevents this exact incident from recurring. Fixing the root cause prevents the class of incidents it represents.

The 5 Whys is the structured technique for getting from the proximate cause to the systemic one. For each answer, ask "why?" again until you reach a cause that points to a process, a policy, a resource constraint, or an organisational condition — not a person.

An example, following the chain:

**Why did payment processing fail?** The Stripe SDK upgrade introduced an undocumented breaking change in the error response format.

**Why didn't we catch it before production?** Our tests mock the Stripe error responses using the old format. They passed despite the real format changing.

**Why do our tests mock Stripe responses rather than using the sandbox?** Integration tests against the real Stripe sandbox were considered too slow for CI and the setup was complex.

**Why was the slow setup never addressed?** There was no team standard for integration testing third-party APIs, and no one owned the decision.

**Why is there no standard?** We've never had an incident caused by a third-party API change before. The risk wasn't visible until now.

The fix at Why 1: update the error handling code. The fix at Why 5: establish a policy for third-party API integration testing and assign ownership. Only the second fix prevents the class of incidents this represents — any SDK upgrade, any external API change, by any third party. Most teams stop at Why 1 or Why 2.

The 5 Whys can be applied in parallel for different contributing factors. An incident rarely has a single cause — it's usually a combination of conditions that aligned. Each chain reveals a different systemic gap, and the complete picture is more valuable than any single chain.
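Recording each chain as data makes it easy to see where a post-mortem stopped. The sketch below encodes the example chain in abbreviated form; the `Why` class, the category labels attached to each answer, and the `reaches_systemic_cause` check are assumptions made for illustration.

```python
from dataclasses import dataclass

# Categories the text treats as systemic end points.
SYSTEMIC = {"process", "policy", "resource constraint", "organisational condition"}


@dataclass
class Why:
    question: str
    answer: str
    category: str  # label assigned by the facilitator (assumption)


def reaches_systemic_cause(chain: list[Why]) -> bool:
    """True if the last answer in the chain lands on a systemic category."""
    return bool(chain) and chain[-1].category in SYSTEMIC


chain = [
    Why("Why did payment processing fail?",
        "SDK upgrade changed the error response format", "code"),
    Why("Why didn't we catch it before production?",
        "Tests mock the old error format", "testing"),
    Why("Why mock rather than use the sandbox?",
        "Sandbox tests were considered too slow for CI", "tooling"),
    Why("Why was the slow setup never addressed?",
        "No team standard and no owner", "process"),
    Why("Why is there no standard?",
        "The risk was not visible until now", "policy"),
]

print(reaches_systemic_cause(chain[:2]))  # False
print(reaches_systemic_cause(chain))      # True
```

The check is only a floor: it flags chains that stop at code or tests, but it cannot tell whether the systemic cause named is the right one.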

* * *

## Action items: the only output that matters

A post-mortem that produces a document nobody reads is worth nothing. A post-mortem that produces a document that sits in Confluence for three months until the next incident is worth less than nothing — it creates the illusion of learning without the substance.

The only output that matters is action items: specific, assigned, time-bound changes to the system, process, or tooling.

Four categories of action item are worth distinguishing:

**Prevention** — changes that make this class of incident less likely. Adding integration tests against the real Stripe sandbox. Implementing schema compatibility checks in CI. Adding database query performance tests.

**Detection** — changes that make this class of incident faster to detect. Adding an alert for the specific symptom pattern. Improving the dashboard to surface the relevant metric. Adding structured log fields that would have made the root cause visible earlier.

**Mitigation** — changes that make this class of incident faster to stop. Adding a feature flag to the affected functionality. Improving rollback speed. Documenting the specific mitigation steps in the runbook.

**Process** — changes to how the team works. Establishing a policy for third-party API testing. Requiring a load test before deploying to production. Adding a runbook review step to the release checklist.

**The discipline:** limit to three to five action items per post-mortem. More than that and none get done — the volume itself signals that prioritisation hasn't happened. Vote on the highest-impact items, assign each to a named individual, set a due date, and create tickets in the team's backlog before the meeting ends. Not after. Not as a follow-up. Before the meeting ends.

Open the next post-mortem by reviewing the previous one's action items. Were they completed? Did they help? This accountability converts post-mortems from an exercise in documentation into an engine for improvement.
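The "before the meeting ends" rule can be turned into a checklist that the facilitator runs against the draft list. A minimal sketch, assuming the four categories above and treating owner, due date, and ticket as required; `ActionItem` and `review_problems` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

CATEGORIES = {"Prevention", "Detection", "Mitigation", "Process"}


@dataclass
class ActionItem:
    action: str
    category: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket: Optional[str] = None


def review_problems(items: list[ActionItem]) -> list[str]:
    """Return reasons the meeting should not end yet; empty means ready."""
    problems = []
    if not 3 <= len(items) <= 5:
        problems.append(f"expected 3-5 action items, got {len(items)}")
    for item in items:
        if item.category not in CATEGORIES:
            problems.append(f"'{item.action}': unknown category {item.category!r}")
        missing = [
            name for name in ("owner", "due", "ticket")
            if getattr(item, name) is None
        ]
        if missing:
            problems.append(f"'{item.action}': missing {', '.join(missing)}")
    return problems


# A draft with one item and no owner fails on both counts.
draft = [ActionItem("Add Stripe sandbox integration tests", "Prevention")]
for problem in review_problems(draft):
    print(problem)
```

An empty return value is the signal that every item has a name, a date, and a ticket attached.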

* * *

## The post-mortem document

The post-mortem document is a record, not a report. It captures what happened, what was learned, and what will be done — in enough detail that someone who wasn't involved can understand the full story. It should be internally published and accessible to any engineer in the organisation.

A structure that works:

```plaintext
# Post-Mortem: [Service/Feature] — [Date]

Status: Action items in progress / Complete
Severity: SEV[N]  |  Duration: [X] minutes  |  Impact: [description]

## Summary
Two to three sentences. What broke, what caused it, how it was resolved.

## Timeline
[Time] — Event or observation
[Time] — Decision or action taken
[Time] — Result observed
...

## Root cause
One paragraph. The systemic condition, not the proximate event.

## What went well
Honest. Things that worked: fast detection, effective rollback, good communication.

## What went badly
Honest. Things that didn't work: slow diagnosis, missing runbook, inadequate alerts.

## Action items
| Action | Category | Owner | Due | Ticket |
|--------|----------|-------|-----|--------|
| Add Stripe sandbox integration tests | Prevention | @alice | 2024-11-29 | ENG-4821 |

## Contributing factors
The conditions that aligned to make this incident possible.
```

**Publish post-mortems internally.** Transparency about failures builds trust, normalises the reality that incidents happen, and spreads learning across teams that might face similar conditions. Some companies — Cloudflare, Stripe, GitHub — publish major post-mortems publicly. This is worth considering for any incident affecting external users. A company that publishes an honest analysis of what went wrong and how they're preventing recurrence builds more customer trust than a company that goes silent.

* * *

## What goes wrong when incident management is broken

**No IC, so everyone investigates and nobody coordinates.** The incident lasts far longer than it should because decisions aren't being made, work is being duplicated, and communication is chaotic. The most experienced engineer is trying to investigate, coordinate, and update stakeholders at the same time and doing all three poorly.

**Diagnosis before mitigation.** The team spends forty-five minutes identifying root cause while users experience failures. A rollback that could have been executed in five minutes is delayed because nobody wanted to change things before understanding why. The root cause could have been found after restoring service.

**Blame in the post-mortem.** The post-mortem identifies the engineer who made the change and the tone — even if unspoken — is that the problem was them. Future post-mortems have less honest disclosure. The systemic conditions that allowed the mistake are never addressed. The incident happens again with a different engineer in the role.

**Post-mortem as theatre.** The document is written, the action items are listed, the meeting ends. Nobody is assigned. No tickets are created. The document sits unread. The next incident reveals that nothing changed. The team starts to treat post-mortems as a bureaucratic requirement rather than a learning mechanism.

**Action item overload.** The post-mortem produces fifteen action items. Nobody can prioritise fifteen action items on top of their existing workload. Some get done, most don't, and the most important ones have no more visibility than the least important ones. Three focused action items with owners and deadlines do more than fifteen unfocused ones without either.

* * *

## If you do one thing from this post

Before the next incident happens, define your severity levels and write them down somewhere the whole team can see them.

**SEV1:** production down, data loss, or security breach. Wake anyone, now.

**SEV2:** major feature broken, or a significant percentage of users affected. Page on-call now.

**SEV3:** non-critical feature broken, or a small percentage of users affected. Fix next business day.

**SEV4:** minor or cosmetic issue. Add to backlog.

Shared severity definitions mean everyone makes the same call when the same thing happens. Without them, one engineer pages the whole team for a SEV3 at midnight while another waits until morning to report a SEV2. The definition meeting takes thirty minutes. The clarity it provides is worth months of confusion.
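Writing the definitions as code is one way to make sure two engineers reach the same answer from the same facts. The sketch below is illustrative only: the `classify` function, its parameter names, and especially the 10% threshold for a "significant percentage" are assumptions that each team would set for itself.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "wake anyone, now"
    SEV2 = "page on-call now"
    SEV3 = "fix next business day"
    SEV4 = "add to backlog"


# Assumed cut-off for "significant percentage of users"; pick your own.
SIGNIFICANT_PCT = 10.0


def classify(
    *,
    production_down: bool = False,
    data_loss: bool = False,
    security_breach: bool = False,
    major_feature_broken: bool = False,
    noncritical_feature_broken: bool = False,
    affected_user_pct: float = 0.0,
) -> Severity:
    """Map observed facts to a severity level, most severe rule first."""
    if production_down or data_loss or security_breach:
        return Severity.SEV1
    if major_feature_broken or affected_user_pct >= SIGNIFICANT_PCT:
        return Severity.SEV2
    if noncritical_feature_broken or affected_user_pct > 0:
        return Severity.SEV3
    return Severity.SEV4


level = classify(affected_user_pct=30.0)
print(level.name, "->", level.value)  # SEV2 -> page on-call now
```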

* * *

*Next up:* [*Post 16 — Cloud Costs and Platform Engineering: Making the Right Thing the Default Thing*](/cloud-costs-and-platform-engineering-making-the-right-thing-the-default-thing)

*←* [*Post 14: Alerting That Doesn't Burn Out Your Team*](/alerting-that-doesn-t-burn-out-your-team)
