Blameless postmortems
Published 2025-01-01. From Musings on tech.
I’ve worked on 3 separate Google SRE teams since joining in 2020. One thing I see the SRE org execute on really well is blameless postmortems. In other words, a postmortem where there’s no finger-pointing at individuals, and instead a focused effort to improve the lapse in process that caused something to break.
Google’s popular SRE book has an excellent section explaining how to write these. But IMO it skimps on explaining why you should want to write them in the first place, so I’m here to babble about that.
Why you should virtually always prefer a blameless postmortem
The obvious reasons
It’s healthier for group morale. Higher morale leads to all sorts of positive emergent culture patterns:
- People feel comfortable communicating bad news to leadership early, without fear of retaliation.
- Developer velocity improves, because engineers feel more protected taking (healthy) risks.
- Everyone’s happier.
And besides, it’s nice to not point fingers. Nobody enjoys a meeting where one person gets chastised in front of their peers: not the peers, nor the presenter, nor the chastisee.
Less obvious reasons you should want blameless postmortems
Acting blamelessly lets you scrutinize your processes more effectively than you could if you had attributed the outage to a human.
The larger insight I’m ramping up to is that blameful postmortems rarely work in the long term anyway: blaming humans for outages typically produces band-aid fixes that only hold until one person forgets about them. Lurking beneath a human-based trigger is usually a more complicated, process-based root cause.
To hammer this home — blaming a human ensures:
- That person won’t repeat their mistake.
… and that’s about it. If you’re lucky enough that everyone on your team reads your postmortem, and it strikes the fear of ((deity)) in their hearts, nobody else will repeat that mistake either. For a while. Until they forget, or you hire someone new, or an opaque technical change happens, etc. …
Whereas blaming a process (and then improving said process) ensures:
- That category of mistake won’t happen again.
The drawback is that improving a process typically costs more short-term effort than blaming a human. Even the exercise of determining which process to improve may require input from technical stakeholders whose time is valuable, and the improvements themselves cost SWE-time to implement. You may find yourself making tough tradeoffs between shipping fast (and breaking things) versus investing in reliability (to stop breaking things). Choosing which to prioritize is difficult, and best decided either via well-sharpened intuition (for teams in scrappy situations) or via a checks-and-balances culture between developers and DevOps engineers who can suggest how to manage their own time (for teams in less scrappy situations). It’s also worth noting that this kind of culture is easier to cultivate when issues are handled blamelessly and stakeholders can voice their honest opinions.
Anyway, even if your team doesn’t have the bandwidth to deliver large improvements, it’s still a useful exercise to brainstorm them blamelessly.
The bottom line
Don’t trick yourself into believing that blaming the person who triggered an outage will prevent that outage from happening again. Best case scenario: it does, but only for a little while.
Instead: blame processes, not people.
Appendix
On laziness causing outages
Outages are bad, and lazy actions often cause outages, so it can be tempting to assume that the root cause of an outage is pure laziness. It’s especially tempting when the trigger looks like laziness:
For example: Person X forgot to run unit tests, even though the launch documentation clearly told them to.
Even in cases where human carelessness seems to be the issue: dig deeper. A lazy human might have been the trigger, yes, but what was the underlying cause? If a human can forget once, they can forget again, especially as your team scales in size or workload.
(Separately: one-off mistakes are a very biased metric for measuring someone’s performance. It’s possible Person X is simply working on a riskier project.)
In scenarios with a human trigger, it’s still better to dig into which process can be improved: anything from updating an internal wiki page, to making tests run automatically, to a more serious rewrite. Even if you don’t implement these changes, brainstorming them teaches you the developer-agnostic cost of reliability, that is, the cost of making reliability live somewhere other than your current teammates’ brains. Developers aren’t fungible (firing one and hiring another doesn’t mean the new hire can fill their shoes instantly), so this exercise helps you understand where the reliability gaps are and makes you more resilient to change.
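To make “making tests automatic” concrete, here’s a minimal sketch of a pre-push Git hook that runs the test suite before every push. It assumes a pytest-based Python project; the file path and command are illustrative, not taken from any particular team’s setup.

```python
#!/usr/bin/env python3
# Hypothetical pre-push Git hook (saved as .git/hooks/pre-push and made
# executable). It turns "remember to run unit tests" from a human step
# into an automatic one. Assumes a pytest-based project; adapt the command.
import subprocess
import sys


def main() -> int:
    # Run the test suite; a non-zero exit code aborts the push.
    result = subprocess.run(["python3", "-m", "pytest", "--quiet"])
    if result.returncode != 0:
        print("pre-push: unit tests failed; push aborted.", file=sys.stderr)
    return result.returncode


if __name__ == "__main__":
    sys.exit(main())
```

In practice you’d more likely enforce this in CI or a presubmit system rather than on individual machines, but the principle is the same either way: the safeguard lives in the process rather than in someone’s memory.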
One last note on laziness. I like the quote:
“You do not rise to the level of your goals. You fall to the level of your systems.”
On postmortem culture being “professional” and war room comms being “unprofessional”
Redacting names from a postmortem is typically good, since they aren’t relevant to making future improvements to your processes.
Redacting names in a live war room is… doable, but probably not worth it IMO. Remember: blamelessness is just a means to an end. The larger goal is nurturing a culture in which people feel comfortable piping up that their code might be the culprit of an outage. Typically, when people are confident their code is broken, they’ll speak up; your job as an incident commander and postmortem steward is to make folks feel comfortable speaking up before they’re confident. This lets you gather more technical opinions faster, grease the wheels of conversation, and ultimately drive a faster fix.