I mentioned in a previous blog post that one topic came up again and again in the Outages open space talk at DevOpsDays Silicon Valley: post-mortems, whether that meant a post-mortem report or a project post-mortem template with deliverables for outages.
Outages are going to happen, and most major tech companies have tools in place to alert the right people, provide the information needed to diagnose the problem quickly, and help the team collaborate to resolve the outage as fast as possible. Whatever tools or methods you use to produce the post-mortem, reporting on and discussing the details of an outage or critical problem effectively is essential to improving your systems and services.
Additionally, it’s important to work through the human side of outages and the emotions attached to them. The term “blameless post-mortem” has popped up a number of times in conversations and has gained a lot of traction since Etsy adopted the practice.
Empathy and lack of blame are points touched on quite heavily in a book I just finished titled The Human Side of Postmortems – Managing Stress & Cognitive Biases by David Zwieback. The author (and most everyone I spoke with at DevOpsDays) believes very simply…
Your organization must continually affirm that individuals are NEVER the “root cause” of outages.
Post-mortem reporting is an entire subject of its own, but the idea of a blameless post-mortem really gets the gears turning for those who are just starting to think about how to manage the post-mortem process.
The whole idea of blamelessness can be counterintuitive to many. Infrastructure engineers may be quick to take responsibility for a failure based on an action they took. Likewise, it’s often easy to point fingers and put the blame on someone else. Either way, this type of behavior skips right over the investigation and jumps straight to an incorrect conclusion. It keeps teams and organizations from digging deeper to identify the conditions that enabled the failure. Until you understand the true cause and conditions of a failure, a similar outage is likely to occur again. If that happens, your post-mortem wasn’t all that effective.
Of course, it’s always possible that a team member had a hand in causing an outage. Even then, it’s better to ask not “What did you do?” but rather “What did you learn?”. We are all learning and growing. If we’re going to be a team and build velocity (whichever definition you prefer), the blaming must stop so teams can improve.
At the end of the day, while it’s easy to say “Jason caused the failure by (accidentally) deleting an entire cluster of servers”, it’s far more productive to skip the blaming and ask “Why was it possible for ANYONE to delete the cluster?”.
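To make that second question a little more concrete, here is a minimal sketch of the kind of guardrail a post-mortem like this might lead to. Everything in it is hypothetical and invented for illustration: the names DeleteRequest, check_delete and DeletionBlocked, the threshold of three servers, and the rule that a second person must approve a bulk delete. It is not how any particular tool works, just one way the system itself could make the mistake harder for anyone to repeat.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class DeletionBlocked(Exception):
    """Raised when a bulk delete fails a safety check (hypothetical)."""


@dataclass(frozen=True)
class DeleteRequest:
    requested_by: str
    server_ids: Tuple[str, ...]
    approved_by: Optional[str] = None  # second person who reviewed the request


# Assumed threshold: deletes touching more servers than this need a reviewer.
MAX_UNREVIEWED_DELETES = 3


def check_delete(request: DeleteRequest) -> None:
    """Raise DeletionBlocked unless the request passes the guardrails."""
    if len(request.server_ids) <= MAX_UNREVIEWED_DELETES:
        return  # small deletes go through without review
    if request.approved_by is None:
        raise DeletionBlocked(
            f"{len(request.server_ids)} servers selected; a second person must approve"
        )
    if request.approved_by == request.requested_by:
        raise DeletionBlocked("requester cannot approve their own bulk delete")


if __name__ == "__main__":
    cluster = tuple(f"web-{i}" for i in range(10))
    try:
        check_delete(DeleteRequest(requested_by="jason", server_ids=cluster))
    except DeletionBlocked as reason:
        print(f"Blocked: {reason}")
```

The point of the sketch is that the check applies to everyone equally. It says nothing about who ran the command, only about what the system should allow without a second pair of eyes.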
UPDATE: Since writing this, I’ve had the opportunity to give a presentation on this very topic. You can see my slides below.