Stuff breaks. It’s inevitable. People make mistakes, technology breaks down, and processes aren’t infallible. But when incidents happen, what can we do about them? What can we learn?
As with all things, learning isn’t a binary action; it’s a process. When an incident occurs, organizations typically conduct a post-mortem analysis and generate a post-incident review to uncover what went wrong and why. These reports are critical for identifying where our processes and guardrails break down so we can learn how to do better next time.
When it comes to incident response, committing to learn from each incident is half the battle. The other half is setting up a successful process for continuous education by creating accurate and helpful post-mortem incident reports. While the goal of these reports is to provide you with the information you need to grow, there are a few things you should (and shouldn’t) do to ensure you’re getting the most out of them.
We’ve all heard about “blameless post-mortems.” But what does it really mean to be “blameless” in DevOps and IT? It doesn’t mean there are no consequences for malicious actions. Rather, a blameless culture recognizes that everyone makes mistakes, and that consequences handed out without context will, over time, discourage learning and continuous improvement.
When creating a post-incident review, it’s critical to avoid assigning blame to any one person. Instead, focus on where the process broke down (or where more process is needed). It doesn’t matter if Joe pushed the button or Jacob wrote the function – those actions may have contributed to the incident, but the true failure is almost always a lack of checks and balances along the way.
A good post-mortem report should avoid pointing fingers. Instead, everyone involved should take responsibility for both the process and the incident itself. To foster a blameless culture, it’s important to emphasize that everyone owns the quality process. When a problem that “you caused” gets re-framed into a problem that “we own,” it allows engineers to focus more clearly on what can be done to make things better rather than waste time trying to deflect blame.
When should you perform a post-mortem? If your answer to that question wasn’t “immediately,” then you’re not doing them soon enough. While it may sound counterintuitive, you should always create a post-mortem incident report while the proverbial incident iron is still hot. This way, the incident is treated with an appropriate level of criticality and all of the details are still fresh in everyone’s minds. If you wait too long to recap and evaluate, details will be missed, passion will be low and the need to improve the process won’t feel so urgent. If you’re going to conduct a post-incident review, do it now or don’t do it at all.
When it comes to post-mortems, data is like time – it’s better to have too much and not need it than not enough and need more. Start by establishing a timeline of what exactly happened and then flesh it out with as much detail as possible. This should include the sequence of events that led up to the incident, how and when the incident itself occurred, who was impacted, how many support cases were generated, who responded and how quickly, what the response team did to resolve the incident, and anything else you can think of. This post-mortem information will be invaluable in determining the root cause and will make identifying areas for improvement significantly easier.
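One way to keep that timeline honest is to record it as structured data rather than prose. What follows is only a minimal sketch in Python, not a prescribed format: the `TimelineEvent` class, the event labels ("started", "detected", "acknowledged", "resolved"), and the sample incident are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    at: datetime  # when the event happened
    kind: str     # hypothetical labels: "started", "detected", "acknowledged", "resolved"
    note: str     # free-form detail: who did what, and what they saw


def first_time(events, kind):
    """Return the timestamp of the earliest event with the given label."""
    return min(e.at for e in events if e.kind == kind)


def summarize(events):
    """Compute simple durations between the key moments of an incident."""
    started = first_time(events, "started")
    detected = first_time(events, "detected")
    return {
        "time_to_detect": detected - started,
        "time_to_acknowledge": first_time(events, "acknowledged") - detected,
        "time_to_resolve": first_time(events, "resolved") - started,
    }


# Illustrative data only; a real report would use the incident's actual log.
events = [
    TimelineEvent(datetime(2020, 1, 6, 14, 0), "started", "Deploy goes out; error rate climbs"),
    TimelineEvent(datetime(2020, 1, 6, 14, 10), "detected", "Alert fires; first support case opened"),
    TimelineEvent(datetime(2020, 1, 6, 14, 15), "acknowledged", "On-call engineer picks up the page"),
    TimelineEvent(datetime(2020, 1, 6, 14, 45), "resolved", "Deploy rolled back; error rate normal"),
]

for name, duration in summarize(events).items():
    print(f"{name}: {duration}")
```

Keeping a free-form note next to every timestamp means the detail survives alongside the numbers, so the durations never have to be reconstructed from memory later.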
Details, details, details. When creating a post-mortem report, don’t be vague. While the minutiae may seem unimportant, these details can be crucial to root cause analysis. More importantly, by putting as much detail as possible in the report, you eliminate the need to regroup unnecessarily with the incident response team, ensuring that learnings can be extracted directly from the report itself.
While “we” should always take responsibility, it’s important to identify who owns any action items that come out of the post-incident review. As the saying goes, “if everyone owns it, nobody owns it.” By defining clear owners for action items, you ensure any work that needs to get done as a result of the incident report has a person accountable for it.
Engineers like solving problems. Unfortunately, a post-mortem isn’t always the time to do that. Don’t get me wrong: the purpose of a post-mortem is to identify what went wrong and how we can prevent it from happening in the future, but it’s not always the time to get into the weeds on technical problems. In some cases, a post-mortem is a great place to do a root cause analysis. However, some problems have nuanced technical causes that require deeper investigation. While many of us may want to dig into these problems right away, they should be identified as action items coming out of a post-incident review rather than being allowed to distract from the goal of the entire process.
The best way to maintain focus when creating a post-mortem incident report is to use a fixed template. When time is critical, it’s important not to waste it by performing a post-mortem without an agenda. A fixed template can be used for every post-mortem at your organization to ensure the timeline, discovery, ownership and focus all follow a consistent pattern. Templates can, in turn, allow you to perform post-mortems on your post-mortem workflows to further improve the overall process over time.
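As a rough illustration of what a fixed template could look like in practice, here is a small Python sketch. The section names, the `PostMortemReport` and `ActionItem` classes, and the sample report are assumptions made for this example, not a standard or a VictorOps format; adapt them to whatever your organization actually tracks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str = ""  # one named person; left empty until someone takes it


@dataclass
class PostMortemReport:
    # Hypothetical section names, chosen for this sketch only.
    title: str
    summary: str = ""
    timeline: List[str] = field(default_factory=list)
    impact: str = ""
    root_cause: str = ""
    action_items: List[ActionItem] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """List empty sections and action items that have no owner."""
        missing = []
        for name in ("summary", "timeline", "impact", "root_cause", "action_items"):
            if not getattr(self, name):
                missing.append(name)
        for item in self.action_items:
            if not item.owner.strip():
                missing.append(f"owner for action item: {item.description}")
        return missing

    def render(self) -> str:
        """Render the report as plain text, always in the same section order."""
        lines = [self.title, "", "Summary", self.summary, "", "Timeline"]
        lines += [f"- {entry}" for entry in self.timeline]
        lines += ["", "Impact", self.impact, "", "Root cause", self.root_cause]
        lines += ["", "Action items"]
        lines += [
            f"- {item.description} (owner: {item.owner or 'UNASSIGNED'})"
            for item in self.action_items
        ]
        return "\n".join(lines)


# Illustrative report only; every value here is made up.
report = PostMortemReport(
    title="Post-mortem: checkout errors",
    summary="A deploy caused elevated checkout errors until it was rolled back.",
    timeline=["14:00 deploy goes out", "14:10 alert fires", "14:45 rollback complete"],
    action_items=[
        ActionItem("Add a canary stage to deploys", owner="Dana"),
        ActionItem("Investigate connection pool exhaustion"),
    ],
)

print(report.render())
print("Still missing:", report.missing_sections())
```

Because every report has the same sections in the same order, a gap such as an empty impact section or an unowned action item shows up immediately instead of being discovered weeks later.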
Incidents happen. But that doesn’t mean we can’t do something about them. By defining what a high-quality post-mortem incident report looks like, you can ensure the lessons learned with each incident aren’t lost. Eliminating blame, taking responsibility, gathering information, and focusing on tangible outcomes are all critical steps towards building a failure-tolerant culture that can learn from its mistakes.
Zachary Flower (@zachflower) is a principal engineer at Automox — a Boulder-based patch management startup — and freelance writer. With a passion for simplicity and usability within the development pipeline, Zach puts a strong emphasis on the importance of documentation, developer productivity, and shift-left testing strategies.