As we dive further into the eBook Post-Incident Reviews: Learning from Failure for Improved Incident Response, we’ll explore how using analysis as an avenue for learning is the key to developing a successful post-incident plan. If you haven’t read the previous posts in this series, you can find them here:
Oftentimes, it’s easy for engineers, product owners, and leadership to focus solely on the current system while ignoring how implementation gaps or issues created during staging could affect future performance. During post-incident analysis, it’s important to review every stage of the system’s evolution, from backup processes to testing environments.
In addition, decision makers can ensure that this understanding isn’t limited to the IT department. A company-wide view helps upper management and senior developers see where communication broke down and how it can improve, leading to quicker reactions and, as a result, faster resolutions.
Another component of a successful post-incident review is defining where current trade-offs exist and how procedural or communication-based shortcomings impact the system. Ask questions such as, “How much of this is new information for people in the room?” and “How many of you in the room were aware of all the moving pieces here?” The answers often reveal a disconnect between “work as designed” and “work as performed.” When constructing a solid method for facilitating remediation efforts, you can also ask:
Answering these questions helps determine whether a post-incident analysis is necessary. If so, the next step is defining the incident and determining its lifecycle. Incidents are defined as “any unplanned event or condition that places the system in a negative or undesired state.” From there, break incidents down into levels of severity (which often vary by industry), and priority (which is often segmented as Information, Warning, or Critical). It’s then time to evaluate the incident’s lifecycle—in its entirety.
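As a rough illustration, the definition, severity, and priority described above could be captured in a small record type. This is only a minimal Python sketch: the `Incident` class, its field names, the numeric severity scale, and the `most_urgent_first` helper are hypothetical choices made for the example; only the three priority labels come from the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Priority labels named in the text, ordered from least to most urgent."""
    INFORMATION = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Incident:
    """Hypothetical record for one unplanned event; field names are assumptions."""
    summary: str
    severity: int  # assumed scale where 1 is most severe; real scales vary by industry
    priority: Priority


def most_urgent_first(incidents):
    """Sort by priority (Critical first), then by severity (lowest number first)."""
    return sorted(incidents, key=lambda i: (-i.priority, i.severity))
```

Keeping severity and priority as separate fields mirrors the distinction the text draws between the two; how a team weighs one against the other when triaging is a local decision.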
One useful lifecycle model is J. Paul Reed and Kevina Finn-Braun’s Extended Dreyfus Model for Incident Lifecycles. It includes five phases—Detection, Response, Remediation, Analysis, and Readiness—and illustrates where teams might have strong processes in place and where they might be lacking. It’s within these phases that teams can establish where an incident originated, how it evolved, and what can be learned to more quickly recognize outages, recover, and shorten the impact of a disruption.
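One way to see where an incident spent its time is to record when each of the five phases began and compute the elapsed time per phase. The sketch below assumes the phases run strictly in sequence and that a timestamp is available for the start of each one; both are simplifications for illustration, and the function names are hypothetical rather than part of the model itself.

```python
from datetime import datetime, timedelta

# Phase names taken from the lifecycle model described above.
PHASES = ["Detection", "Response", "Remediation", "Analysis", "Readiness"]


def phase_durations(phase_starts, lifecycle_end):
    """Return the time spent in each phase.

    phase_starts: dict mapping each phase name to the datetime it began.
    lifecycle_end: datetime at which the final phase is considered complete.
    """
    ordered = [phase_starts[name] for name in PHASES] + [lifecycle_end]
    # Reject timestamps that are out of lifecycle order.
    if any(later < earlier for earlier, later in zip(ordered, ordered[1:])):
        raise ValueError("phase timestamps must be in lifecycle order")
    return {name: ordered[i + 1] - ordered[i] for i, name in enumerate(PHASES)}


def longest_phase(durations):
    """Return the name of the phase with the largest duration."""
    return max(durations, key=durations.get)
```

A phase that consistently dominates across incidents may be a reasonable place to start looking for process gaps, though durations alone say nothing about why a phase ran long.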
After an incident is clearly defined, the final step is organizing a formal review. In addition to the qualities mentioned above, a well-executed post-incident review has a clearly stated purpose and a repeatable framework, and it can be broken down into the following components:
Here is a sample framework to use as the agenda for your post-incident review:
Finally, create internal and external reports to share what was learned from the incident with the team and to provide ongoing benefit to stakeholders outside the company.
For a complete guide to holding a successful post-incident review process and how it can vastly transform any breakdown into a learning opportunity, download the full eBook.