As we dive further into the eBook, “Post-Incident Reviews: Learning from Failure for Improved Incident Response,” we’ll explore how using analysis as an avenue for learning is the key to developing a successful post-incident plan. If you haven’t read the previous posts in this series, we recommend starting with those.
For a free copy of the full eBook, click here.
Discovering Areas of Improvement
It’s often easy for engineers, product owners, and leadership to focus solely on the current system while ignoring how implementation gaps or issues created during staging could affect future performance. During post-incident analysis, it’s important to examine every stage of the system’s evolution, from backup processes to testing environments.
In addition, decision makers can ensure that understanding isn’t limited to the IT department. This helps upper management and senior developers see where communication breaks down across the company, and how closing those gaps leads to quicker reactions and, as a result, faster resolutions.
Identifying Trade-offs and Shortcomings
Another component of a successful post-incident review is defining where current trade-offs exist, and how procedural or communication-based shortcomings impact the system. Ask questions such as, “How much of this is new information for people in the room?” and “How many of you in the room were aware of all the moving pieces here?” The answers often reveal a disconnect between work as designed and work as performed. When constructing a solid method for facilitating remediation efforts, you can also ask:
• What warrants the exercise?
• When should we perform it?
• Who should be there?
• How long will it take?
• What documents should come from the exercise?
• Who should have access to the artifacts created?
Answering these questions ultimately helps determine whether a post-incident analysis is necessary. If so, the next step is defining the incident and determining its lifecycle. An incident is defined as “any unplanned event or condition that places the system in a negative or undesired state.” From there, classify incidents by severity (the levels often vary by industry) and by priority (often segmented as Information, Warning, or Critical). Then evaluate the incident’s lifecycle in its entirety.
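To make the classification concrete, here is a minimal sketch of an incident record in Python. Only the three priority labels come from the text above. The numeric severity scale, the `Incident` fields, and the `needs_review` policy are assumptions for illustration, not something the eBook prescribes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Priority labels named in the text, ordered from least to most urgent."""
    INFORMATION = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Incident:
    summary: str
    priority: Priority
    severity: int  # hypothetical scale (1 = most severe); real scales vary by industry


def needs_review(incident: Incident, threshold: Priority = Priority.WARNING) -> bool:
    """Hypothetical policy: review any incident at or above the threshold priority."""
    return incident.priority >= threshold
```

A team might call `needs_review(Incident("checkout latency spike", Priority.CRITICAL, 2))` when deciding whether to schedule the exercise. The threshold is a team decision and should come out of the questions listed above.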
One useful lifecycle model is J. Paul Reed and Kevina Finn-Braun’s Extended Dreyfus Model for Incident Lifecycles. It includes five phases (Detection, Response, Remediation, Analysis, and Readiness) and illustrates where teams might have strong processes in place and where they might be lacking. Within these phases, teams can establish where an incident originated, how it evolved, and what can be learned to recognize outages more quickly, recover faster, and reduce the impact of a disruption.
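One simple way to put the five phases to work is to tag timeline entries by phase and see where the time went. The sketch below does only that. It is not an implementation of the Extended Dreyfus Model itself, and the `time_in_phases` helper and its input shape are assumptions for illustration.

```python
from datetime import datetime, timedelta
from enum import Enum


class Phase(Enum):
    """The five lifecycle phases named in the text."""
    DETECTION = "Detection"
    RESPONSE = "Response"
    REMEDIATION = "Remediation"
    ANALYSIS = "Analysis"
    READINESS = "Readiness"


def time_in_phases(
    transitions: list[tuple[Phase, datetime]], end: datetime
) -> dict[Phase, timedelta]:
    """Given (phase, start time) pairs in chronological order, total the time per phase.

    Each phase lasts until the next transition; the last one lasts until `end`.
    """
    durations: dict[Phase, timedelta] = {}
    for i, (phase, start) in enumerate(transitions):
        next_start = transitions[i + 1][1] if i + 1 < len(transitions) else end
        durations[phase] = durations.get(phase, timedelta()) + (next_start - start)
    return durations
```

A phase that consistently takes the most time across incidents is a reasonable place to start looking for process gaps.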
Conducting a Post-Incident Review
After an incident is clearly defined, the final step is organizing a formal review. In addition to the qualities mentioned above, a well-executed post-incident review has a clearly stated purpose and a repeatable framework, and it can be broken down into the following components:
- Who: all of the people involved in decisions that may have contributed to the problem or recovery efforts
- The facilitator: preferably, a third-party facilitator to reduce bias
- What: the details of the systems, how they behave, and team formation during incident response
- When: immediately following the incident to ensure the details stay top-of-mind
- Where: preferably an in-person meeting; otherwise, a video conference
During your post-incident review, here is a sample framework to leverage as the agenda:
- 1. Establish a timeline of what took place at every stage of the incident.
- 2. Document how people interacted and what their thought process was throughout the challenge.
- 3. Record remediation tasks, including the commands or actions engineers took to fix the system.
- 4. Utilize ChatOps to record exactly what happened during an incident, in real time.
- 5. Evaluate metrics that were leveraged during response and remediation.
- 6. Define time to acknowledge (TTA) and time to recover (TTR) to measure how long it takes to acknowledge a triggered incident and how long it takes to restore service.
- 7. Implement status pages to issue real-time updates regarding the state of a service.
- 8. Define the severity and impact of the incident, as outlined above in Identifying Trade-offs and Shortcomings.
- 9. Determine contributing factors to understand causation and correlation.
- 10. Capture action items when suggestions are made to better detect and recover from similar problems.
Finally, create internal and external reports to share what was learned with the team and to provide an ongoing benefit to stakeholders outside of the company.
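The timeline and the TTA and TTR metrics from the agenda lend themselves to a small calculation. The following Python sketch assumes three timestamps per incident and measures both intervals from the moment the alert triggered. The `IncidentTimes` class, its field names, and that choice of starting point are illustrative assumptions, since teams define these metrics differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimes:
    triggered: datetime     # alert fired
    acknowledged: datetime  # a responder acknowledged the alert
    recovered: datetime     # service was restored

    @property
    def tta(self) -> timedelta:
        """Time to acknowledge, measured from the trigger (assumption)."""
        return self.acknowledged - self.triggered

    @property
    def ttr(self) -> timedelta:
        """Time to recover, measured from the trigger (assumption)."""
        return self.recovered - self.triggered


def mean(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)
```

Averaging `tta` and `ttr` over past incidents, for example with `mean([i.tta for i in incidents])`, gives a baseline to compare against after action items are implemented.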
For a complete guide to holding a successful post-incident review process and how it can vastly transform any breakdown into a learning opportunity, download the full eBook and be sure to subscribe to the blog.