Resolving incidents quickly and mitigating downtime are essential when disaster strikes. The post-incident review that follows is just as important. Strategically planned and well-run post-incident reviews help on-call engineers manage incidents more effectively and reduce the number of future incidents.
This post is a guide to creating helpful, actionable post-incident reviews.
If you’re interested, get yourself a full, downloadable PDF of the VictorOps Post-Incident Review Template.
Who Should Be Involved
The post-incident review should include anyone who was involved in the incident, was affected by it, or can provide constructive, after-the-fact feedback.
Post-incident reviews should always include first responders and escalation responders. If your team has designated an Incident Commander to lead the on-call team during an incident, that person should also take part in the post-incident review. On occasion, you will also want to loop in management and stakeholders from other areas of the organization who may have been affected or simply need visibility into the incident. Don’t include people who won’t contribute to or benefit from the post-incident review.
What Are the Goals of Your Post-Incident Review?
The whole point of the post-incident review process is to learn from your mistakes and improve your systems. Think about what you’re trying to accomplish at a granular level so you can decide exactly how to conduct your post-incident reviews. At a high level, though, every post-incident review should address two questions:
- How do you know sooner (detection)?
- How do you recover sooner (response & remediation)?
You may have detected, responded to, and remediated a minor, non-customer-affecting issue within seconds. Maybe you feel this doesn’t even warrant a post-incident review. But there’s always room for improvement. You can learn as much from what you did well as you can from what you didn’t do well. Creating blameless post-incident reviews that answer the above questions and record comprehensive situational details is essential for all incidents, no matter how small.
Essential Post-Incident Review Metrics and Data
Your solutions have to remain data-driven. Recording and presenting the data behind the events of an incident gives a deeper picture of what happened and how you can fix it in the future. Key metrics and data to include in your post-incident review are:
- Time to Acknowledge
- Time to Recovery
- Time between each individual incident phase (for example, detection, response, and remediation)
- Severity of the issue
- Were your customers affected?
- Were other, intertwined systems impacted by the incident?
- Who was the Incident Commander in this particular circumstance?
- Logs for all of the tasks, conversations, and actions that were performed during the incident. Also, who took part in each of these tasks, conversations, or actions?
  - Which tasks/conversations/actions were productive?
  - Which tasks/conversations/actions were counter-productive?
- What kind of information was shared during the incident?
  - Determine which pieces of information were beneficial to resolving the incident and which pieces of information may have clouded visibility around the issue.
For more information and deeper insights into essential metrics, you can download the full, free O’Reilly Media Report on Post-Incident Reviews.
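The timing metrics in the list above can be derived from a simple incident timeline. Here is a minimal Python sketch, assuming you have a timestamp for each phase of the incident. The phase names (`triggered`, `acknowledged`, `mitigated`, `resolved`) and the timestamps are hypothetical, and measuring Time to Recovery from trigger to resolution is an assumption; your team may define the start and end points differently.

```python
from datetime import datetime, timedelta

# Hypothetical timeline for one incident. The phase names and timestamps
# are illustrative assumptions, not data from a real system.
timeline = {
    "triggered": datetime(2024, 1, 15, 9, 0, 0),
    "acknowledged": datetime(2024, 1, 15, 9, 4, 0),
    "mitigated": datetime(2024, 1, 15, 9, 31, 0),
    "resolved": datetime(2024, 1, 15, 9, 47, 0),
}


def phase_durations(timeline):
    """Return the elapsed time between each consecutive pair of phases."""
    # Sort phases by timestamp so the pairs follow the order of events.
    ordered = sorted(timeline.items(), key=lambda item: item[1])
    return {
        f"{start} -> {end}": end_time - start_time
        for (start, start_time), (end, end_time) in zip(ordered, ordered[1:])
    }


# Assumption: both metrics are measured from the moment the incident triggered.
time_to_acknowledge = timeline["acknowledged"] - timeline["triggered"]
time_to_recovery = timeline["resolved"] - timeline["triggered"]
durations = phase_durations(timeline)

print("Time to Acknowledge:", time_to_acknowledge)
print("Time to Recovery:", time_to_recovery)
for phase, elapsed in durations.items():
    print(phase, elapsed)
```

Recording these values for every incident, even the small ones, gives you a baseline to compare against when you ask how to know sooner and how to recover sooner.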
Actionable Tasks and Learnings
Based on the data, what actions can your team take to bolster your system’s reliability and ensure that this incident does not happen again? After the incident, you can record everything you learned and assign tasks accordingly. In addition to documentation, ensuring your team takes firm actions is a core function of not only a successful post-incident review, but also the full incident management lifecycle.
If you think some detail, graph, or note is worth mentioning in the post-incident review, then mention it. Don’t hold back on the notes and information you provide in the post-incident review. Just make sure everything provided is actionable and offers insight into why the incident happened and how you can prevent it in the future.
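One way to keep follow-up tasks actionable is to record each one with an owner and a due date, then flag anything that is missing either. The sketch below is only an illustration: the `ActionItem` structure, its fields, and the sample tasks are assumptions made for this example, not part of any particular tool or template.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """A hypothetical follow-up task recorded during a post-incident review."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def unassigned(items):
    """Return the action items that lack an owner or a due date."""
    return [item for item in items if item.owner is None or item.due is None]


# Illustrative tasks only; the names and dates are made up.
items = [
    ActionItem("Add an alert on queue depth", owner="alice", due=date(2024, 2, 1)),
    ActionItem("Document the failover runbook"),
]

for item in unassigned(items):
    print("Needs an owner or due date:", item.description)
```

A check like this can run at the end of the review meeting, so nobody leaves with a task that has no one responsible for it.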
Don’t forget to get the downloadable PDF of our VictorOps Post-Incident Review Template for yourself!