
Post-Incident Reviews Part Four: A Post-Incident Review Guide and Next Best Steps

Jason Hand September 08, 2017


Over the last three blog posts on my eBook, Post-Incident Reviews: Learning from Failure for Improved Incident Response, I’ve shared recommended methods for incident management and explained why the old ways of conducting post-incident Root Cause Analysis are outdated and ineffective. It’s critical for companies to understand that incidents result not from a single root cause but from a complex interplay of technology and the humans managing it: the socio-technical system that is the reality of modern software. If you haven’t read the previous posts in this series, I recommend starting there.

In this final post in our Post-Incident Reviews blog series, I’ll review guidelines for creating your own post-incident review process while keeping in mind the demise of Root Cause Analysis. I’ll also share how you can structure an environment that establishes every incident as an opportunity for learning and total system improvement. For a more in-depth guide to read and share with your team, download the full eBook here.

Establish and Document the Timeline

The best way to begin the post-incident review process is to examine how tasks, automation, and human interaction restored services. Some of the fundamental questions your team needs to answer include:

  • Who was alerted first? What was the time to acknowledge?
  • Who else was brought in, and at what time?
  • Which tasks had positive, negative, or no impact on restoring service?
  • How much time did it take to recover?
  • Who executed specific tasks?
  • What was the severity level and the total time of each phase?
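
The questions above can be answered mechanically once the timeline is recorded as data. What follows is a minimal sketch, not a description of any particular tool: the `TimelineEvent` type, the event labels ("alerted", "acknowledged", "joined", "resolved"), and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEvent:
    at: datetime   # when it happened
    actor: str     # person or automation involved
    kind: str      # hypothetical labels: "alerted", "acknowledged", "joined", "resolved"


def first_event(events, kind):
    """Return the earliest event of the given kind."""
    return min((e for e in events if e.kind == kind), key=lambda e: e.at)


def timeline_metrics(events):
    """Answer the basic timeline questions from a list of events."""
    alerted = first_event(events, "alerted")
    acknowledged = first_event(events, "acknowledged")
    resolved = first_event(events, "resolved")
    joined = sorted((e for e in events if e.kind == "joined"), key=lambda e: e.at)
    return {
        "first_alerted": alerted.actor,
        "time_to_acknowledge": acknowledged.at - alerted.at,
        "time_to_recover": resolved.at - alerted.at,
        "brought_in": [(e.actor, e.at) for e in joined],
    }


# Illustrative data only, not taken from a real incident.
events = [
    TimelineEvent(datetime(2017, 9, 8, 14, 0), "alice", "alerted"),
    TimelineEvent(datetime(2017, 9, 8, 14, 4), "alice", "acknowledged"),
    TimelineEvent(datetime(2017, 9, 8, 14, 10), "bob", "joined"),
    TimelineEvent(datetime(2017, 9, 8, 14, 45), "bob", "resolved"),
]

metrics = timeline_metrics(events)
print(metrics["first_alerted"], metrics["time_to_acknowledge"], metrics["time_to_recover"])
```

Measuring both intervals from the first alert is one possible convention; your team may prefer to measure recovery from the moment of acknowledgment instead.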

It may help to visualize the timeline, assigning a different shape to each kind of action taken: task, automation, or human interaction. Then, use color coding to indicate whether each action had a good, bad, or neutral impact on restoration. With these tips, you can map out each factor’s impact over all three phases of the incident lifecycle (detection, response, and remediation). I’ve provided an example below.

This kind of chart will help visualize the time between phases, the impact of each action on the time to resolution, and whether time was used inefficiently. It can also assist in future incident management goal-setting (e.g., acknowledge incidents 50 percent faster).
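
If you want to build such a chart yourself, the shape and color encoding can be prepared as plain data before it reaches any charting tool. This is only a sketch under assumptions: the specific shapes, colors, and the `encode` helper are hypothetical choices, and the sample actions are made up.

```python
from datetime import datetime

# Hypothetical encoding: one shape per kind of action, one color per impact.
SHAPES = {"task": "square", "automation": "triangle", "human": "circle"}
COLORS = {"good": "green", "bad": "red", "neutral": "gray"}


def encode(actions, start):
    """Turn (timestamp, kind, impact, label) tuples into plot-ready points."""
    points = []
    for at, kind, impact, label in sorted(actions):
        points.append({
            "minutes_in": (at - start).total_seconds() / 60,  # position on the time axis
            "shape": SHAPES[kind],
            "color": COLORS[impact],
            "label": label,
        })
    return points


# Illustrative data only.
start = datetime(2017, 9, 8, 14, 0)
actions = [
    (datetime(2017, 9, 8, 14, 0), "automation", "good", "alert fired"),
    (datetime(2017, 9, 8, 14, 12), "human", "neutral", "checked dashboards"),
    (datetime(2017, 9, 8, 14, 20), "task", "bad", "restarted wrong service"),
    (datetime(2017, 9, 8, 14, 40), "task", "good", "rolled back deploy"),
]

points = encode(actions, start)
for p in points:
    print(f'{p["minutes_in"]:5.0f} min  {p["shape"]:<8} {p["color"]:<6} {p["label"]}')
```

Each resulting point carries everything a plotting library would need to draw one marker on the timeline.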

After mapping out your timeline, it’s important to probe deeply into how each member of the incident response team made their decisions. Genuine inquiry in a group setting not only encourages transparency and knowledge sharing, but also allows your team to reflect on the effectiveness of the approach at each incident phase. The entire company will become invested in the process and your team will be better equipped as a result.

Create Action Items

Your discussion should drive action items. For every incident, your team will need to:

  • Identify, prioritize, and assign action items. Giving each item a specific owner keeps it from sitting in the backlog, where it does nothing to improve your incident response.
  • Prioritize countermeasures and enhancements to the system above all new work.
  • Compile a summary. There are likely others in your organization who want to be informed about the incident. Your high-level summary should include several or all of the following sections: Summary, Services Impacted, Duration, Severity, Customer Impact, Proximate Cause, Resolution, and Countermeasures or Action Items.
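
One way to keep summaries consistent from incident to incident is to capture the sections listed above in a small template. The sketch below is an assumption about how that could look; the `IncidentSummary` and `ActionItem` types, the priority scale, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    priority: int      # 1 = highest; the scale is an assumption
    owner: str = ""    # empty means nobody has been assigned yet


@dataclass
class IncidentSummary:
    summary: str
    services_impacted: list
    duration: str
    severity: str
    customer_impact: str
    proximate_cause: str
    resolution: str
    action_items: list = field(default_factory=list)

    def unowned(self):
        """Action items that still need an owner."""
        return [a for a in self.action_items if not a.owner]

    def render(self):
        """Produce a plain-text summary with items in priority order."""
        lines = [
            f"Summary: {self.summary}",
            f"Services Impacted: {', '.join(self.services_impacted)}",
            f"Duration: {self.duration}",
            f"Severity: {self.severity}",
            f"Customer Impact: {self.customer_impact}",
            f"Proximate Cause: {self.proximate_cause}",
            f"Resolution: {self.resolution}",
            "Action Items:",
        ]
        for item in sorted(self.action_items, key=lambda a: a.priority):
            owner = item.owner or "UNASSIGNED"
            lines.append(f"  P{item.priority} [{owner}] {item.description}")
        return "\n".join(lines)


# Illustrative values only.
report = IncidentSummary(
    summary="Checkout requests failed after a deploy",
    services_impacted=["checkout", "payments"],
    duration="45 minutes",
    severity="SEV2",
    customer_impact="Some customers could not complete purchases",
    proximate_cause="Configuration change in the latest deploy",
    resolution="Rolled back the deploy",
    action_items=[
        ActionItem("Add config validation to the deploy pipeline", 1, "alice"),
        ActionItem("Document the rollback procedure", 2),
    ],
)

print(report.render())
print("Needs an owner:", [a.description for a in report.unowned()])
```

Flagging unowned items at the moment the summary is written makes it harder for them to slip into the backlog unnoticed.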

Improve Your Incident Readiness

If I leave you with one takeaway from this guide, it should be that every incident provides an opportunity for your team to be more prepared for the next one. You should come away from your incident analysis discussions with an enhanced understanding of your systems and a plan to improve. Set targets for small improvements across the detection, response, and remediation phases, and your team will make learning from failure a natural part of the system lifecycle.

Let your documentation open further dialogue, and encourage transparency through every step of the process. This kind of post-incident review strategy will foster an environment of constant experimentation, learning, and a desire to make systems safer. Instead of retreating into isolation to avoid taking responsibility, everyone will become hungry to learn. Your team will inevitably be better equipped to tackle the next incident when it comes.

This completes the Post-Incident Reviews blog series. I hope you now appreciate why the old way of retrospective incident analysis has aged poorly, and that if you stick with it, you’ll likely experience some frustration in the near future.

For the complete guide to holding a successful modern post-incident review process, download the full eBook.
