Many sources maintain that the goal of incident management is simple: restore service as quickly as possible. When a system breaks, the knee-jerk reaction is to fix it as fast as you can. And in everyday practice, that holds true: when a single incident occurs, the goal is to restore service as quickly as possible.
However, the ultimate goal of incident management needs a broader scope, because there is value in doing more than simply reacting to a problem. That’s why we conduct post-incident reviews. Sure, organizations should be concerned about restoring and maintaining services, but they should be even more focused on keeping the same issues from recurring. By making process improvement your incident management goal, you’ll improve service uptime as well.
But how do you get there? Improve your post-incident review process.
According to Jason Hand in his book Post-Incident Reviews: Learning from Failure for Improved Incident Response, the purpose of post-incident reviews is twofold:
“There are two main philosophical approaches to both what the analysis is and the value it sets out to provide organizations. For many, its purpose is to document in great detail what took place during the response to an IT problem (self-diagnosis). For others, it is a means to understand the cause of a problem so that fixes can be applied to various aspects of process, technology, and people (self-improvement).”
Jason points out two important philosophical approaches to learning from post-incident reviews: documentation and discovery.
You’ll be able to learn and improve if you keep track of everything during and after an incident. With a central timeline in your incident management tool, a sort of event black box, you can collect information coming into and out of the system so you always know what’s going on.
If an outage occurs in your application, you’ll have the alert information from the monitoring tool that first reported the downtime, and you’ll have the conversation history from your Slack instance showing how your team worked together to fix the problem.
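To make the black box idea concrete, here is a minimal sketch in Python of a timeline that merges events from several sources into one ordered record. It is only an illustration: the `TimelineEvent` and `IncidentTimeline` names, the source labels, and the sample timestamps and messages are all made up for this example and do not reflect any particular incident management tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: datetime
    source: str   # hypothetical labels, e.g. "monitoring", "chat", "deploy"
    message: str


@dataclass
class IncidentTimeline:
    """A hypothetical 'black box' that collects events as they arrive."""
    events: List[TimelineEvent] = field(default_factory=list)

    def record(self, timestamp: datetime, source: str, message: str) -> None:
        # Events can arrive out of order, so we only append here.
        self.events.append(TimelineEvent(timestamp, source, message))

    def in_order(self) -> List[TimelineEvent]:
        # Sort by timestamp so the review reads as one chronological story.
        return sorted(self.events, key=lambda e: e.timestamp)

    def from_source(self, source: str) -> List[TimelineEvent]:
        # Filter the chronological view down to a single source.
        return [e for e in self.in_order() if e.source == source]


# Made-up sample data: a chat message recorded before the alert that preceded it.
timeline = IncidentTimeline()
timeline.record(datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
                "chat", "Rolling back the last deploy")
timeline.record(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
                "monitoring", "Checkout API is down")

for event in timeline.in_order():
    print(event.timestamp.isoformat(), event.source, event.message)
```

Sorting at read time rather than at write time keeps recording cheap during the incident, when nobody has attention to spare, and still gives the review a single chronological view.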
Once everything is resolved and you look back, you may find the issue could’ve been avoided with a different deployment process. Or perhaps you’ll find the issue would’ve been resolved faster if your entire team had been using the same ChatOps tool instead of jumping between emails, chat windows, and phone calls. Bringing this information together during your post-incident review will help you improve processes, technologies, and team relationships.
Don’t get tied up in the technical causation of an incident. Look beyond the now and think about what the incident means for your overall application strategy. Your incidents can point to problems in your processes and technologies, and identifying these trends can ultimately lead to fewer incidents over time. To help your team lean into DevOps at a faster rate, continuously improve your processes and technologies.
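One way to spot such trends is to tag each review with its contributing factors and count how often each tag recurs. The sketch below assumes a hypothetical record format, with made-up incident IDs and factor tags; your own reviews may be structured quite differently.

```python
from collections import Counter

# Hypothetical review records; the IDs and factor tags are invented for illustration.
reviews = [
    {"incident": "INC-1", "factors": ["deploy-process", "missing-alert"]},
    {"incident": "INC-2", "factors": ["deploy-process"]},
    {"incident": "INC-3", "factors": ["communication", "deploy-process"]},
]


def recurring_factors(reviews, min_count=2):
    """Return (factor, count) pairs seen in at least `min_count` reviews."""
    # set() ensures a factor is counted once per review, even if tagged twice.
    counts = Counter(f for r in reviews for f in set(r["factors"]))
    return [(f, n) for f, n in counts.most_common() if n >= min_count]


trends = recurring_factors(reviews)
print(trends)
```

With this sample data, only the deployment process appears in more than one review, which is the kind of signal that points past any single technical cause.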
Post-incident reviews make it possible for your team to regularly identify issues in development, operations, security, and more. Don’t skip over these areas just because they aren’t directly tied to a specific incident. Use these opportunities to learn and improve at a faster pace.
When you’re in the middle of a firefight, the goal will always be to put out the flames. But once you’ve found a resolution, use your incident management tool to see what happened while you were focused on resolving the issue. Rely on your black box of information during your post-incident review. The Timeline within your incident management tool will help gather what happened across your systems and teams during an incident. You’ll be able to retroactively see all the deploys, code changes, conversations, hiccups, and solutions in one place.
It’s time to step away from the firefight and take a more holistic approach to your incident management process. Tackle incidents with speed and accuracy, but take time to focus on continuous learning through post-incident reviews. The more you learn about your system, your processes, and your team collaborations, the easier it’ll be to handle the next incident at a faster pace. Increasing the speed of innovation while also maintaining stable systems should be your goal.