Incident management isn’t a straightforward, one-size-fits-all process. Every organization is built upon different infrastructure—technologically, culturally, and personnel-wise. And with the growing popularity of integrated systems and continuous feature delivery, building highly observable, scalable systems is more difficult than ever.
That being said, this post is meant to serve as your Incident Management Handbook—a comprehensive overview of the incident life cycle and how you can holistically approach incident management.
If you take nothing else from this, remember to do anything you can to make current and future incident detection, response, and resolution better for your people.
Your level of incident management maturity refers to how well you understand your system and how effectively you can diagnose, respond to, and resolve problems. We typically define four different levels of incident management maturity:
Understanding the five stages of the incident life cycle will give you greater insight into building a more robust incident management workflow. No matter what you do, incidents will happen. Detecting, responding to, and remediating an incident is no longer enough. Truly holistic incident management means you’re analyzing past incidents and using that information to prepare for future ones, expediting the entire process.
Too many false alerts + Too many interruptions = Acute Alert Fatigue
Proper monitoring and a balance between under-alerting and over-alerting are the keys to avoiding alert fatigue and detecting actionable incidents quickly. Time-series databases and visualization tools allow you to monitor your system’s performance in real time. If you’re unable to detect errors or failures quickly, you’re already behind the 8-ball. Setting up monitoring tools, thresholds, and alerts for key pain points of your infrastructure will help you identify issues, sometimes before they even happen.
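One common way to balance under-alerting against over-alerting is to alert only when a metric stays over its threshold for several samples in a row. The sketch below illustrates that idea. The class name, the threshold of 90, and the three-sample window are all hypothetical choices for illustration, not a recommendation for any particular monitoring tool.

```python
from collections import deque


class ThresholdAlert:
    """Fires only after `required_breaches` consecutive samples exceed `threshold`.

    Hypothetical helper for illustration; real monitoring tools expose
    similar "for N minutes" style conditions in their own configuration.
    """

    def __init__(self, threshold, required_breaches=3):
        self.threshold = threshold
        self.required_breaches = required_breaches
        # Keep only the most recent breach/no-breach results.
        self._recent = deque(maxlen=required_breaches)
        self.firing = False

    def observe(self, value):
        """Record one sample; return True only on the sample that starts a new alert."""
        self._recent.append(value > self.threshold)
        sustained = (
            len(self._recent) == self.required_breaches and all(self._recent)
        )
        newly_firing = sustained and not self.firing
        self.firing = sustained
        return newly_firing


def count_alerts(samples, threshold, required_breaches=3):
    """Count how many separate alerts a series of samples would raise."""
    alert = ThresholdAlert(threshold, required_breaches)
    return sum(alert.observe(sample) for sample in samples)


if __name__ == "__main__":
    cpu_percent = [95, 40, 96, 97, 98, 99, 50, 91, 92, 93]
    print("alerts with 3-sample window:", count_alerts(cpu_percent, 90, 3))
    print("alerts with 1-sample window:", count_alerts(cpu_percent, 90, 1))
```

Requiring a sustained breach ignores the single spike at the start of the series, at the cost of detecting a real problem a few samples later. Where that trade-off should sit depends on how costly a missed or late alert is for the service in question.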
Detection really boils down to prioritization and visibility. Once you’re able to collect the metrics you need, you can start to visualize the data and share the necessary information with applicable team members. The faster you can detect a problem and understand it, the faster you’ll be able to respond.
A centralized timeline, with an efficient system of incident categorization and prioritization, provides cross-team visibility and gets actionable alerts to the right people at the right time. Not only should the alert be routed to the proper person in a timely manner, but the alert needs to come with actionable context. With the context, you can provide relevant runbooks or triage instructions in order to give the on-call engineer exactly what they need to resolve that specific problem.
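As a rough illustration of routing an alert to the right team with actionable context attached, the sketch below looks up an owning team and a runbook for each alert. The service names, team names, runbook paths, and severity labels are all made up for the example, and the routing table is an assumption rather than a feature of any specific product.

```python
from dataclasses import dataclass, field

# Hypothetical routing table: service name -> (owning team, runbook location).
ROUTES = {
    "checkout-api": ("payments-oncall", "runbooks/checkout-api.md"),
    "search": ("search-oncall", "runbooks/search.md"),
}
# Fallback for services nobody has claimed yet (also hypothetical).
DEFAULT_ROUTE = ("platform-oncall", "runbooks/general-triage.md")


@dataclass
class Alert:
    service: str
    summary: str
    severity: str  # assumed labels: "critical", "warning", or "info"
    context: dict = field(default_factory=dict)


def route_alert(alert):
    """Build a notification that carries the runbook and context with it."""
    team, runbook = ROUTES.get(alert.service, DEFAULT_ROUTE)
    return {
        "notify": team,
        # Only critical alerts interrupt someone; the rest go to the timeline.
        "page": alert.severity == "critical",
        "summary": f"[{alert.severity.upper()}] {alert.service}: {alert.summary}",
        "runbook": runbook,
        "context": alert.context,
    }


if __name__ == "__main__":
    alert = Alert(
        service="checkout-api",
        summary="p99 latency above 2s",
        severity="critical",
        context={"region": "us-east", "recent_deploy": True},
    )
    print(route_alert(alert))
```

Keeping the runbook lookup next to the routing decision means the person who is paged receives the triage instructions in the same message, instead of having to search for them mid-incident.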
Using ChatOps tools and workflows will allow multiple teams and people to collaborate around incidents. In addition to optimizing chat for incident response, you’ll need essential incident management functionality such as dynamic on-call scheduling, team rotations, scheduled overrides, and automated escalation and alert routing. This moves you past a simple alerting tool and creates an environment where you’re getting notified, responding to an issue, and collaborating around the problem.
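To make rotations, scheduled overrides, and automated escalation a little more concrete, here is a minimal sketch of how they might fit together. The function names, the people in the rotation, and the escalation delays are hypothetical; it assumes fixed-length shifts and an escalation policy expressed as a list of delays.

```python
from datetime import datetime, timedelta


def on_call(rotation, rotation_start, shift_length, at, overrides=()):
    """Return who is on call at time `at`.

    `overrides` is an iterable of (start, end, person) tuples; the first
    override covering `at` wins over the regular rotation.
    """
    for start, end, person in overrides:
        if start <= at < end:
            return person
    # Whole shifts elapsed since the rotation began (timedelta // timedelta -> int).
    shifts_elapsed = (at - rotation_start) // shift_length
    return rotation[shifts_elapsed % len(rotation)]


def escalation_target(policy, paged_at, now):
    """Return who should currently hold an unacknowledged page.

    `policy` is a list of (delay, target) pairs sorted by delay, with the
    first delay equal to zero.
    """
    waited = now - paged_at
    target = policy[0][1]
    for delay, step_target in policy:
        if waited >= delay:
            target = step_target
    return target


if __name__ == "__main__":
    rotation = ["ana", "ben", "chi"]
    start = datetime(2024, 1, 1)
    week = timedelta(days=7)
    cover = [(datetime(2024, 1, 9), datetime(2024, 1, 11), "dee")]

    print(on_call(rotation, start, week, datetime(2024, 1, 10)))
    print(on_call(rotation, start, week, datetime(2024, 1, 10), cover))

    policy = [
        (timedelta(minutes=0), "primary"),
        (timedelta(minutes=5), "secondary"),
        (timedelta(minutes=15), "engineering-manager"),
    ]
    paged = datetime(2024, 1, 10, 3, 0)
    print(escalation_target(policy, paged, paged + timedelta(minutes=7)))
```

Overrides are checked before the rotation so that a temporary swap never requires editing the underlying schedule.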
Every second you spend responding to an incident is time that could have been spent improving future reliability and uptime. Smooth incident response workflows will provide your team(s) with more information, make them happier, and speed up the overall response.
Your incident management software should act as a single pane of glass for everything from current system reliability to new production deploys. Robust log analytics tools, time-series databases, visualization tools, and ChatOps tools can live in one location and provide ultimate visibility and collaboration. Whoever needs to be involved in remediation can be easily pulled in, with the data, runbooks, and triage instructions they need, to quickly resolve an incident.
In fact, remediation of an incident should be the smallest part of the process. If your team is prepared, knowledgeable about the system as a whole, and equipped with the tools they need for actionable monitoring, alerting, and collaboration, incident remediation will be quick. When it comes to remediation, though, it’s important to pay attention to incident prioritization and make sure that high-severity, potentially customer-facing incidents are resolved first.
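Prioritization during remediation can be as simple as a consistent sort order for the open incident queue. The sketch below is one possible ordering, assuming three made-up severity labels and a customer-facing flag: severity first, then customer impact, then age. The field names and labels are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed severity labels; lower rank is handled first.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2}


@dataclass(frozen=True)
class Incident:
    name: str
    severity: str
    customer_facing: bool
    opened_at: datetime


def triage_order(incidents):
    """Sort by severity, then customer-facing first, then oldest first."""
    return sorted(
        incidents,
        key=lambda i: (
            SEVERITY_RANK[i.severity],
            not i.customer_facing,  # False sorts before True
            i.opened_at,
        ),
    )


if __name__ == "__main__":
    queue = [
        Incident("A", "minor", True, datetime(2024, 1, 10, 9, 0)),
        Incident("B", "critical", False, datetime(2024, 1, 10, 9, 5)),
        Incident("C", "critical", True, datetime(2024, 1, 10, 9, 10)),
        Incident("D", "major", True, datetime(2024, 1, 10, 9, 2)),
    ]
    for incident in triage_order(queue):
        print(incident.name, incident.severity, incident.customer_facing)
```

Whether customer impact should outrank severity, or the other way around, is a policy decision; the point is that the rule is written down once and applied the same way every time.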
A solid post-incident review will prepare you for future issues and help you build more robust incident management processes. Take note of the timeline throughout an incident. Who was involved? How long did it take to detect the issue? How was the issue detected, responded to, and resolved? Fully understanding the actions taken (on-call response, alert escalations, communication, etc.), the tools and processes that were effective, and exactly how individual sections of the incident management life cycle played out can help you build more robust incident management workflows.
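The timeline questions above translate directly into a few durations that can be computed for each incident and averaged over time. The sketch below assumes a timeline with four hypothetical timestamps (started, detected, acknowledged, resolved) and measures acknowledgement and resolution from the moment of detection. Teams define these intervals differently, so treat the definitions here as assumptions.

```python
from datetime import datetime, timedelta


def phase_durations(timeline):
    """Break one incident's timeline into review-friendly durations.

    `timeline` is a dict with 'started', 'detected', 'acknowledged',
    and 'resolved' datetimes (hypothetical field names).
    """
    return {
        "time_to_detect": timeline["detected"] - timeline["started"],
        "time_to_acknowledge": timeline["acknowledged"] - timeline["detected"],
        "time_to_resolve": timeline["resolved"] - timeline["detected"],
    }


def mean_duration(durations):
    """Average a non-empty list of timedeltas, e.g. to track MTTA over time."""
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    incident = {
        "started": datetime(2024, 1, 10, 10, 0),
        "detected": datetime(2024, 1, 10, 10, 4),
        "acknowledged": datetime(2024, 1, 10, 10, 6),
        "resolved": datetime(2024, 1, 10, 10, 34),
    }
    for phase, duration in phase_durations(incident).items():
        print(phase, duration)
```

Averaging `time_to_acknowledge` and `time_to_resolve` across many incidents gives figures in the spirit of MTTA and MTTR, which makes it easier to see whether process changes are actually helping.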
All of the previous stages lead up to ultimate readiness. The key is simply to pay attention and take action. Don’t keep doing things incorrectly just because adjusting the process takes an initial investment of time. Spend time optimizing alert rules, suppressing alert noise, adding applicable alert annotations, and defining effective escalation and routing policies. Build, maintain, and collaborate around software that enables your team to seamlessly address incidents and lower MTTA/MTTR.
A DevOps culture influences site reliability and improves incident management. A culture of accountability and collaboration is key to SRE and overall incident management readiness. When you’re responsible for maintaining the systems you create, you’ll be more cognizant of building something stable.
All incidents flow through the incident management life cycle, but expediting the process is in your hands.
VictorOps is purpose-built for managing and collaborating around an incident throughout the entire life cycle. Check out our free guide to see how a DevOps culture benefits incident life cycle management.