The Incident Management Handbook

Incident management isn’t a straightforward, one-size-fits-all process. Every organization is built upon different infrastructure—technologically, culturally, and personnel-wise. And with the growing popularity of integrated systems and continuous feature delivery, building highly observable, scalable systems is more difficult than ever.

That being said, this post is meant to serve as your Incident Management Handbook—a comprehensive overview of the incident life cycle and how you can holistically approach incident management.

If you take nothing else from this, remember to do anything you can to make current and future incident detection, response, and resolution better for your people.

Incident Management Maturity

Your level of incident management maturity refers to how well you understand your system and how effectively you can diagnose, respond to, and resolve problems. We typically define four levels of incident management maturity:

Reactive

  • Little to no visibility or awareness of your system’s performance
  • Piecemeal system/process for communicating during an outage
  • Undefined personnel roles and/or processes throughout the entire incident life cycle
  • Lack of organization or tooling for monitoring, alerting, and collaboration, which hinders on-call notifications, alert routing, escalation, and incident remediation

Tactical

  • Some organized processes and tooling for monitoring, alerting, and incident response
  • Segmented personnel roles and alert prioritization for better defined incident management
  • Methods for communicating and collaborating around issues
  • Implemented rough policies and procedures for managing an incident throughout the entire life cycle

Integrated

  • Deeper analyses—building off learnings from past incidents via post-incident reviews
  • Triage documentation and runbooks
  • Contextual alerts containing applicable metrics, traces, and logs
  • Consistent cross-functional, end-to-end collaboration and communication methods and alert routing functionality

Holistic

  • More opportunity for incident self-remediation and actionable alerts to cut down on over-alerting or under-alerting
  • Advanced metrics, thresholds, and alerts
  • Not only making incident management more efficient, but making it better for the people involved
  • Consistent, defined communication methods that optimize workflows, improve incident visibility, and make incident management suck less
  • Ability to continuously learn and improve on the complete incident management process via post-incident reviews, deeper metrics, and a better understanding of your own internal incident collaboration

The Modern Incident Management Life Cycle

Understanding the five stages of the incident life cycle will give you greater insight into building a more robust incident management workflow. No matter what you do, incidents will happen. Detecting, responding to, and remediating an incident is no longer enough. Truly holistic incident management means you’re analyzing past incidents and using that information to prepare for future ones, expediting the entire process.

Stage 1: Detection

Too many false alerts + Too many interruptions = Acute Alert Fatigue

Proper monitoring and striking the balance between under-alerting and over-alerting is the key to avoiding alert fatigue and detecting actionable incidents quickly. Time-series databases and visualization tools allow you to monitor your system’s performance in real time. If you’re unable to detect errors or failures quickly, you’re already behind the 8-ball. Setting up monitoring tools, thresholds, and alerts for key pain points of your infrastructure will help you identify issues, sometimes before they even happen.
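One common way to strike that balance is to alert only when a metric stays over its threshold for several samples in a row, instead of paging on every momentary spike. The sketch below is a minimal illustration of that idea, not the behavior of any particular monitoring tool; `ThresholdRule`, `should_alert`, and the 5% error-rate limit are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """Hypothetical alert rule: fire only when a metric stays above a limit."""
    limit: float          # value the metric must exceed
    min_consecutive: int  # how many of the latest samples must exceed it (>= 1)


def should_alert(samples: list[float], rule: ThresholdRule) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed the limit."""
    if len(samples) < rule.min_consecutive:
        return False  # not enough data yet to call it sustained
    recent = samples[-rule.min_consecutive:]
    return all(value > rule.limit for value in recent)


# Assumed example: error rate sampled once per minute, 5% limit, 3 samples in a row.
rule = ThresholdRule(limit=0.05, min_consecutive=3)
spike = [0.01, 0.09, 0.02, 0.01]      # one bad minute, then recovery
sustained = [0.01, 0.06, 0.07, 0.08]  # three bad minutes in a row

print(should_alert(spike, rule))      # False
print(should_alert(sustained, rule))  # True
```

Raising `min_consecutive` trades detection speed for fewer false alerts, so the right value depends on how costly a missed minute is for the service in question.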

Detection really boils down to prioritization and visibility. Once you’re able to collect the metrics you need, you can start to visualize the data and share the necessary information with applicable team members. The faster you can detect a problem and understand it, the faster you’ll be able to respond.

Stage 2: Response

A centralized timeline, with an efficient system of incident categorization and prioritization, provides cross-team visibility and gets actionable alerts to the right people at the right time. Not only should the alert be routed to the proper person in a timely manner, but the alert needs to come with actionable context. With the context, you can provide relevant runbooks or triage instructions in order to give the on-call engineer exactly what they need to resolve that specific problem.
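To make "actionable context" concrete, here is a rough sketch of routing an alert to a team and attaching the matching runbook on the way. The service names, team names, and runbook paths are invented for illustration, and a real incident management tool will have its own routing configuration.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Alert:
    service: str
    summary: str
    severity: str
    context: dict[str, str] = field(default_factory=dict)


# Hypothetical routing table: which team owns a service and which runbook applies.
ROUTES = {
    "checkout": {"team": "payments-oncall", "runbook": "runbooks/checkout-latency.md"},
    "search": {"team": "search-oncall", "runbook": "runbooks/search-errors.md"},
}
DEFAULT_ROUTE = {"team": "platform-oncall", "runbook": "runbooks/general-triage.md"}


def route(alert: Alert) -> tuple[str, Alert]:
    """Pick the owning team and return a copy of the alert with the runbook attached."""
    target = ROUTES.get(alert.service, DEFAULT_ROUTE)
    enriched = replace(alert, context={**alert.context, "runbook": target["runbook"]})
    return target["team"], enriched


team, enriched = route(
    Alert("checkout", "p99 latency above limit", "critical", {"metric": "p99_latency_ms"})
)
print(team)                         # payments-oncall
print(enriched.context["runbook"])  # runbooks/checkout-latency.md
```

The fallback route matters as much as the table itself: an alert for a service nobody has claimed should still land with a named team instead of disappearing.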

Using ChatOps tools and workflows will allow multiple teams and people to collaborate around incidents. In addition to optimizing chat for incident response, you’ll need essential incident management functionality such as dynamic on-call scheduling, team rotations, scheduled overrides, and automated escalation and alert routing functionality. This moves past a simple alerting tool and creates an environment where you’re getting notified, responding to an issue, and collaborating around the problem.
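Rotations, overrides, and escalation fit together in a simple way, and the sketch below shows one possible arrangement. It assumes a weekly rotation and per-day overrides; the names, dates, and the rule that escalation falls through the rest of the rotation are all assumptions made for the example, not how any specific product works.

```python
from datetime import date

# Hypothetical weekly rotation, starting on a Monday.
ROTATION = ["alice", "bob", "carol"]
ROTATION_START = date(2024, 1, 1)

# Hypothetical scheduled override: dave covers a single day.
OVERRIDES = {date(2024, 1, 10): "dave"}


def primary_on_call(day: date) -> str:
    """Return the override for this day if one exists, else the rotation member."""
    if day in OVERRIDES:
        return OVERRIDES[day]
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ROTATION[weeks_elapsed % len(ROTATION)]


def escalation_chain(day: date) -> list[str]:
    """Primary first, then the remaining rotation members in rotation order."""
    primary = primary_on_call(day)
    return [primary] + [person for person in ROTATION if person != primary]


print(primary_on_call(date(2024, 1, 3)))    # alice
print(primary_on_call(date(2024, 1, 9)))    # bob
print(escalation_chain(date(2024, 1, 10)))  # ['dave', 'alice', 'bob', 'carol']
```

An automated escalation policy would walk this chain, paging the next person whenever the current one has not acknowledged within an agreed time.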

Every second you spend responding to an incident is time that could have been spent improving future reliability and uptime. Smooth incident response workflows will provide your team(s) with more information, make them happier, and improve overall incident response.

Stage 3: Remediation

Your incident management software should act as a single pane of glass for everything from current system reliability to new deploys to production. Robust log analytics tools, time-series databases, visualization tools, and ChatOps tools can live in one location and provide full visibility and collaboration. Whoever needs to be involved in remediation can be easily pulled in, with the data, runbooks, and triage instructions they need, to quickly resolve an incident.

In fact, remediation of an incident should be the smallest part of the process. If your team is prepared, knowledgeable about the system as a whole, and given the tools they need for actionable monitoring, alerting, and collaboration, incident remediation will be quick. But when it comes to remediation, it’s important to pay attention to incident prioritization and make sure that high-severity, potentially customer-facing incidents are resolved first.
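That prioritization rule can be written down as a sort order. The sketch below is one possible version: severity first, customer-facing incidents ahead of internal ones at the same severity, and oldest first as a tie-breaker. The severity labels and the `Incident` fields are assumptions for the example.

```python
from dataclasses import dataclass

# Hypothetical severity scale; lower rank means handle sooner.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Incident:
    title: str
    severity: str
    customer_facing: bool
    opened_order: int  # lower means opened earlier


def triage_order(incidents: list[Incident]) -> list[Incident]:
    """Sort by severity, then customer-facing first, then oldest first."""
    return sorted(
        incidents,
        key=lambda i: (SEVERITY_RANK[i.severity], not i.customer_facing, i.opened_order),
    )


queue = [
    Incident("Stale cache on internal dashboard", "low", False, 1),
    Incident("Nightly batch job failing", "high", False, 2),
    Incident("Checkout errors", "high", True, 3),
    Incident("Database primary down", "critical", True, 4),
]

for incident in triage_order(queue):
    print(incident.severity, "-", incident.title)
```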

Stage 4: Analysis

A solid post-incident review will prepare you for future issues and help you build more robust incident management processes. Take note of the timeline throughout an incident. Who was involved? How long did it take to detect the issue? How was the issue detected, responded to, and resolved? Fully understanding the actions taken (on-call response, alert escalations, communication, etc.), the tools and processes that were effective, and exactly how individual sections of the incident management life cycle played out can help you build more robust incident management workflows.
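If you record a timestamp for each milestone, the "how long did it take" questions become simple arithmetic. Here is a small sketch under that assumption; the milestone names and the sample times are made up, and your own timeline may track different events.

```python
from datetime import datetime

# Hypothetical timeline for a single incident.
timeline = {
    "started": datetime(2024, 3, 5, 14, 0),
    "detected": datetime(2024, 3, 5, 14, 6),
    "acknowledged": datetime(2024, 3, 5, 14, 9),
    "resolved": datetime(2024, 3, 5, 14, 51),
}


def phase_durations(events: dict[str, datetime]) -> dict[str, float]:
    """Return how long each phase took, in minutes."""
    def minutes(start: str, end: str) -> float:
        return (events[end] - events[start]).total_seconds() / 60

    return {
        "time_to_detect": minutes("started", "detected"),
        "time_to_acknowledge": minutes("detected", "acknowledged"),
        "time_to_resolve": minutes("detected", "resolved"),
    }


print(phase_durations(timeline))
# {'time_to_detect': 6.0, 'time_to_acknowledge': 3.0, 'time_to_resolve': 45.0}
```

Numbers like these are a starting point for the review conversation, not a replacement for it: they show where time went, and the people involved explain why.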

Stage 5: Readiness

All of the previous stages lead up to ultimate readiness. The key is simply to pay attention and take action. Don’t keep doing things incorrectly just because fixing the process takes an initial investment of time. Spend time optimizing alert rules, suppressing alert noise, adding applicable alert annotations, and defining effective escalation and routing policies. Build, maintain, and collaborate around software that enables your team to seamlessly address incidents and lower MTTA/MTTR (mean time to acknowledge and mean time to resolve).
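As one example of suppressing alert noise, the sketch below drops repeat alerts that share a key with an alert already delivered inside a quiet window. The key format, the 10-minute window, and the function name are assumptions for illustration; real alerting tools offer their own deduplication and grouping settings.

```python
from datetime import datetime, timedelta


def suppress_duplicates(
    alerts: list[tuple[datetime, str]], window: timedelta
) -> list[tuple[datetime, str]]:
    """Keep an alert only if no alert with the same key was kept within `window`.

    `alerts` is a list of (timestamp, key) pairs, assumed sorted by timestamp.
    """
    last_kept: dict[str, datetime] = {}
    kept = []
    for timestamp, key in alerts:
        previous = last_kept.get(key)
        if previous is None or timestamp - previous >= window:
            kept.append((timestamp, key))
            last_kept[key] = timestamp
    return kept


t0 = datetime(2024, 3, 5, 9, 0)
incoming = [
    (t0, "disk-full:web-1"),
    (t0 + timedelta(minutes=1), "disk-full:web-1"),   # repeat, suppressed
    (t0 + timedelta(minutes=2), "cpu-high:web-2"),    # different key, kept
    (t0 + timedelta(minutes=11), "disk-full:web-1"),  # outside the window, kept
]

delivered = suppress_duplicates(incoming, window=timedelta(minutes=10))
print(len(delivered))  # 3
```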

DevOps-Focused Incident Management Influences SRE

A DevOps culture influences site reliability and improves incident management. A culture of accountability and collaboration is key to SRE and overall incident management readiness. When you’re responsible for maintaining the systems you create, you’ll be more cognizant of building something stable.

All incidents flow through the incident management life cycle, but expediting the process is in your hands.

VictorOps is purpose-built for managing and collaborating around an incident throughout the entire life cycle. Check out our free guide to see how a DevOps culture benefits incident life cycle management.
