At its core, incident response is simple. Your monitoring solution sends an alert according to the rules you’ve defined, the alert is routed to the person or team who can fix the issue, and that person or team resolves the incident. But nuance is required across the board when it comes to your approach to incident management and response. And, as you’ll find out in this article, high levels of collaboration are essential to any successful on-call incident response plan.
DevOps, SRE and on-call teams spend hours tweaking monitoring thresholds, setting up anomaly detection services and working to improve the overall observability of the system. However, incident response and the way people work together during a firefight are often overlooked. Yet in our State of On-Call Report, we found that, on average, 73% of an incident’s entire lifecycle is spent in the response phase.
So, optimizing incident response should actually be one of the first places you look when trying to improve your overall incident management process. Let’s take a look at how you can go about creating an effective collaborative incident response plan for your team.
Preparing for incident response
Being prepared for an incident is nearly as important as the incident response itself.
In words often attributed to Abraham Lincoln, “Give me six hours to chop down a tree and I will spend the first four sharpening the axe.”
Through collaborative post-incident reviews, on-call teams can improve their ability to respond to future incidents and bolster overall service resilience.
Preparation reduces anxiety for on-call responders and gives them the tools, annotations and alert context they need to quickly remediate the issue. Building out a plan for outage communication and the execution of incident response, for both internal stakeholders and customers, leads to efficient workflows and rapid incident response – but mainly, on-call that doesn’t suck for your employees.
In order to strengthen your incident response efforts, you need to understand how incident response fits into the greater incident lifecycle. So, let’s take a peek at the different phases of the incident lifecycle and how improved incident response feeds into the efficiency of the rest of the lifecycle.
Steps of the incident lifecycle
What’s the best way to detect an issue and notify on-call responders of it? Are your monitoring solutions properly set up to alert on application and infrastructure concerns? Are they optimized to limit alert fatigue while still notifying the team of major incidents?
Constantly looking for new ways to detect incidents and surface alert context faster is imperative to DevOps and IT success. So, incident detection capabilities and the overall observability of your architecture should remain top-of-mind as the team continues to build out new features and services.
The response phase encompasses the actual acknowledgment of an alert, any routing and escalating of the alert and the ensuing firefight. High levels of communication and the sharing of real-time alert context are imperative to successful incident response. Whether you’re working on-call with a distributed team or working together locally, people need a way to collaborate in real-time during an incident.
Because the response phase typically takes up so much time, any possible way to shorten the response phase should be welcomed – whether it’s via improved alert routing, getting people on conference calls faster, or automatically surfacing runbooks, logs and charts.
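To see where the time actually goes for your own incidents, you can compare phase durations from a handful of timestamps. The following is a minimal sketch, not a prescribed method: the function name, the four timestamps and the phase boundaries (problem begins, alert fires, fix begins, service restored) are all assumptions made for illustration, and the sample values are made up.

```python
from datetime import datetime


def phase_shares(started, alerted, fix_started, resolved):
    """Return the fraction of the incident lifecycle spent in each phase."""
    total = (resolved - started).total_seconds()
    if total <= 0:
        raise ValueError("resolved must be later than started")
    # Assumed phase boundaries; adjust them to match how your team logs incidents.
    phases = {
        "detection": alerted - started,         # problem begins -> alert fires
        "response": fix_started - alerted,      # alert fires -> fix begins
        "remediation": resolved - fix_started,  # fix begins -> service restored
    }
    return {name: span.total_seconds() / total for name, span in phases.items()}


# Made-up timestamps for a single incident.
shares = phase_shares(
    started=datetime(2024, 1, 1, 12, 0),
    alerted=datetime(2024, 1, 1, 12, 5),
    fix_started=datetime(2024, 1, 1, 12, 45),
    resolved=datetime(2024, 1, 1, 13, 0),
)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Tracking these shares over several incidents gives the team a rough way to check whether routing or runbook changes are actually shrinking the response phase.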
The actual remediation of the incident is typically pretty quick. If incident response has been prioritized and monitoring and alerting thresholds are well-adjusted, the on-call responder will have the information they need to rapidly fix the issue. Whether they need to restart a server or roll back a deployment, they should be able to rapidly assess the situation and determine the course of action. And, if runbooks or playbooks are attached to the alert, the instructions for resolving the incident are sometimes even served on a silver platter.
The first three steps are important to resolving the incident, reducing downtime and mitigating any customer impact. But, keeping a record of the communication, workflows and machine data before and during the incident will help you analyze the incident after the fact. At this point, the team will convene for a post-incident review to understand what worked and what didn’t work during incident detection, response and remediation.
With a detailed, collaborative analysis, the team can constantly improve on-call incident management processes and alerting to maintain robust infrastructure and applications while avoiding over-alerting and keeping up team morale. Although it can take time away from writing code in the short-term, post-incident analysis will actually shorten the incident lifecycle over time and give back more development time.
After analysis, you simply need to ensure the team is ready for an outage or incident when it happens. Does the team have the tools they need to accurately identify an incident, respond accordingly and collaborate to find a solution? Can you provide better annotations or transform an alert as it lands on a responder’s plate? Through process and tooling improvements, you can ensure the team is more prepared when an incident hits.
Additionally, proactive stress testing and chaos experiments in your applications and infrastructure can help DevOps and IT teams identify potential issues before they happen. Proactive experiments lead to more robust infrastructure and help the team better understand systems running in production. With deeper exposure to code and the continuous improvement of processes and tooling, teams are more prepared to take on incident management.
Establishing a plan for on-call incident response and outage communication
Now that you know why real-time response, collaboration and analysis are so important in incident management, you can start building a plan. The basic four-step incident response plan covers notification, triage, mitigation and resolution:
Of course, the first step is to make sure the on-call person or team is notified of an issue when it occurs. Figure out how the team will be notified, when they’ll be notified and adjust monitoring and alerting tools accordingly. Is it better to funnel alerts to one incident commander who can route them appropriately or break down the teams and automatically route alerts based on discipline or service?
Because every team is different, you’ll need to customize the notifications and the processes behind them to fit your specific team. Notifications should be served immediately, should be highly visible, and should provide as much context as possible in order to surface helpful incident details quickly. Continuous improvement of incident management processes and notification methods will help your team maintain a high velocity of reliable deployments.
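The choice between funneling everything to one incident commander and routing by service can be combined in a single rule: route by service where an owner is known, and fall back to the incident commander otherwise. This is only a sketch under that assumption; the `Alert` fields, the team names and the `ROUTES` table are hypothetical and not taken from any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    severity: str
    summary: str


# Hypothetical ownership table: service name -> on-call rotation.
ROUTES = {
    "payments": "payments-oncall",
    "database": "dba-oncall",
}

# Alerts for services without a known owner go to one person who routes them by hand.
DEFAULT_TARGET = "incident-commander"


def route(alert: Alert) -> str:
    """Return the rotation that should be notified for this alert."""
    return ROUTES.get(alert.service, DEFAULT_TARGET)


print(route(Alert("payments", "critical", "Checkout error rate above 5%")))
print(route(Alert("search", "warning", "Index lag growing")))
```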
The better the notifications, the easier the triaging process. Once the notification has been served, the on-call engineer can either start working on the incident or re-route the alert to the person or team who needs to work on it. With improved instructions and more contextual notifications, the first on-call responder can better diagnose the problem and get the right people involved faster.
Triaging the incident is actually pretty straightforward when alerts already have logs, charts and runbooks attached. By integrating your communication tools with monitoring and alerting software in a single-pane-of-glass, your team can more quickly identify the best path for incident remediation.
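One way to picture alerts arriving with runbooks and charts already attached is a small enrichment step that runs before the notification goes out. The sketch below assumes alerts are plain dictionaries with a `check` field; the `CONTEXT` table, the `annotate` function and the example.com links are placeholders invented for illustration.

```python
# Hypothetical lookup table: check name -> links a responder will want first.
CONTEXT = {
    "api_latency": {
        "runbook": "https://wiki.example.com/runbooks/api-latency",
        "dashboard": "https://dashboards.example.com/api",
    },
}


def annotate(alert: dict, context: dict = CONTEXT) -> dict:
    """Return a copy of the alert with any known runbook and dashboard links attached."""
    annotated = dict(alert)  # copy so the original alert is left untouched
    annotated["annotations"] = dict(context.get(alert.get("check"), {}))
    return annotated


alert = {"check": "api_latency", "message": "p99 latency above 2s"}
print(annotate(alert))
```

An alert for an unknown check simply gets an empty `annotations` entry, which is itself a useful signal that a runbook is missing.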
Once the on-call person or team has an understanding of the core issue, the team can start working to mitigate the problem. The on-call team needs to first limit any customer impact ASAP and basically “patch up” the system. Any kind of outage or downtime costs a lot in lost opportunity and revenue, not to mention the negative effects of a poor customer experience.
Getting the system stabilized is the first priority at this stage. Engineers or IT professionals should not yet be looking for the root cause of the incident. Once the incident has been contained and the service has been restored for customers, the team can start looking at a full-fledged resolution.
At this point, the on-call engineers can start to review the incident, identify its systemic root cause, and confirm that incident management KPIs look good and that system health has returned to normal.
For most teams, once the original indicator of the incident has returned to normal and the system is functional again, the incident can be marked as resolved. But, even after the incident has been resolved, it’s important to take steps to keep the incident from happening again, or at least to update the applicable runbooks and monitoring thresholds to reduce the likelihood of its recurrence.
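What “returned to normal” means is a team decision. One possible rule, sketched here purely as an assumption, is to require the original indicator to stay at or below its alert threshold for several consecutive samples, so that a single good reading doesn’t close the incident too early. The function name, the threshold and the sample values are illustrative.

```python
def is_recovered(samples, threshold, required_healthy=3):
    """Return True if the last `required_healthy` samples are all at or below the threshold."""
    if len(samples) < required_healthy:
        return False  # not enough data yet to call the indicator healthy
    return all(value <= threshold for value in samples[-required_healthy:])


# Made-up latency readings in milliseconds, oldest first.
latency_ms = [900, 850, 200, 180, 190]
print(is_recovered(latency_ms, threshold=250))
```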
Collaboration takes center stage
As you can see, a team that works well together builds more reliable systems together. Real-time, optimized collaboration in incident response is the key to efficiency when approaching the four steps of an incident response plan. With a centralized solution and framework for monitoring, alerting and communication, there’s increased incident visibility and collaboration across development, DevOps and IT teams – leading to rapid incident detection, resolution and analysis.
Build out your own collaborative on-call incident response plan within a purpose-built solution for DevOps and IT teams. Sign up for a 14-day free trial of VictorOps to start reducing alert fatigue, surfacing context faster and making on-call suck less.