VictorOps is now Splunk On-Call.
Modern Agile practices and DevOps methodologies are leading to faster feature releases even as systems become more complex. High velocity means more change, and more change leads to more alerts and incidents across applications and infrastructure. The only surefire way for DevOps and IT teams to build reliable services is through proactive testing and an efficient on-call incident response plan. So, we thought we’d lay out everything in our complete on-call incident response template.
In on-call operations and incident management, speed and accuracy are essential for maintaining uptime and mitigating the costs of downtime. But, ensuring service reliability and being on-call doesn’t have to suck. Developers and IT professionals alike are learning to work better together in a DevOps environment and share accountability for the overall resilience of the services they build. Through flexibility, transparency and better collaboration, on-call responders are reducing on-call stress and anxiety.
The on-call incident response template covers each step of the incident lifecycle and how teams are ensuring system reliability without harming employee morale. The template will help DevOps and IT teams develop a complete plan for identifying incidents, notifying the right people, and reducing the mean time to acknowledge and resolve (MTTA/MTTR) incidents over time.
First and foremost, engineering and IT organizations should always put the end-user first. Technical support help desks exist to respond quickly to customer problems and find ways to fix those issues. Many times, issues need to be escalated from customer support to disparate IT and development teams. So, help desk communications need to be tightly integrated with internal IT service desks and other development teams.
Many times, customer support agents and IT service desks will use the same software in order to improve the speed and accuracy of escalations and internal collaboration. Any way to improve visibility between the help desk, the service desk and the engineering organization will drastically improve efficiency and collaboration throughout all of on-call incident response. A working agreement between help desks, service desks and engineering will ensure everyone understands their priorities and responsibilities in the greater incident response plan.
Once you’ve assessed the general outline for your incident response plan, you’ll need to define who’s on-call. Which teams and specific users should be on-call? And, which services should they be on-call for? The way you set up on-call teams may not be the same as how you set up overall engineering and IT teams. On-call teams can be set up based on alert severity, monitoring tool, underlying service, or engineering discipline – whichever works best for your team.
Think about the best way to handle manual incidents and automated incidents generated by monitoring tools. Do you need multiple on-call teams for each? Or, should manual and automated incidents be handled the same way based on alert source and escalation policies? Unfortunately, because every service is different, there’s no single way to set up efficient on-call incident response. Understanding the way teams work together alongside technical services and infrastructure can help you determine the best way to set up on-call teams and users.
After you’ve organized on-call users and teams, you need to establish a permissions hierarchy. Who should be an administrator with the power to change teams, users, schedules, etc.? And who should be a standard user – someone who needs to be on-call and use the platform, but doesn’t need to be weighed down with platform administration tasks?
A hierarchy of user permissions allows users to focus on real-time incident response, not organizational tasks. Intelligent permissions also allow for stakeholder visibility into on-call operations without putting people into rotations.
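A minimal way to think about that hierarchy is roles mapped to explicit capability sets. This sketch is illustrative only – the role names and capabilities are assumptions, not the VictorOps data model:

```python
from enum import Enum

class Role(Enum):
    ADMIN = "admin"              # manages teams, users, schedules
    RESPONDER = "responder"      # on-call; acknowledges and resolves incidents
    STAKEHOLDER = "stakeholder"  # read-only visibility, never in a rotation

# Every permission is spelled out explicitly per role, so there's no
# accidental inheritance and the hierarchy is easy to audit.
CAPABILITIES = {
    Role.ADMIN: {"manage_users", "edit_schedules", "ack_incident", "view_incidents"},
    Role.RESPONDER: {"ack_incident", "view_incidents"},
    Role.STAKEHOLDER: {"view_incidents"},
}

def can(role: Role, action: str) -> bool:
    """Check whether a role is allowed to perform an action."""
    return action in CAPABILITIES[role]
```

Stakeholders get `view_incidents` and nothing else, which is exactly the "visibility without being in a rotation" idea above.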
With users and teams organized and given proper permissions, you can start adding on-call schedules and rotations. You can assign schedules and rotations based on team or by the underlying service. Schedules and rotations allow you to think strategically about on-call and ensure there are no gaps in incident response.
Standardizing on-call schedules is the key to flexible, efficient incident response. If you create a template for on-call schedules, you can easily rotate users in and out in case of planned or unplanned absences. By routing alerts through specific on-call schedules and escalation policies, not to specific users, you can make rapid changes to schedules without dropping coverage. In a single tool for on-call incident management, you can also see who else is on-call and easily switch shifts – improving the flexibility and visibility of on-call operations.
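The core of a templated rotation is simple: given a start time, an ordered list of users and a shift length, you can compute who is on call at any moment. A rough sketch (user names and shift length are illustrative):

```python
from datetime import datetime, timedelta, timezone

def on_call_now(rotation_start: datetime, users: list[str],
                shift: timedelta, now: datetime) -> str:
    """Return the user on call at `now` for a simple repeating rotation.

    The rotation cycles through `users` in order, one `shift` at a time,
    starting at `rotation_start`. Swapping a shift is just editing the
    list – alerts route to the schedule, not to a hard-coded person.
    """
    elapsed = now - rotation_start
    index = int(elapsed / shift) % len(users)
    return users[index]

start = datetime(2024, 1, 1, tzinfo=timezone.utc)
team = ["alice", "bob", "carol"]  # hypothetical responders
```

Because the alert targets the schedule rather than a named user, covering an unplanned absence is a one-line change to the rotation list, with no gap in coverage.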
But, what happens if an alert doesn’t reach the right on-call rotation? What if the initial on-call responder is driving? How is the team supposed to escalate the issue or reroute the problem to a different person or team? Backing up on-call rotations with primary and secondary escalation policies will ensure that alerts never go unacknowledged. And, you can leverage time-based automation to automatically escalate issues if they haven’t been acknowledged for a given length of time.
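Conceptually, a time-based escalation policy is just an ordered list of (delay, target) steps. A minimal sketch – the targets and delay values here are made-up examples, not a recommended policy:

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    delay_minutes: int  # minutes after the alert fires before this step triggers
    target: str         # user, rotation, or team to notify

# Primary responder is paged immediately; if nobody acknowledges,
# the alert automatically escalates to the secondary, then the manager.
POLICY = [
    EscalationStep(0, "primary-on-call"),
    EscalationStep(10, "secondary-on-call"),
    EscalationStep(25, "team-manager"),
]

def notified_so_far(minutes_unacked: int,
                    policy: list[EscalationStep] = POLICY) -> list[str]:
    """Everyone paged if the alert has gone unacknowledged this long."""
    return [s.target for s in policy if minutes_unacked >= s.delay_minutes]
```

An alert unacknowledged for 12 minutes has paged both the primary and the secondary – no human had to remember to escalate it.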
The team needs to understand the acceptable process for escalating incidents and how to use automation to make incident response better. Escalations aren’t an excuse for ignoring alerts and allowing them to go to your manager or teammate. Thorough on-call reporting is required for managers to determine who’s responding to frequent issues and who’s often allowing incidents to be escalated. This creates accountability for incident response while simultaneously helping managers reorganize on-call schedules and escalation policies to help employees avoid alert fatigue and burnout.
Without the right monitoring metrics and logs, you can’t adequately alert on-call teams. What are the third-party dependencies of your system? Which areas of your applications and infrastructure are vulnerable to security or reliability concerns? What tools or techniques can you use to monitor these problems? Metrics, logs and traces should be used to monitor systems in production and create observability into performance and service availability.
A combination of synthetic monitoring and real-user monitoring can be used to proactively address resilience concerns. Then, with proper alert thresholds applied to metrics, logs and traces, teams are able to put incident context directly into the on-call responder’s hands. It’s important for resilient DevOps and IT teams to identify gaps in visibility and implement monitoring tools to close them.
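A "proper alert threshold" usually means more than comparing one sample against a limit – requiring the breach to be sustained avoids paging someone over a momentary spike. A sketch of that idea (threshold and window values are illustrative):

```python
def breaches_threshold(samples: list[float], threshold: float, sustain: int) -> bool:
    """Fire only if the last `sustain` samples all exceed `threshold`.

    Requiring a sustained breach filters out one-off spikes that would
    otherwise wake an on-call responder for nothing.
    """
    recent = samples[-sustain:]
    return len(recent) == sustain and all(v > threshold for v in recent)
```

A single sample at 95% disk utilization stays quiet; three consecutive samples above 90% fire the alert.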
Now that you’re watching for the right things, you need to know the best ways to notify on-call responders. Integrated monitoring and alerting means leveraging an alert rules engine and automation to surface incident context to the right responders immediately. If a server’s disk utilization spikes, will the monitoring tool automatically trigger an alert and is that alert automatically routed through escalation policies and alert rules to the proper on-call person or team?
Constant tweaking of alert rules and implementation of automation throughout the monitoring and alerting process will lead to highly efficient incident management workflows. DevOps and IT teams can automatically attach metrics, logs, traces and charts to alerts that are automatically served to the correct person. And, if they aren’t directed properly the first time, you can easily reroute the notification to other teams or escalation policies.
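At its simplest, an alert rules engine is first-match routing: compare alert fields against rule conditions, then annotate the alert with a destination and extra context. This is a generic sketch, not the VictorOps rules engine – the field names, team names and runbook URL are hypothetical:

```python
RULES = [
    # If a critical disk-utilization alert fires, send it to the infra
    # rotation and attach the relevant runbook as context.
    {"match": {"metric": "disk_utilization", "severity": "critical"},
     "set": {"route_to": "infra-on-call",
             "runbook": "https://wiki.example.com/disk"}},  # hypothetical URL
]

def route_alert(alert: dict, rules: list[dict]) -> dict:
    """Apply the first rule whose `match` fields all equal the alert's fields."""
    for rule in rules:
        if all(alert.get(k) == v for k, v in rule["match"].items()):
            return {**alert, **rule["set"]}
    # No rule matched: fall back to a catch-all team so nothing is dropped.
    return {**alert, "route_to": "default-team"}
```

Rerouting a misdirected alert is then a matter of editing one rule, not rewiring the monitoring tool.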
Bringing automation, context and collaboration into one single tool makes on-call suck less. Now that alerts are more often served automatically, with context, to the right person, the on-call responder needs efficient ways to collaborate with affected parties. With chat applications like Slack integrated with ChatOps tools like Hubot, on-call responders can communicate with other people on the team and execute commands directly from their chat tools.
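The ChatOps pattern boils down to pattern-matched command handlers, the same model Hubot uses. A language-agnostic sketch of the dispatch loop (the `status` command and its reply are placeholders; a real bot would query your monitoring tools):

```python
import re

HANDLERS = []

def command(pattern: str):
    """Register a chat command handler for messages matching `pattern`."""
    def register(fn):
        HANDLERS.append((re.compile(pattern), fn))
        return fn
    return register

@command(r"status (\w+)")
def status(match: re.Match) -> str:
    service = match.group(1)
    # Placeholder reply; a real handler would hit a monitoring API.
    return f"{service}: all checks passing"

def handle_message(text: str):
    """Dispatch a chat message to the first matching handler, if any."""
    for pattern, fn in HANDLERS:
        m = pattern.match(text)
        if m:
            return fn(m)
    return None
```

Typing `status api` in the channel gets an answer in front of the whole team, so the diagnosis is visible to everyone in the firefight.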
A single source of truth improves the visibility and collaboration between on-call teams in a major firefight. And, after the fact, engineering teams have a more holistic view of everything that took place to remediate the incident. Alongside detailed context from the monitoring tools connected to alerting systems, on-call engineers know exactly what’s going on and how they can quickly communicate about the problem.
Those previous steps mostly take care of the real-time incident response process. But, improved incident response relies heavily on learning from the past. So, DevOps and IT teams are tracking important incident management KPIs and reporting on them to improve over time. Reporting can show the frequency of incidents from certain services, the speed of incident response and overall on-call efficiency.
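MTTA and MTTR fall straight out of incident timestamps: average the opened-to-acknowledged and opened-to-resolved gaps. A minimal sketch, assuming each incident record carries `opened`, `acked` and `resolved` times:

```python
from datetime import timedelta

def mtta_mttr(incidents: list[dict]) -> tuple[timedelta, timedelta]:
    """Mean time to acknowledge and mean time to resolve for a set of incidents.

    Each incident dict is assumed to hold `opened`, `acked` and `resolved`
    datetime values (a simplification of whatever your tool exports).
    """
    n = len(incidents)
    mtta = sum((i["acked"] - i["opened"] for i in incidents), timedelta()) / n
    mttr = sum((i["resolved"] - i["opened"] for i in incidents), timedelta()) / n
    return mtta, mttr
```

Tracking these two numbers per service, week over week, turns "on-call is getting better" from a feeling into a measurable trend.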
Reporting can lead to better runbooks and incident documentation – giving on-call responders exact directions for resolving recurring incidents. Constant reporting and updates to documentation ensure that you’re taking action on everything learned from past incidents.
Every major incident requires a post-incident review. The involved team needs to come together and discuss what happened and how they can improve. Deeper incident investigations lead to better insights. Not only should you address the technical application and infrastructure issues leading to an incident, but you should also look at the incident response process and the people involved. What worked well during the firefight? What was missing that would’ve been helpful? Use post-incident reviews as a collaborative way to openly discuss reliability concerns without fear of blame.
Every incident is a learning opportunity. Continuously improve the way you deliver services to production and how you approach incident monitoring and alerting. Then, prepare your on-call team with the tools and knowledge they need to effectively tackle real-time major incident response. Continuous improvement of people, processes and technology is the only true way to proactively build resilient services and prepare for on-call operations.
Take this template for on-call incident response and start applying it to your team. Approach each step as if it were part of a checklist and rate yourself from 1-10 on how well you’re executing each step of the incident lifecycle. Write down action items you can take under each section to improve the efficiency of on-call incident management. Then, apply those solutions and take your first steps toward continuous improvement of on-call operations.
Make on-call suck less with a centralized tool for on-call scheduling, alert automation and real-time collaboration. Sign up for a 14-day free trial or request a free personalized demo of VictorOps to start using a comprehensive incident management plan that works.