In a world of highly integrated systems, microservices, cloud infrastructure and constant development, DevOps and IT teams are tasked with finding better ways to keep up with their own processes. By actively testing throughout the development lifecycle and preparing for incident response, you’ll build more resilient services up front while simultaneously being prepared when things go south.
In DevOps and IT, efficient incident response relies on a proactive approach to on-call schedules, monitoring, alert routing and escalation policies. Integrating alerting, on-call scheduling and escalation methods into one plan will lead to a holistic, humane on-call experience. At the end of the day, incident response is dedicated to lowering mean time to acknowledge (MTTA) and mean time to resolve (MTTR) by optimizing collaboration between people, processes and tools.
Incident response is so much more than the tools and processes you implement – it’s how you use those tools and processes to take action. So, in this post, we wanted to first focus on escalation. Purpose-built escalation policies allow engineering and IT teams to get the right people involved at the right times.
We’re going to look at some ways you can build a template for escalation processes, silence unactionable alert noise through escalations and, ultimately, make on-call suck less.
The nature of distributed systems and teams has given way to more cross-functional engineering and IT roles. People are moving into more SRE and DevOps roles and gaining skills in both software development and IT. Development teams are taking ownership of the code they write by taking on-call shifts and staying involved through all stages of release management. The job responsibilities for development, customer support and IT operations are overlapping more and more – leading to changes in the way teams deploy and maintain systems in production.
Because of the changes to workflow visibility, team structure and the software delivery lifecycle, incident management and response processes are changing. DevOps-oriented teams are forming, focusing on building cross-functional, collaborative groups of people who can better maintain a CI/CD pipeline and fix issues faster. The makers of applications and services are now being held accountable – teams are no longer allowing developers to simply write code, make IT ship it and force customer support to deal with the repercussions.
So, as team dynamics change, escalation processes do too. Whether your team is using ITIL principles and a NOC, or you’re a fully built-out DevOps structure, the team needs to know how alerts come into the system and how they should be escalated.
In traditional IT, a NOC would leverage a pretty straightforward call center escalation process – every alert gets routed to level one support, which decides whether it needs to be escalated to level two support. But, after that, nearly every team approaches this process differently. Is level two support the final level of support? If so, you’ll need a broad scope of expertise and maybe even multiple developers and sysadmins on the support team in order to effectively remediate incidents.
Also, does your team want to maintain a singular escalation path or does the team want to use multiple escalation paths? Are escalations based on individual services or are they based on people and teams? How often should you automate escalations and what should your timeframe be between escalations? When are manual escalations necessary? Because every team is built differently and their applications and infrastructure are built differently – there’s no right answer to these questions. But, they’re important questions to think about when defining SLAs and SLOs.
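To make those questions concrete, here’s a minimal sketch of a single automated escalation path written as plain data. The step names, the 15- and 30-minute delays and the `targets_to_notify` helper are all hypothetical – they’re not taken from any particular tool, and your own timeframes will depend on your SLAs and SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str         # hypothetical person or team name
    after_minutes: int  # minutes the alert has gone unacknowledged


def targets_to_notify(steps, minutes_unacknowledged):
    """Return every target whose delay has elapsed, in escalation order."""
    ordered = sorted(steps, key=lambda step: step.after_minutes)
    return [s.target for s in ordered if s.after_minutes <= minutes_unacknowledged]


# Assumed policy for one service: three levels, 15 minutes apart.
checkout_policy = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("secondary-on-call", 15),
    EscalationStep("engineering-manager", 30),
]

print(targets_to_notify(checkout_policy, 20))
```

With these assumed delays, an alert that’s been unacknowledged for 20 minutes has reached the primary and secondary responders but not yet the manager. Keeping one list like this per service (or per team) is one way to model multiple escalation paths.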
Because of improvements to workflow transparency and collaboration, DevOps teams are normally better at identifying the right person or team for incident escalation. And because DevOps teams stay involved throughout the software development lifecycle, incidents often land with a person who already has enough knowledge to resolve the issue. With alert automation, context from monitoring data and multiple escalation policies, teams can often automatically serve the right alert to the right person at the right time.
No matter how you’ve set up on-call teams or escalated issues in the past, you can use this post as a template for building your own escalation plans. If you ever have trouble figuring out how escalations should look at your own organization, try making a value stream map of your escalation paths – it will help you identify blind spots and areas for improvement.
Before you can start establishing escalation workflows, you need to establish how your on-call teams are broken down. For organizational purposes, and to avoid confusion, you’ll want to limit the number of teams you establish for on-call operations. This matters even more if you’re using an on-call scheduling and alerting tool like VictorOps. Break things down by broad function, such as frontend, data and IT operations. Maintaining visibility and setting escalation paths and schedules is a lot easier if you don’t split those groups into even smaller teams (e.g. IT operations - applications vs. IT operations - network). With a larger team broken down by escalation policies, you can see the calendar for the whole team at a glance while still quickly finding the escalation paths for individual subsets of that business unit.
Whether using automated escalations, manual escalations or a combination of the two, visibility into all on-call schedules will make incident response drastically more efficient. If an alert comes into the system and the initial on-call responder doesn’t know where to escalate the issue, they can look at the secondary on-call calendar for that team and see who’s on-call. Then, they can directly notify that person via SMS, phone call, email, Slack, etc. and start working on the issue together – almost immediately. Give people a place to see exactly who’s doing what and when they can escalate issues to certain individuals or teams.
Because the people maintaining your systems are – well – people, they’re going to take time off from work. Whether it’s a long-term vacation or simply an hour or two for a kid’s soccer game, escalation policies need to be as flexible as the schedules behind them. By allowing for scheduled escalation policy overrides, you’ll ensure someone is taking on-call responsibilities in case you’re away from the computer for a while. Ask your team what kind of flexibility they need for on-call rotations and escalations. Acknowledging the fact that teammates have lives outside of work will help you find ways to ensure coverage without overly disrupting employees’ lives.
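As a rough illustration of how a scheduled override might be resolved, the sketch below checks whether the scheduled responder has an active override and, if so, returns whoever is covering. The `Override` type, the names and the dates are made up for the example and don’t reflect any specific product’s data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Override:
    original: str     # person stepping away (hypothetical name)
    replacement: str  # person covering for them
    start: datetime
    end: datetime     # exclusive: coverage stops at this instant


def effective_on_call(scheduled, overrides, now):
    """Return the covering person if an override is active, else the scheduled one."""
    for override in overrides:
        if override.original == scheduled and override.start <= now < override.end:
            return override.replacement
    return scheduled


# Assumed example: Alice is at a soccer game from 15:00 to 17:00, Bob covers.
overrides = [
    Override("alice", "bob", datetime(2024, 5, 4, 15, 0), datetime(2024, 5, 4, 17, 0)),
]

print(effective_on_call("alice", overrides, datetime(2024, 5, 4, 16, 0)))
print(effective_on_call("alice", overrides, datetime(2024, 5, 4, 18, 0)))
```

During the override window the escalation lands on Bob. Once the window closes, it falls back to Alice without anyone having to edit the underlying rotation.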
Waiting rooms are a great way to silence self-healing or unactionable alerts with escalation policies and routing keys. Basically, you can set up a secondary escalation policy that’s used by any alerts coming into your system that don’t require immediate action. Then, you can designate the amount of time that an issue should sit in the waiting room before it gets escalated/rerouted to a team or person who needs to actually dig into the issue. Make sure critical issues never get routed into a waiting room, but for everything else, it’s a great way to silence unactionable notifications in the middle of the night.
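Here’s one way the waiting room logic could look in code. It’s a sketch under stated assumptions: the 20-minute hold, the `Alert` fields and the `-primary` policy naming are invented for illustration, and a real alerting tool would express this through its own routing and escalation configuration.

```python
from dataclasses import dataclass
from typing import Optional

WAITING_ROOM_MINUTES = 20  # assumed hold time before a non-critical alert pages anyone


@dataclass
class Alert:
    routing_key: str
    critical: bool
    age_minutes: int
    resolved: bool = False


def route(alert: Alert) -> Optional[str]:
    """Return the policy that should be paged now, or None if nobody should be."""
    if alert.resolved:
        return None  # the alert self-healed, so there is nothing to page about
    if alert.critical:
        return f"{alert.routing_key}-primary"  # critical alerts skip the waiting room
    if alert.age_minutes < WAITING_ROOM_MINUTES:
        return None  # still sitting in the waiting room
    return f"{alert.routing_key}-primary"  # waited long enough, so escalate


print(route(Alert("database", critical=False, age_minutes=5)))
print(route(Alert("database", critical=False, age_minutes=25)))
print(route(Alert("database", critical=True, age_minutes=0)))
```

A disk-usage blip that clears itself within the hold time never pages anyone, while anything flagged critical goes straight to the on-call responder.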
And, last but not least, make sure you track pertinent incident management and response metrics. By tracking and lowering MTTA and MTTR, you’ll ensure on-call engineers are responding quickly to issues in production. Also, by keeping a close eye on MTTA and on-call reports, you can make sure on-call engineers aren’t ignoring alerts and simply waiting for escalation policies to pass issues along to a new person or team. Over time, you can use MTTA and MTTR as high-level measurements to show that incident response is getting better or worse – then you can take action to constantly improve the way you approach incident response and remediation.
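If your tooling doesn’t report these numbers for you, they’re simple to compute from incident timestamps. The sketch below assumes each incident records when it was triggered, acknowledged and resolved, and it measures MTTR from the trigger time (some teams measure from acknowledgement instead). The `Incident` type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Incident:
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def mtta_minutes(incidents):
    """Mean minutes from trigger to acknowledgement."""
    return mean((i.acknowledged - i.triggered).total_seconds() / 60 for i in incidents)


def mttr_minutes(incidents):
    """Mean minutes from trigger to resolution (an assumed definition)."""
    return mean((i.resolved - i.triggered).total_seconds() / 60 for i in incidents)


# Made-up sample data for two incidents.
incidents = [
    Incident(datetime(2024, 5, 1, 2, 0), datetime(2024, 5, 1, 2, 4), datetime(2024, 5, 1, 2, 30)),
    Incident(datetime(2024, 5, 2, 14, 0), datetime(2024, 5, 2, 14, 6), datetime(2024, 5, 2, 15, 0)),
]

print(f"MTTA: {mtta_minutes(incidents):.1f} min, MTTR: {mttr_minutes(incidents):.1f} min")
```

Tracking these values per team and per week makes it easier to spot when acknowledgement times creep up, which can be a sign that responders are letting escalation timers run out.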
The thoughtful development of on-call calendars, alert rules and escalation policies will save you time and stress. Setting everything up takes a lot of effort up front, but it pays off in the long run. Silencing unactionable alerts and getting notifications to the right people at the right time allows the team to spend more time developing new services and driving customer value.
Give your DevOps and IT teams the autonomy to build out efficient, flexible alerting and escalation policies. Then, track incident response metrics to ensure you’re continuously improving on the speed and effectiveness of your incident management processes. Setting up on-call calendars and escalations seems like a daunting task at first – but it’s totally worth it in the end.