Reducing MTTA: Escalation

Dan Holloran March 22, 2019

Monitoring & Alerting On-Call

In an integrated on-call incident management tool, automated and manual escalations built for human workflows become essential. In part 4 of our reducing MTTA series, we’ll talk about how you can create a system for rapid incident response through optimized escalation processes. Being able to quickly reroute an incident and involve the right people and teams from the get-go allows on-call responders to acknowledge and remediate incidents faster.

Watch as I discuss a few helpful tips for managing escalation policies in VictorOps that can create a better on-call experience for everyone involved:

If you haven’t already, don’t forget to check out the rest of the reducing MTTA series (links below):

How to Make On-Call Suck Less

Setting up rotations and escalations

In DevOps and IT, all on-call teams will look a little different. In order to set up effective rotations and escalation policies, you need to achieve a balance between constant coverage and schedules that don’t kill employee morale. These rotations and escalations need to work together in a logical way. It’s a good idea to do a mind-map of the alerting, on-call rotation and escalation workflow just to make sure you’re not exposed to any blind spots.

Luckily, an on-call incident response solution like VictorOps can help you visualize these workflows and identify any coverage issues. Primary, secondary and personal escalation policies create multiple ways in which incidents can be routed quickly and easily between both individual on-call users and teams. This allows flexibility in response and helps teams easily get incidents to the right person – making incident management suck less.

Think about on-call operations in individual shifts and schedules, team rotations and associated escalation policies. Understand how these on-call structures work together and interact alongside your applications and infrastructure to build out escalation policies that address inefficiencies in the process.

Tips for building an escalation policy structure

1) Streamline the on-call teams

Don’t create multiple teams with multiple separate rotations. Instead, create one team with multiple escalation policies. This way, you can maintain team schedules and personal on-call shifts in a simple directory while keeping alert routing and escalations flexible. You can even use this as a method for prioritizing notifications and making sure issues are surfaced to the right people at the right times. In this system, it’s easier to track on-call calendars, both primary and secondary, and ensure you’re not missing any coverage.
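To make the "one team, many escalation policies" idea concrete, here is a minimal sketch in Python. It is not the VictorOps API: the `Team` and `EscalationPolicy` classes, the routing keys and the user names are all hypothetical, and they only illustrate how a single team directory can hold several policies selected per alert.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EscalationPolicy:
    """Hypothetical policy: an ordered list of steps, each naming who to notify."""
    name: str
    steps: List[List[str]]


@dataclass
class Team:
    """Hypothetical team that owns several escalation policies, keyed by routing key."""
    name: str
    policies: Dict[str, EscalationPolicy] = field(default_factory=dict)

    def add_policy(self, routing_key: str, policy: EscalationPolicy) -> None:
        self.policies[routing_key] = policy

    def route(self, routing_key: str, default_key: str = "default") -> EscalationPolicy:
        # Fall back to the team's default policy when no specific one matches.
        policy = self.policies.get(routing_key)
        return policy if policy is not None else self.policies[default_key]


platform = Team("platform")
platform.add_policy("default", EscalationPolicy("Primary", [["alice"], ["bob"]]))
platform.add_policy("database", EscalationPolicy("Database", [["carol"], ["alice", "bob"]]))

print(platform.route("database").name)  # Database
print(platform.route("unknown").name)   # Primary
```

Keeping every policy under one team means there is a single place to check who is covering what, while the routing key still decides which escalation path an alert follows.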

2) Leveraging “waiting rooms”

As we briefly mentioned in part three of this series, waiting rooms can be set up in your escalation policies to reduce alert fatigue. You can set time thresholds for the escalation of incidents that are likely to self-resolve. This way, if the issue persists past a certain timeframe, you can alert real people to look into the issue and assess further actions. Waiting rooms can stop on-call teams from being woken up at 4 AM for common incidents that are known to auto-resolve.
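The waiting-room logic boils down to a time threshold check. The sketch below assumes a 10-minute threshold and a hypothetical `should_page` helper. Neither comes from VictorOps, and a real tool would configure this in the escalation policy itself rather than in code.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed threshold for illustration; pick one that matches how long
# your self-resolving incidents usually take to clear.
WAIT = timedelta(minutes=10)


def should_page(
    opened_at: datetime,
    resolved_at: Optional[datetime],
    now: datetime,
    wait: timedelta = WAIT,
) -> bool:
    """Page a human only if the incident is still open after the waiting period."""
    if resolved_at is not None and resolved_at <= now:
        return False  # the incident resolved on its own; nobody gets woken up
    return now - opened_at >= wait


opened = datetime(2019, 3, 22, 4, 0)
print(should_page(opened, None, opened + timedelta(minutes=5)))   # False
print(should_page(opened, None, opened + timedelta(minutes=10)))  # True
print(should_page(opened, opened + timedelta(minutes=3), opened + timedelta(minutes=10)))  # False
```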

3) Improve visibility to secondary on-call schedules

If you’re building a homegrown on-call schedule or maintaining your calendar in a spreadsheet, secondary calendars likely aren’t a thing. In VictorOps, you can set up primary and secondary on-call schedules that are also associated with primary and secondary escalation policies. You can maintain separate escalation policies (i.e. the second step of the on-call notifications) for teams or users who are the backup to the primary rotation. Then, in the calendar view, you can see both the primary and secondary coverage for each team.
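One way to picture a primary rotation backed by a secondary one is as a list of timed steps. The following sketch is an illustration only: the 15-minute delay, the `POLICY` list and the `notified_so_far` function are assumptions, not VictorOps settings.

```python
from typing import List, Tuple

# Each step is (minutes without acknowledgement, who gets notified).
EscalationStep = Tuple[int, str]

# Hypothetical policy: page the primary rotation immediately and the
# secondary rotation if nobody has acknowledged after 15 minutes.
POLICY: List[EscalationStep] = [
    (0, "primary rotation"),
    (15, "secondary rotation"),
]


def notified_so_far(steps: List[EscalationStep], minutes_unacknowledged: int) -> List[str]:
    """Return every target whose step delay has already elapsed."""
    return [target for delay, target in steps if delay <= minutes_unacknowledged]


print(notified_so_far(POLICY, 5))   # ['primary rotation']
print(notified_so_far(POLICY, 20))  # ['primary rotation', 'secondary rotation']
```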

4) Take advantage of scheduled overrides

Scheduled overrides are an excellent way to cover escalation policies when users or teams have a planned absence. At the escalation policy level, if you know you’ll be unavailable, you can ask that some other team’s escalation policy be applied to a certain rotation. This way, you can make flexible schedule changes and ensure there’s never a gap in coverage. Scheduled overrides make on-call suck less for individual responders and, at the same time, feed value and efficiency into the on-call process.
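A scheduled override can be thought of as a time-boxed substitution. The sketch below uses a hypothetical `Override` record and `effective_target` function, with targets as plain strings that could stand for a user or an escalation policy. It shows the lookup logic only and is not how VictorOps stores overrides.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Override:
    """Hypothetical override: `replacement` covers for `original` in [start, end)."""
    original: str
    replacement: str
    start: datetime
    end: datetime


def effective_target(scheduled: str, overrides: List[Override], at: datetime) -> str:
    """Return who actually gets paged at time `at`, honoring any active override."""
    for override in overrides:
        if override.original == scheduled and override.start <= at < override.end:
            return override.replacement
    return scheduled  # no active override, so the normal schedule applies


overrides = [
    Override("alice", "bob", datetime(2019, 3, 25, 9, 0), datetime(2019, 3, 29, 17, 0)),
]

print(effective_target("alice", overrides, datetime(2019, 3, 26, 2, 0)))  # bob
print(effective_target("alice", overrides, datetime(2019, 3, 30, 2, 0)))  # alice
```

Because the fallback is always the regular schedule, adding or removing an override never leaves a window with nobody assigned.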

Automated and manual escalations

Escalation policies can be manual, automated, or both. This means you can set rules that automatically route alerts properly and quiet the noise when necessary, while still manually escalating critical issues when you need to. Offering both manual and automated escalation policies improves workflow transparency and speeds up incident response. With greater flexibility and transparency, on-call responders will be able to acknowledge incidents faster and reroute them to the right people faster. A major reduction in MTTA can simply come from the logical flow of alerts from the system to the people.

Navigate the rest of the reducing MTTA series to continue optimizing on-call operations and improve incident response over time:

Want to know more about leveraging escalation policies and integrated on-call scheduling to reduce MTTA/MTTR? Request a free personalized demo of VictorOps or sign up for a 14-day free trial to see how you can make on-call suck less for your team.
