Being on-call doesn’t have to be so bad. Effective incident management processes and tools help your team collaborate and reduce confusion and panic when you’re on-call. Managing on-call schedules well, and integrating them with your other monitoring, alerting, and collaboration tools, creates a holistic incident management solution that gives your entire team greater visibility and leads to faster incident remediation.
But managing on-call scheduling is more than managing systems; it’s also about managing people. Let’s walk through some essential functionality for on-call scheduling tools, as well as some helpful tips for managing your on-call responders and their schedules.
Most homegrown on-call systems simply won’t be as effective as purpose-built tools. An informative, useful on-call response system drives faster incident resolution, and purpose-built on-call incident management tools provide the robust functionality you need without taking up your developers’ time.
Your on-call system should provide helpful alert context to your team and help them communicate rapidly throughout the entire incident lifecycle. Keeping on-call schedules in a centralized platform, in line with your monitoring, alerting, and communication tools, creates a highly effective platform for incident response and remediation.
So, let’s take a peek at some specific functionality in a tool that makes on-call suck a whole lot less.
However you decide to set up your on-call incident response, you need the basic capabilities for managing on-call schedules and calendars. These schedules should be highly visible so teammates can see who’s on-call and when. The ability for managers to easily build and change on-call calendars is just as important as teammates being able to view their schedules and see when they’re on-call.
Teams and rotations need to be defined in your on-call tool. This way, you can give multiple teams visibility into what may be happening in your system and assign rotations accordingly. A rotation is a recurring schedule in which individuals from a team take turns being on duty for a given timeframe. By adding teams and rotations, you can further adjust on-call schedules by individual, team, and specific timeframe.
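To make the idea of a rotation concrete, here is a minimal sketch of a round-robin schedule. It assumes fixed-length shifts and a fixed member order; the function name `on_call_for` and the team names are hypothetical and not taken from any particular on-call product.

```python
from datetime import datetime, timedelta

def on_call_for(members, rotation_start, shift_length, when):
    """Return the member on duty at `when` in a simple round-robin rotation."""
    if when < rotation_start:
        raise ValueError("rotation has not started yet")
    # Whole shifts elapsed since the rotation began (timedelta // timedelta -> int).
    shifts_elapsed = (when - rotation_start) // shift_length
    # Wrap around the member list so the rotation recurs indefinitely.
    return members[shifts_elapsed % len(members)]

# Hypothetical three-person team on weekly shifts starting Monday 09:00.
team = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, 9, 0)
week = timedelta(weeks=1)

print(on_call_for(team, start, week, datetime(2024, 1, 10, 12, 0)))
```

A real tool would layer time zones, handoff times, and per-team rotations on top of this, but the core lookup is the same: elapsed shifts modulo team size.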
Set and define multiple escalation policies so teams can customize the way alerts are escalated within the system. Combined with on-call schedules and rotations, this ensures the correct person gets alerts when they need them. You should be able to view on-call schedules for each escalation policy, easily reroute incidents, or create manual incidents and fire them off through a specific escalation policy.
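An escalation policy can be thought of as an ordered list of steps, each with a delay. The sketch below is one way to model that, assuming an incident escalates only while it stays unacknowledged; `EscalationStep`, `targets_to_notify`, and the example targets are hypothetical names used for illustration.

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    target: str         # person or rotation to notify
    delay_minutes: int  # minutes after the incident opens before this step fires

def targets_to_notify(policy, minutes_unacknowledged):
    """Return every target whose step has fired for an unacknowledged incident."""
    return [
        step.target
        for step in policy
        if step.delay_minutes <= minutes_unacknowledged
    ]

# Hypothetical policy: page the primary at once, then widen the circle.
policy = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 15),
    EscalationStep("engineering manager", 30),
]

print(targets_to_notify(policy, 20))
```

Keeping the policy as plain data makes it easy for each team to define its own steps and for a manager to review who gets paged and when.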
Being able to simply take on-call duties from a teammate provides flexibility and makes sure you always have coverage, even when an absence comes up on short notice. Letting teammates easily take another team member’s on-call shift means your people can take a sick day, make it to their son or daughter’s basketball tournament, or attend the concert they’ve been eagerly awaiting.
Scheduled overrides are similar to take on-call functionality, but they allow your users to request on-call coverage for longer planned absences. If you’re going to Jamaica for two weeks, you’ll need someone to make sure your application or service is healthy while you’re away. Scheduled overrides should be controlled by the manager of your on-call system, ensuring they’re strategically assigned to users when necessary.
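Both take on-call and scheduled overrides amount to the same lookup: check whether someone else is covering the scheduled responder at a given moment. Here is a minimal sketch, assuming overrides are stored as simple time windows; the `Override` type, `effective_on_call`, and the names and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Override:
    covered: str     # person who is away
    covering: str    # person taking over the shift
    start: datetime  # inclusive
    end: datetime    # exclusive

def effective_on_call(scheduled, when, overrides):
    """Return who is actually on duty once overrides are applied."""
    for override in overrides:
        if override.covered == scheduled and override.start <= when < override.end:
            return override.covering
    # No matching override: the scheduled responder stays on duty.
    return scheduled

# Hypothetical two-week vacation covered by a teammate.
overrides = [
    Override("alice", "bob", datetime(2024, 7, 1), datetime(2024, 7, 15)),
]

print(effective_on_call("alice", datetime(2024, 7, 5, 3, 0), overrides))
```

A short-notice take on-call is just an override whose window starts now, which is why the two features can share one data model.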
Another great way to make on-call suck less is to let on-call responders customize how they receive notifications based on time of day and day of the week. Make sure they have multiple notification methods available, but let them choose which method applies at specific times. Alerts should be able to come through via SMS, phone call, a chat tool (e.g. Slack or HipChat), or an incident management mobile app, whichever method the on-call responder prefers. Custom paging policies allow members of your team to handle on-call responsibilities their way.
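A custom paging policy can be sketched as a list of rules keyed on weekday and time of day, with a fallback for everything else. The example below assumes rules do not wrap past midnight and that the first matching rule wins; `PagingRule`, `methods_for`, and the method labels are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass
class PagingRule:
    days: set      # weekday numbers, Monday=0 ... Sunday=6
    start: time    # inclusive
    end: time      # exclusive; rules in this sketch do not wrap past midnight
    methods: list  # notification methods, in the order to try them

def methods_for(rules, when, default=("sms", "phone call")):
    """Return the notification methods a responder chose for this moment."""
    for rule in rules:
        if when.weekday() in rule.days and rule.start <= when.time() < rule.end:
            return list(rule.methods)
    # Outside every rule, fall back to the most intrusive methods.
    return list(default)

# Hypothetical preference: quiet channels at work, loud ones otherwise.
rules = [
    PagingRule({0, 1, 2, 3, 4}, time(9, 0), time(18, 0), ["chat", "push notification"]),
]

print(methods_for(rules, datetime(2024, 1, 3, 10, 0)))  # a Wednesday morning
```

Defaulting to SMS and a phone call outside the listed windows is a design choice in this sketch: if a responder has not said otherwise, the page should be hard to miss.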
As much as DevOps teams want to automate incident response and remediation, people still ultimately need to fix the problems. Automation can help you optimize on-call schedules, alert routing rules, and escalation rules, but understanding your team’s workflow is the most essential piece of building an on-call system.
Ask your team what you can do to help them. What are they running into while they’re on-call that’s slowing down incident response or resolution? Find these on-call pain points, then establish tools or workflows to improve team collaboration or help people find the information they need faster. Being on-call sucks for DevOps teams when they feel they’ve been left out in the cold.
Don’t tell your team how to handle on-call; let them tell you. Listening to your on-call team is by far the best way to find improvements to on-call incident response and build more reliable products for your customers.
If you can, schedule on-call teams and rotations based on geography. The more you can limit someone being on-call at 3 AM the better. This helps your employees sleep better while also ensuring someone awake and aware is responding to the incident.
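Geography-based scheduling, often called follow-the-sun, comes down to asking who is in waking hours when an alert fires. Here is a minimal sketch that assumes fixed UTC offsets (so it ignores daylight saving time) and a waking window of 08:00 to 20:00 local time; the responder names, offsets, and `awake_responders` function are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def awake_responders(utc_offsets, when_utc, day_start=8, day_end=20):
    """Return responders whose local hour falls inside waking hours."""
    awake = []
    for name, offset_hours in utc_offsets.items():
        # Convert the alert time to the responder's local time.
        local = when_utc.astimezone(timezone(timedelta(hours=offset_hours)))
        if day_start <= local.hour < day_end:
            awake.append(name)
    return awake

# Hypothetical responders with fixed UTC offsets in hours.
responders = {"denver": -7, "london": 0, "sydney": 10}

alert_time = datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc)
print(awake_responders(responders, alert_time))
```

With three regions spaced roughly eight hours apart, most hours of the day have at least one responder in daytime, which is what keeps the 3 AM pages to a minimum.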
Software developers shouldn’t be exempt from on-call responsibilities. In a DevOps culture, everyone from operations to development should take some level of on-call responsibility. By scheduling on-call duties across the entire DevOps team, the whole organization gets deeper exposure to systems in production, helping teams build and maintain more reliable services.
Some of the tools mentioned above, such as manual take on-call and scheduled overrides, help keep on-call flexible. Just make sure you foster a culture of flexibility around on-call. As long as everyone is held accountable and takes their share of on-call time, they should be able to exchange shifts with teammates when possible.
By centralizing on-call schedules with contextual alerts, anyone on your team can get access to the data they need. In case an outage occurs and the incident needs to be escalated across multiple teams and rotations, everyone in the organization should have access to the information they need, when they need it.
Fast recovery is awesome. On-call misery isn’t. Learn more about some of the current pain points of life on-call in our free eBook, The 2016/17 State of On-Call. We had over 800 professionals share their stories to better understand what’s making on-call suck, so we could find out how to make on-call suck less.