Being on-call sucks. Every DevOps or IT operations incident manager has heard or said this at one point or another. But, incidents are inevitable in the new world of CI/CD and rapid development. In order to maintain highly reliable systems, being on-call is an essential part of DevOps and IT operations workflows. So, we’ve written this post to be your on-call template, a guide dedicated to making on-call suck less for your employees and improving overall incident response and remediation workflows.
From the basics of building an on-call shift schedule to leveraging an automated rules engine to route, escalate and transform alerts, the on-call template helps you understand it all. But it’s important to note that no single on-call template will work for every team. So, keep in mind that the template we’ve built should only serve as a rough guideline, and you should edit it to fit the needs of your specific DevOps or IT team.
Let’s take a look at some of the basic types of on-call schedules and rotations that can help your team set up an on-call calendar and hate on-call responsibilities just a little bit less:
When building out your on-call calendar, you need to think about two main things:
If you’re working on a small team, your calendar should look a lot different than a larger team’s would. When building on-call schedules, it’s important to constantly balance your time spent on-call responding to an alert against time spent developing new features and services. Continuously ask yourself if your on-call rotations are delivering the most value to your customers while not causing alert fatigue or employee burnout.
Here are a few options:
Individual developers and IT operations people can take on-call responsibilities on a monthly cadence. This schedule works well for a robust architecture that encounters few issues, or when on-call responsibilities are broken down and assigned by microservice or feature within the larger system. It lets people know further in advance when they’ll be on-call and gives them longer breaks between shifts. But monthly schedules typically don’t work for smaller teams or for unstable infrastructure or applications. Alert fatigue and burnout quickly take over when people are over-alerted or take too many on-call shifts, raising MTTA/MTTR and making on-call suck.
Weekly schedules offer a little more flexibility than monthly schedules. Going on-call for a week at a time allows teammates a decent amount of rest time while not putting people on-call for too long. Weekly schedules are a great middle-ground between daily and monthly on-call schedules – and typically work for most businesses – regardless of team size or system architecture.
Daily on-call schedules can work if you have a decent-sized team and a flexible scheduling tool. They offer the most flexibility and make the idea of going on-call less intimidating. However, if you’re trying to keep up with these schedules in a spreadsheet or some other manual process, it can quickly become more work than it’s worth. Another concern with daily schedules is that each person may not get enough exposure to production to help maintain highly distributed systems or complex architecture.
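To make the monthly, weekly and daily options above concrete, here is a minimal sketch of a round-robin rotation generator. The function name `build_rotation`, the team names and the dates are all hypothetical, and treating a "monthly" shift as a fixed number of days is a simplifying assumption; a real scheduling tool would also handle time zones, holidays and overrides.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(people, start, shift_days, shifts):
    """Return a list of (person, first_day, last_day) tuples in round-robin order.

    shift_days=1 gives a daily rotation, 7 a weekly one, and roughly 30
    approximates a monthly one (an assumption; calendar months vary).
    """
    schedule = []
    rota = cycle(people)  # loop over the team forever, in order
    current = start
    for _ in range(shifts):
        last_day = current + timedelta(days=shift_days - 1)
        schedule.append((next(rota), current, last_day))
        current = last_day + timedelta(days=1)  # next shift starts the day after
    return schedule


team = ["ana", "ben", "chloe"]  # hypothetical team members
weekly = build_rotation(team, date(2024, 1, 1), shift_days=7, shifts=4)
for person, first_day, last_day in weekly:
    print(f"{person}: {first_day} to {last_day}")
```

Changing only `shift_days` lets you compare how often each person comes up under the different cadences before committing to one.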
Flexibility and customization are the keys to creating effective on-call rotations and schedules. Typically, some combination of the schedule types above is the best way to approach on-call. No matter which schedule is right for your team, flexibility is essential to balancing service reliability with employee welfare. Allowing employees to switch on-call shifts, snooze alerts or customize notification methods can go a long way in making on-call a little bit better for your team.
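As one illustration of that flexibility, here is a rough sketch of a shift swap. It assumes a schedule is stored as a list of `(person, first_day, last_day)` tuples, which is a hypothetical representation rather than any particular tool's data model. Only the people are exchanged and the dates stay in place, so coverage is unchanged.

```python
from datetime import date


def swap_shifts(schedule, i, j):
    """Return a new schedule with the people on shifts i and j exchanged.

    The shift dates are left untouched, so every day that was covered
    before the swap is still covered afterwards.
    """
    swapped = list(schedule)  # copy so the original schedule is not mutated
    person_i, first_i, last_i = schedule[i]
    person_j, first_j, last_j = schedule[j]
    swapped[i] = (person_j, first_i, last_i)
    swapped[j] = (person_i, first_j, last_j)
    return swapped


# Hypothetical weekly schedule
schedule = [
    ("ana", date(2024, 1, 1), date(2024, 1, 7)),
    ("ben", date(2024, 1, 8), date(2024, 1, 14)),
    ("chloe", date(2024, 1, 15), date(2024, 1, 21)),
]
updated = swap_shifts(schedule, 0, 2)  # ana and chloe trade weeks
for person, first_day, last_day in updated:
    print(f"{person}: {first_day} to {last_day}")
```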
Every team needs to think about three main things when building an on-call template. One, how reliable is your current service? Two, who’s going on-call? And three, how big is your on-call team going to be? The answers to these questions will help define much of the way you set up an on-call calendar.
No two teams are the same, so a one-size-fits-all template for on-call responsibilities simply doesn’t exist. But, by understanding the benefits and concerns associated with different on-call strategies, you can build a template for your own team that allows you to continuously deploy new features and maintain reliable services.
Constantly balancing team morale, on-call coverage, overall service reliability, customer experience and outage communication helps teams continuously improve the resilience of their systems. No matter which type of on-call schedule you decide to implement, every team shares some common concerns.
Is the on-call team getting enough exposure to the systems they’re responsible for to properly remediate or escalate incidents in a timely manner? Is the schedule flexible enough to optimize employee happiness while always maintaining coverage? Are you allowing enough rest time between on-call shifts to ensure your team avoids burnout and approaches alerts with a fresh mind? As your system changes and grows over time, is your on-call approach still the best for the current setup?
Asking these types of questions will help you constantly iterate on your processes and tooling, optimizing the on-call template in a way that makes both customers and employees happier.
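The rest-time question above is one you can check automatically. The sketch below flags anyone whose break between consecutive shifts is shorter than a chosen threshold. The `(person, first_day, last_day)` tuple layout, the function name `rest_violations` and the 14-day threshold are illustrative assumptions, not a recommendation for what counts as enough rest.

```python
from datetime import date


def rest_violations(schedule, min_rest_days):
    """Return (person, rest_days) for every gap shorter than min_rest_days.

    rest_days counts the full days off between the end of one shift and
    the start of the same person's next shift.
    """
    last_end = {}  # person -> last day of their most recent shift
    violations = []
    for person, first_day, last_day in sorted(schedule, key=lambda s: s[1]):
        if person in last_end:
            rest_days = (first_day - last_end[person]).days - 1
            if rest_days < min_rest_days:
                violations.append((person, rest_days))
        last_end[person] = last_day
    return violations


# Hypothetical two-person weekly rotation
schedule = [
    ("ana", date(2024, 1, 1), date(2024, 1, 7)),
    ("ben", date(2024, 1, 8), date(2024, 1, 14)),
    ("ana", date(2024, 1, 15), date(2024, 1, 21)),
]
print(rest_violations(schedule, min_rest_days=14))
```

Running a check like this whenever the schedule changes is one way to revisit the question as the team and system grow.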
Creating a general outline for on-call and templatizing your team’s process for incident response will lead to greater workflow efficiency. But, remaining flexible with your on-call template is important for employee happiness and overall incident responsiveness. As long as the entire team understands the responsibilities and expectations around being on-call, and buys into the process you’ve built, you can continue to make the system more robust. Then, when new people join the team, they can quickly see the on-call process and start adding value faster.
Incident response is more important than ever before. Development teams are building highly integrated services and deploying faster. This velocity and system integration naturally lead to incidents. It’s naive to assume your services will never encounter downtime or suffer some sort of problem. Therefore, making an on-call template and plan becomes essential for any successful DevOps or IT operations team. Before you can make on-call suck less across your entire organization, you’ll first need to develop a plan.
An integrated IT alerting and on-call scheduling solution will surface alert context faster and improve team collaboration, helping shorten the incident lifecycle and speed up incident response. And that’s exactly why we built VictorOps.