VictorOps is now Splunk On-Call.
In the modern era of application delivery and rapid deployment, both developers and IT professionals need to take on-call responsibilities for the services they build and maintain. Monitoring and alerting need to encompass the entire stack. From product development to managing systems in production, you need a holistic incident response and on-call strategy that allows you to quickly identify issues and avoid downtime.
However, in DevOps and IT, organizational structures are changing as quickly as the technical applications and infrastructure behind them. Because of this pace of change, teams have a tough time standardizing on-call schedules. So, a number of teams decide to change schedules and update on-call rotations on a weekly or monthly basis – even when they maintain calendars in simple spreadsheets. However, this method of constant on-call adjustments doesn’t scale as your team grows, and it doesn’t give you visibility into how often different individuals are handling different issues.
Standardizing on-call schedules allows you to set incident management KPIs and benchmarks for success. Also, it avoids confusion – the team gets shared visibility into the on-call calendar so they know who’s on-call at any given time. And, last but not least, when employees know exactly when they’ll be on-call, it brings them peace of mind and allows them to schedule personal events. A standardized on-call calendar leads to a more prepared team while simultaneously making on-call suck less.
At first, it might seem more flexible to set up the upcoming week’s on-call schedule on a weekly cadence. But this approach is actually more rigid and can be unfair to individual team members. In fact, setting up schedules every week takes more time than creating a standardized rotation once. If a teammate can’t take an on-call shift, they should simply switch with another person who’s available. Even if you’re a smaller team, if you have more than one person on-call for a given application or service, you have enough people to set a flexible, standardized on-call rotation.
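To make the idea of a standardized rotation concrete, here is a minimal sketch of a fixed round-robin schedule. It assumes weekly shifts and a rotation anchored to a known start date; the function name `on_call_for` and the team members are hypothetical and not part of any VictorOps API.

```python
from datetime import date

def on_call_for(day: date, team: list[str], start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day` in a fixed round-robin rotation.

    `start` is the first day of the first shift; each shift lasts `shift_days`.
    """
    if not team:
        raise ValueError("team must not be empty")
    if day < start:
        raise ValueError("day is before the rotation start")
    shifts_elapsed = (day - start).days // shift_days
    return team[shifts_elapsed % len(team)]

# Hypothetical three-person team with a rotation starting on a Monday.
team = ["alice", "bob", "carol"]
rotation_start = date(2024, 1, 1)

second_week = on_call_for(date(2024, 1, 10), team, rotation_start)
fourth_week = on_call_for(date(2024, 1, 22), team, rotation_start)
```

Because the schedule is a pure function of the date, anyone can work out who is on-call months ahead without someone rebuilding the calendar each week.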
Let’s take a look at what on-call means to DevOps and IT teams and a few best practices for standardizing on-call schedules and rotations.
In general, something standardized is better than nothing. With the rise of on-call scheduling and intelligent alert routing tools like VictorOps, you can easily make rotation changes without losing any coverage. Even if you’re using a spreadsheet for maintaining your on-call calendar, you just need a basic template or outline of the on-call coverage required for each of your services.
You can organize on-call rotations by the team (backend, frontend, data, DevOps, etc.) or by the service. Much of how you decide to set up your on-call rotations depends on the product you support and how your team’s structured. Either method works – you just need to ensure the system you adopt maintains adequate coverage for the entire system.
What happens if the primary on-call user doesn’t acknowledge an alert? No matter the reason, if the first person on-call can’t acknowledge an alert notification, there needs to be a secondary on-call rotation that goes into effect. Depending on the size and structure of your team, this could be anything from automatically paging the person who’s scheduled for the next on-call shift to paging out to a completely different secondary on-call rotation.
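One way to think about a secondary rotation is as an ordered list of escalation steps, each with a delay. The sketch below is illustrative only: the `EscalationStep` type, the target names, and the delays are assumptions, not the configuration format of any particular tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgment before this step fires
    target: str         # hypothetical user or rotation name

def paged_targets(steps: list[EscalationStep], minutes_unacknowledged: int) -> list[str]:
    """Return every target that should have been paged so far, in firing order."""
    ordered = sorted(steps, key=lambda step: step.after_minutes)
    return [step.target for step in ordered if step.after_minutes <= minutes_unacknowledged]

# Hypothetical policy: primary immediately, secondary after 10 minutes,
# and a manager after 25 minutes with no acknowledgment.
policy = [
    EscalationStep(0, "primary-on-call"),
    EscalationStep(10, "secondary-on-call"),
    EscalationStep(25, "engineering-manager"),
]
```

With this policy, an alert that sits unacknowledged for 12 minutes has reached both the primary and the secondary rotation, but not yet the manager.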
If you have multiple offices or a lot of remote employees, it could be a good idea to standardize on-call schedules based on geography. You can rotate schedules based on location and ensure people spend less time on-call throughout the night. Geography-based, follow-the-sun rotations lead to happier employees while ensuring full coverage.
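A follow-the-sun rotation can be sketched as a lookup from the current UTC hour to the region that owns that window. The three regions and their eight-hour windows below are assumptions chosen for illustration; real handoff times depend on where your teams actually sit.

```python
from datetime import datetime, timezone

# Hypothetical handoff windows as UTC hours [start, end); together they cover 24 hours.
REGIONS = [
    ("apac", 0, 8),
    ("emea", 8, 16),
    ("americas", 16, 24),
]

def region_on_call(moment: datetime) -> str:
    """Return the region responsible at `moment` (must be timezone-aware)."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    for name, start, end in REGIONS:
        if start <= hour < end:
            return name
    raise LookupError(f"no region covers UTC hour {hour}")

morning = region_on_call(datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc))
late = region_on_call(datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc))
```

Each region then runs its own standardized rotation inside its window, so nobody is routinely paged in the middle of their night.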
By standardizing on-call schedules, you’re bringing more predictability to how people work with your systems. People also know who’s available at certain times, leading to more reliable escalation processes. In DevOps and IT, any additional workflow predictability will lead to deeper reliability across all systems.
In line with the need for predictability in on-call operations, schedules should repeat and handoff meetings should be held on a regular cadence. There should be some sort of daily, weekly, or monthly on-call handoff meeting every time a new user takes an on-call shift for a service. Current on-call users can pass information to the next person taking over that shift and ensure they’re aware of any current issues. Knowing when on-call changes happen and having a plan to ensure a seamless transition is essential for any high-functioning on-call DevOps or IT team.
On-call operations are all about balancing employee welfare with service reliability and on-call coverage. In order to do this, you need to take an organized approach to on-call incident management and response. So, let’s take a look at helpful day-to-day tips that can make on-call suck less for everyone involved.
In the modern era of software development and IT, robust mobile applications are essential to rapid incident response. You can set on-call schedules and see rotations for your own team and other teams. With a mobile app, on-call responders can see alert context in real-time and communicate with other teammates while on-the-go. You can see whether alerts are urgent or actionable, communicate around resolutions and easily reroute alerts directly from the mobile app. Instead of simple alerts via SMS, teams are able to see more of an incident’s details right off the bat.
Escalations and the process for routing alerts will be key to making on-call suck less. You can use automated escalations and alert routing rules to silence self-healing systems and get alerts to the right person at the right time. Alongside standardized on-call schedules, people know who should be handling which issues and can reroute incidents to the right person or team. Standardized on-call rotations integrated with intelligent alert automation and escalations can get notifications to the right person the first time – leading to faster incident response and remediation.
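As a rough illustration of routing rules, the sketch below suppresses alerts that have already recovered on their own and sends everything else to the team that owns the service. The alert fields (`service`, `self_healed`), the route table, and the team names are all hypothetical, not the schema of any real alerting tool.

```python
from typing import Optional

# Hypothetical mapping from service name to the rotation that owns it.
ROUTES = {
    "checkout-api": "backend-on-call",
    "web-frontend": "frontend-on-call",
    "etl-pipeline": "data-on-call",
}

def route_alert(alert: dict, routes: dict[str, str], default: str = "ops-catch-all") -> Optional[str]:
    """Return the rotation to page, or None if the alert should stay silent."""
    if alert.get("self_healed", False):
        return None  # the system recovered on its own, so nobody is paged
    return routes.get(alert["service"], default)

paged = route_alert({"service": "checkout-api"}, ROUTES)
silenced = route_alert({"service": "etl-pipeline", "self_healed": True}, ROUTES)
```

A catch-all default keeps alerts for unmapped services from being dropped silently, which matters more than getting every route right on day one.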
It’s easier to make minor on-call shift changes instead of completely restructuring your on-call calendar when people have unexpected absences. With functionality like manual take on-call in VictorOps, it’s easy for one person to simply take responsibility for another person’s shift. This allows for greater flexibility in on-call operations without completely moving away from your standardized rotations. Employees then feel empowered to share accountability for the services they build and don’t feel restricted from taking time off.
If a longer-term planned absence is coming up, you can easily schedule overrides. So, you don’t need to make a new on-call schedule every single week – simply substitute a different user for a shift whenever they’re planning on being out of the office for a while.
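Scheduled overrides can be modeled as a thin layer on top of the standard rotation: the base schedule stays untouched, and a substitute is swapped in only for the affected dates. This is a minimal sketch with hypothetical names (`Override`, `apply_overrides`), not the way any specific product stores overrides.

```python
from datetime import date
from typing import NamedTuple

class Override(NamedTuple):
    start: date       # first day of the absence
    end: date         # last day of the absence, inclusive
    original: str     # person normally scheduled
    substitute: str   # person covering for them

def apply_overrides(scheduled: str, day: date, overrides: list[Override]) -> str:
    """Return who is actually on call on `day`, given the normally scheduled person."""
    for override in overrides:
        if override.original == scheduled and override.start <= day <= override.end:
            return override.substitute
    return scheduled

# Hypothetical planned absence: bob is out for two weeks and carol covers.
overrides = [Override(date(2024, 3, 4), date(2024, 3, 15), "bob", "carol")]

during_absence = apply_overrides("bob", date(2024, 3, 8), overrides)
after_absence = apply_overrides("bob", date(2024, 3, 18), overrides)
```

Once the absence ends, the schedule falls back to the standard rotation without anyone having to edit it.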
Last but not least, on-call users in VictorOps can choose how they’re paged. You can adjust this based on time of day and method of notification. So, you can set up different notification policies for different times of the day. For example, you can receive email alerts during business hours but get mobile app push notifications outside of work hours. Being able to change how you’re notified at different times can make a big difference in the quality of life for on-call responders.
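The email-versus-push example above amounts to a small time-of-day rule. The sketch below assumes business hours of 09:00 to 17:00 in the responder's local time; the function name and the method labels are illustrative only.

```python
from datetime import time

def notification_method(
    now: time,
    business_start: time = time(9, 0),
    business_end: time = time(17, 0),
) -> str:
    """Pick a paging method from the responder's local time of day."""
    if business_start <= now < business_end:
        return "email"  # at a desk, so a low-interruption channel is enough
    return "push"       # away from the desk, so use a mobile push notification

mid_morning = notification_method(time(10, 0))
late_evening = notification_method(time(22, 30))
```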
A standardized template for on-call rotations also helps managers. They have more visibility into on-call shifts and can see who’s responding to issues more than others. If a manager sees that one engineer gets stuck with 3x more alerts at different times of the day or week, then maybe they need to stagger the people who are put on-call for that specific shift. This helps avoid on-call burnout and alert fatigue – making employees’ lives better without hindering incident response time.
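The kind of imbalance described above can be spotted with a simple count of alerts per responder. The sketch below flags anyone handling at least three times the team's median; the log format (pairs of responder and alert ID) and the threshold are assumptions for illustration.

```python
from collections import Counter
from statistics import median

def overloaded_responders(alert_log: list[tuple[str, str]], factor: float = 3.0) -> list[str]:
    """Return responders whose alert count is at least `factor` times the median."""
    counts = Counter(responder for responder, _alert_id in alert_log)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(name for name, n in counts.items() if n >= factor * typical)

# Hypothetical log: alice handled 9 alerts, bob 3, carol 2.
alert_log = (
    [("alice", f"a{i}") for i in range(9)]
    + [("bob", f"b{i}") for i in range(3)]
    + [("carol", f"c{i}") for i in range(2)]
)

flagged = overloaded_responders(alert_log)
```

The median is used rather than the mean so that one heavily loaded engineer does not raise the baseline and hide their own overload.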
Also, a standardized schedule ensures there are no gaps in coverage. Constant restructuring of on-call calendars can be too much change in operations. Too much change in any developer or IT workflow can potentially result in a lack of coverage or visibility. And, when new teammates join the team, it’s easier to train them and onboard them into the current on-call structure. They can easily shadow whoever’s on-call and learn what’s expected and some of the tools at their disposal.
Standardized on-call calendars and an organized incident response process improve the scalability of on-call teams and the overall reliability of the system. Fewer moving parts lead to less confusion. It’s easier to templatize the overall on-call process and give users the flexibility to make small changes on the fly. Instead of building out on-call calendars, developers and IT professionals can spend time deploying new services and giving customers better experiences.
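Checking a calendar for coverage gaps is a small interval problem. The sketch below assumes shifts are given as (start, end) pairs in hours since the start of the period; the function name and the example shifts are hypothetical.

```python
def coverage_gaps(
    shifts: list[tuple[int, int]], period_start: int, period_end: int
) -> list[tuple[int, int]]:
    """Return the uncovered (start, end) intervals within the period."""
    gaps = []
    cursor = period_start  # everything before the cursor is known to be covered
    for start, end in sorted(shifts):
        if start > cursor:
            gaps.append((cursor, min(start, period_end)))
        cursor = max(cursor, end)
        if cursor >= period_end:
            break
    if cursor < period_end:
        gaps.append((cursor, period_end))
    return gaps

# Hypothetical three-day window (72 hours) with nobody scheduled for hours 40 to 48.
shifts = [(0, 24), (24, 40), (48, 72)]
gaps = coverage_gaps(shifts, 0, 72)
```

Running a check like this whenever the rotation or its overrides change catches holes before an alert lands in one.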
Learn about creating a holistic on-call incident management process in our free eBook, The Incident Management Buyer’s Guide. See how collaborative incident management solutions make on-call suck less and reduce MTTA and MTTR over time.