VictorOps is now Splunk On-Call.
Incidents are inevitable. Effective DevOps and IT teams aren’t just proactively addressing the reliability of technical systems, they’re preparing for unforeseen circumstances. You can’t continuously deliver new features and services without taking on some unknown risks. So, how can the on-call team ensure full coverage and better prepare themselves for incident response without creating a culture of burnout and alert fatigue?
Complex systems have more potential for frequent alerts and failure. And, complex organizational structures on top of complex systems can create blind spots in on-call coverage – doubling down on a lack of transparency and service reliability. Without visibility into on-call schedules, rotations and teams, responders don’t know who owns which services at a given time. And, ultimately, end-users suffer poor experiences and extended amounts of downtime.
So, we put together a comprehensive template for on-call scheduling in DevOps and IT. Standardizing on-call schedules around a pre-designed template can reduce blind spots around errors and outages without causing on-call burnout.
The incident management lifecycle consists of five phases: detection, response, remediation, analysis and preparation. Being on-call applies to the first three phases of the incident management lifecycle. Detection leads to the initial alert sent to the first on-call responder. Response is the process of triaging and investigating the incident. And, remediation is the actual process used to fix the issue.
With on-call schedules integrated with monitoring and alerting tools, alongside collaboration tools like Slack or on-demand conference calls, DevOps and IT professionals are able to diagnose and remediate incidents faster. On-call teams can use the following basic tactics to ensure alerts are sent to the right person at the right time.
Instead of setting new on-call schedules each week, try to develop a standardized cadence. This way, you can simply swap users and teams out of specific schedules when there are short-term scheduling conflicts with PTO or unforeseen family emergencies. By spending more time standardizing on-call schedules and rotations up front, you’re able to think more about the entire system and how to ensure coverage across all applications and infrastructure. It’s much faster to make short-term changes to on-call shifts than to completely restructure on-call rotations on a weekly/monthly basis.
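A standardized cadence with short-term swaps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the roster names, the weekly shift length, the rotation start date and the override format are all hypothetical and don't reflect how any particular on-call tool stores its schedules.

```python
from datetime import date

# Hypothetical roster and rotation settings (assumptions for illustration only).
ROSTER = ["alice", "bala", "chen"]
ROTATION_START = date(2024, 1, 1)  # a Monday; the first shift belongs to ROSTER[0]
SHIFT_LENGTH_DAYS = 7

# Short-term swaps (PTO, emergencies): (first day, last day, substitute), inclusive.
OVERRIDES = [
    (date(2024, 1, 10), date(2024, 1, 12), "chen"),
]


def on_call_for(day: date) -> str:
    """Return who is on call on `day`, checking overrides before the base rotation."""
    for first, last, substitute in OVERRIDES:
        if first <= day <= last:
            return substitute
    # The base rotation never changes; it simply cycles through the roster.
    shifts_elapsed = (day - ROTATION_START).days // SHIFT_LENGTH_DAYS
    return ROSTER[shifts_elapsed % len(ROSTER)]
```

The base rotation stays fixed, and a conflict only adds one entry to `OVERRIDES`, which is the "swap instead of restructure" idea described above.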
Once you’ve built out your on-call teams, schedules and rotations, you need to ensure cross-team transparency and access to the on-call calendar. How can your teammates easily see who’s on-call for which service? Also, can teammates easily see the next time they’re going to be on-call? The calendar needs to be visible to everyone so that on-call responders can quickly understand how they can escalate issues or who they can communicate with during a 3 AM firefight.
Once people know the ins and outs of the people side of on-call schedules and calendars, you can start integrating technical systems with those schedules. At this point, you’ll define best practices for escalation in the on-call process and stick to them. Through a combination of automatic escalations (either time-based or service-based) and manual escalations, you can get issues to the right person faster.
For instance, if an incident goes unacknowledged for 5 minutes, it should automatically escalate to the next level. This way, you’ll ensure alerts don’t get dropped for long periods of time and that on-call responders jump onto issues quickly.
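Here's a minimal sketch of that time-based escalation rule. The level names and the function are hypothetical, and the 5-minute window comes from the example above; a real policy would likely use different timeouts per level.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical escalation levels, in the order they get paged.
ESCALATION_CHAIN = ["primary", "secondary", "engineering-manager"]
ESCALATE_AFTER = timedelta(minutes=5)


def current_level(
    triggered_at: datetime,
    now: datetime,
    acknowledged_at: Optional[datetime] = None,
) -> str:
    """Return which escalation level should hold the incident at `now`."""
    # An acknowledgement freezes escalation at whatever level had been reached.
    effective = now if acknowledged_at is None else min(now, acknowledged_at)
    # Each full 5-minute window without an acknowledgement moves up one level.
    steps = max(0, (effective - triggered_at) // ESCALATE_AFTER)
    # The last level keeps the incident once the chain is exhausted.
    return ESCALATION_CHAIN[min(steps, len(ESCALATION_CHAIN) - 1)]
```

For example, an incident triggered at 3:00 AM and still unacknowledged at 3:05 AM would sit with `"secondary"`, while one acknowledged at 3:02 AM stays with `"primary"` no matter how much time passes afterward.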
Alerting is only as good as the underlying monitoring. With comprehensive monitoring, you can detect anomalies and errors across all your applications and infrastructure. Based on the logs, metrics, traces and events collected, you can set thresholds to notify on-call responders in case of emergency. Then, you can implement automation in the alerting workflow to change the way alerts are routed through your on-call system and the information attached to said alerts.
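To make the threshold-and-routing idea concrete, here's a small sketch. The metric names, threshold values and team names are invented for illustration; in practice these rules would live in your monitoring or alerting tool rather than in application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    metric: str       # name of the monitored metric
    threshold: float  # page when the observed value exceeds this
    route_to: str     # on-call team that should receive the alert


# Hypothetical rules; the numbers are placeholders, not recommendations.
RULES = [
    Rule("error_rate", 0.05, "backend-on-call"),
    Rule("p95_latency_ms", 800.0, "frontend-on-call"),
]


def route(metric: str, value: float) -> Optional[str]:
    """Return the team to page if `value` crosses the metric's threshold, else None."""
    for rule in RULES:
        if rule.metric == metric and value > rule.threshold:
            return rule.route_to
    return None
```

With these placeholder rules, `route("error_rate", 0.12)` would page `"backend-on-call"`, while a healthy reading or an unknown metric returns `None` and pages no one.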
DevOps and IT teams should be appending applicable information and remediation instructions to on-call notifications. This way, when an on-call responder first gets an alert, they can see exactly what’s wrong and leverage runbooks or wikis to quickly jump into remediation steps. With charts, logs, metrics and events attached to the alert, the on-call responder can also see if additional services are affected and loop in the appropriate people.
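Appending context to an alert can be as simple as a lookup before the notification goes out. The sketch below assumes alerts are plain dictionaries with a `service` key and that runbooks are kept in a mapping; both the field names and the example.com URL are placeholders.

```python
# Hypothetical mapping from service name to runbook location (placeholder URL).
RUNBOOKS = {
    "checkout-api": "https://wiki.example.com/runbooks/checkout-api",
}


def annotate(alert: dict) -> dict:
    """Return a copy of `alert` with a runbook link attached when one is known."""
    enriched = dict(alert)  # copy so the original alert payload is left untouched
    runbook = RUNBOOKS.get(alert.get("service"))
    if runbook is not None:
        enriched["runbook_url"] = runbook
    return enriched
```

The same pattern extends to dashboards, recent deploys or log queries: each is another field added to the alert before it reaches the responder.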
All of your alerting and on-call scheduling functionality should be built around a system for collaboration. Without the ability to chat with other teams, jump into a conference call or spin up a Slack channel to discuss the incident, with context, you spend additional time simply navigating multiple tools. Built-in collaboration allows cross-functional teams to see exactly what’s going wrong with disparate services and come together to find solutions.
Without any kind of analytics after the fact, you can’t accurately conduct post-incident reviews and learn from your mistakes. Holistic reporting allows you to see the full on-call process, from incident detection and monitoring metrics to the real-time communication and firefighting that took place. You should see how people, processes and technology truly interact during an on-call incident, so you can take real steps toward a better incident management and response system.
Because every team is built a little differently, it’s hard to define specific types of on-call schedules. But, there are a few commonalities that all teams should think about when structuring on-call schedules and alerting processes:
How do you want to structure on-call schedules? Do you want to do it based on the specific team in charge of that type of service (e.g. data, frontend, backend, mobile, etc.)? Or, do you do it based on the disparate services or applications being supported by the entire team? Depending on the type of product you’re building and the overall organizational structure, either method will work. But, you need to decide which is right for your specific business. Sometimes a combination of the two will work, but remember that too much complication in the on-call structure can create more problems than it solves.
Every on-call schedule needs a backup schedule in case anything goes wrong. These backups can either be built through a completely secondary calendar or with multiple detailed escalation policies. If an initial responder doesn’t acknowledge an alert, there needs to be a backup plan. Some options for backups could also be paging the person who’s scheduled for the next on-call shift or notifying an on-call user from a completely different team.
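One way to picture those backup options is as an ordered list of people to notify. This sketch assumes a simple roster list and a known on-call user from another team; the function and all names are hypothetical.

```python
from typing import List


def backup_order(roster: List[str], current_index: int, other_team_on_call: str) -> List[str]:
    """Return who to notify, in order, if each previous person fails to acknowledge."""
    primary = roster[current_index % len(roster)]
    # First backup: the person scheduled for the next on-call shift.
    next_up = roster[(current_index + 1) % len(roster)]
    # Last resort: the on-call user from a completely different team.
    order = [primary, next_up, other_team_on_call]
    # Drop duplicates while keeping order (e.g. a one-person roster).
    return list(dict.fromkeys(order))
```

Calling `backup_order(["alice", "bala", "chen"], 2, "dana")` wraps around the roster, so the person after the last shift is the first person in the list.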
If your team is large enough or spread out across the globe, geography-based (follow-the-sun) rotations work well. This way, you don’t need to put very many people on-call throughout the night. Instead, you can take advantage of global engineering resources to ensure 24/7 coverage without interrupting employees’ sleep.
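A follow-the-sun rotation boils down to mapping the current time to whichever region is in its working day. The region names and UTC hour windows below are illustrative assumptions, not a recommended split.

```python
from datetime import datetime, timezone

# Hypothetical regions and the UTC hours each covers (start inclusive, end exclusive).
REGIONS = [
    ("apac", 0, 8),
    ("emea", 8, 16),
    ("amer", 16, 24),
]


def region_on_call(now: datetime) -> str:
    """Return the region that should be on call at `now` (a timezone-aware datetime)."""
    hour = now.astimezone(timezone.utc).hour  # normalize to UTC before comparing
    for name, start, end in REGIONS:
        if start <= hour < end:
            return name
    raise ValueError("REGIONS does not cover every hour of the day")
```

Because everything is normalized to UTC, it doesn't matter which time zone the alert timestamp arrives in; an alert at 3:00 AM UTC would land with the hypothetical `"apac"` team, who are in the middle of their day.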
With predictable on-call schedules, the team knows who’s on-call and when. And, with improved visibility into on-call calendars, they can easily check the schedule for themselves. In a world of responding to unknown unknowns, people operations in on-call are one of the few areas where you can add predictability. More predictability in on-call incident response will ease the minds of people on-call and help them better understand the process.
On-call responsibilities really don’t have to suck too much. Through constant iterations of schedules and learning from post-incident reviews, you can take action to make employees’ lives better without putting reliability in the back seat. A centralized system for on-call schedules, monitoring and alerting context, and collaboration will drive efficient real-time on-call operations. Software developers and IT teams can work out of a single tool, increasing cross-team visibility into problems and helping teams avoid silos.
Learn more about using a centralized solution for on-call scheduling, alert routing and real-time collaboration during a firefight. Sign up for a 14-day free trial of VictorOps or schedule a personalized demo to understand exactly how we make on-call suck less.