Applications need to function around the clock. However, assuming your service will never experience downtime is unreasonable—no matter how in-depth your QA is.
Interconnected systems naturally create a level of uncertainty, and that uncertainty is the reason on-call rotations exist. Establishing intelligent rotations means somebody is always available to address a problem, so your site reliability engineers will notice and respond quickly when incidents happen.
If your organization hasn’t traditionally kept engineers on-call, you’ll likely face initial pushback. Over time, though, you’ll get buy-in from engineers as they learn how purposeful on-call rotations nurture a culture of reliability and collaboration, which benefits them too.
If you’re curious how we manage on-call rotations in our own tool, you can see more in this knowledge base article.
The Significance of On-Call Rotations
Simply put, being part of a rotation means you should be available to handle an incident when it pops up. This doesn’t mean on-call has to be horrible. If you’re the lucky engineer, you should wear the title like a badge of honor and be proud that your team trusts you to solve problems quickly. The caveat, of course, is that you shouldn’t be overworked fixing incidents while the rest of your team simply builds and deploys code, then goes home.
Team-wide participation in on-call rotations creates the best environment for collaboration and overall site reliability. When everybody is on-call at one point or another, team members tend to act more responsibly when deploying and maintaining their own code. This, in turn, leads to fewer incidents and more reliable systems. Equally distributed on-call rotations and schedules keep your employees happy, improve site uptime, and ensure engineer coverage at all times.
Creating Meaningful On-Call Rotations
As a manager, you’ll find that setting rotations and schedules requires nuance, so your rotations need a well-thought-out purpose. Meaningful on-call rotations come from establishing rotations and schedules that ensure both site reliability and employee happiness. You need to make sure somebody is on-call at all times without frustrating or burning out employees. Here are a few things to think about when designing your own meaningful on-call rotations and schedules:
Team structure - Establish which people and/or teams should be given on-call responsibilities. Should these individuals and/or teams then be broken down into new, separate teams? Or should the general team structure remain the same? Once you define who’s going to be on-call and how the teams are structured, you can start forming organized on-call teams and rotations.
Tools and communication - Define how your on-call team(s) will communicate, and make sure they understand the tools they need to handle incidents effectively. Use tools that give your engineers visibility into system operations and help them communicate easily. Tools that make schedules and rotations clear for the whole team also surface any schedule changes and help ensure no shifts go uncovered.
Timing and coverage - Your system always needs to work, so somebody always needs to be on-call. Letting employees swap shifts helps keep them from burning out or getting stuck with the worst shifts. Flexible on-call scheduling is also a good way to give employees choices while making sure there’s always somebody on-call.
Escalation policies - Being on-call means you’ll act as the first point of contact when an incident pops up. However, this initial point of contact won’t always be the right person to handle the issue. Define how certain problems should be escalated and which person or team needs to be looped in. Clearly defined escalation policies will get the right people involved only when they need to be.
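To make the ideas in this list a little more concrete, here is a minimal Python sketch of a round-robin rotation with shift swaps and a time-based escalation policy. It is only an illustration, not how any particular on-call tool works: the names `Rotation`, `EscalationStep`, `on_call_at`, and `who_to_page`, the one-week shift length, and the 15- and 30-minute thresholds are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Rotation:
    """Hypothetical round-robin rotation: engineers take turns in order."""
    engineers: tuple[str, ...]
    start: datetime
    shift_length: timedelta
    # Shift swaps: maps a shift index to the engineer covering it instead.
    overrides: dict[int, str] = field(default_factory=dict)

    def on_call_at(self, when: datetime) -> str:
        if when < self.start:
            raise ValueError("rotation has not started yet")
        # Number of whole shifts that have elapsed since the rotation began.
        shift_index = (when - self.start) // self.shift_length
        if shift_index in self.overrides:
            return self.overrides[shift_index]
        return self.engineers[shift_index % len(self.engineers)]


@dataclass(frozen=True)
class EscalationStep:
    """Hypothetical escalation step: page `target` once an incident has
    gone unacknowledged for at least `after`."""
    after: timedelta
    target: str


def who_to_page(steps: list[EscalationStep], unacknowledged_for: timedelta) -> list[str]:
    """Return every target that should have been paged so far, in order."""
    ordered = sorted(steps, key=lambda step: step.after)
    return [step.target for step in ordered if step.after <= unacknowledged_for]


# Example: a three-person weekly rotation where chen covers ben's first week.
rotation = Rotation(
    engineers=("ana", "ben", "chen"),
    start=datetime(2024, 1, 1),
    shift_length=timedelta(days=7),
    overrides={1: "chen"},
)
policy = [
    EscalationStep(after=timedelta(0), target="primary on-call"),
    EscalationStep(after=timedelta(minutes=15), target="secondary on-call"),
    EscalationStep(after=timedelta(minutes=30), target="engineering manager"),
]
```

Because the rotation wraps around by shift index, every engineer gets the same share of shifts, and the `overrides` mapping lets people trade a shift without leaving a gap in coverage. The escalation steps only bring in additional people after the earlier contact has had time to respond.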
Being part of team-wide rotations means you can take pride in your work. You’ll build camaraderie with your team while building better services faster. Running into system issues is inevitable, but creating on-call rotations with meaning can help you diagnose and resolve incidents more effectively while simultaneously making employees happier.
Try out VictorOps with our 14-day free trial to see how we make on-call suck less…