On-call schedules are an essential piece of maintaining sustainable on-call practices, especially if on call is not your SRE team’s only responsibility. Being part of an SRE team doesn’t have to mean you’re only on call. In fact, we like to think of SRE as being tightly integrated into our SDLC. SREs should be developers who also take on-call responsibilities for the products and features they build and maintain.
You can’t assume you’ll never have an incident. In a world of continuous integration and rapid development, you can’t lose focus on the importance of reliability. Because of this, your SRE team needs to be proactive and set up on-call schedules optimizing two major factors:
If your on-call schedules don’t positively influence either factor, it’s likely you should make adjustments.
Logically structured teams, rotations, and schedules can:
Engineering teams are structured in many different ways across organizations. So, of course, on-call schedules and SRE structures will also vary from company to company.
Fostering a culture of DevOps (rapid development, code ownership, and SRE responsibilities for all team members) can help reduce the number of siloed SRE teams that are forced to take on-call schedules 24/7. A DevOps structure for on call and SRE leads to faster incident remediation and a deeper system understanding across the entire team.
There will always be tools on the market to make on call more acceptable for SREs, no matter the structure of your team. But the best way to make on call suck less is to follow some simple operational tips and tricks that help your people.
Building successful SRE teams starts with setting up on-call schedules that support both continuous delivery and system availability. The following tips and tricks will help you structure SRE teams and develop on-call schedules that influence rapid development and help you quickly identify, diagnose, and resolve incidents:
If possible, add on-call engineers and rotations across multiple geographic regions. Having on-call SREs stationed abroad can vastly improve the overall team’s on-call experience. But we don’t suggest siloing full-time on-call responsibilities to any one team in any one location.
Eliminating night shifts as much as possible will improve employee health and happiness, and will actually make incident response more effective. Simply put, engineers think more quickly and coherently when they work on incidents at 10 am than at 3 am. Because of occasional escalation needs, team structures, and similar constraints, multi-site rotations can’t eliminate 2 am wake-up calls entirely, but they make them happen much less often.
Engineers should never feel they’re on call so often that they’re overworked. But you don’t want engineers to feel underworked either. Creating a DevOps culture where every engineer handles SRE and on-call responsibilities will typically solve this problem. Because the people building the product are also your SREs, they will have adequate exposure to code in production (because it’s theirs) and will be able to identify and remediate incidents more quickly.
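As a rough illustration of a multi-site ("follow-the-sun") rotation, the sketch below picks which site should be primary at a given moment. The site names, the fixed UTC offsets (daylight saving is ignored), and the 9-to-17 working window are all assumptions made for the example, not a description of any particular tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sites with fixed UTC offsets (daylight saving ignored for brevity).
SITES = {
    "denver": timezone(timedelta(hours=-7)),
    "london": timezone(timedelta(hours=0)),
    "singapore": timezone(timedelta(hours=8)),
}


def local_hour(site, now_utc):
    """Hour of the day (0-23) at the given site."""
    return now_utc.astimezone(SITES[site]).hour


def daytime_sites(now_utc, start=9, end=17):
    """Sites whose local hour falls inside [start, end)."""
    return [s for s in SITES if start <= local_hour(s, now_utc) < end]


def primary_site(now_utc):
    """Prefer a site in working hours; otherwise pick the one closest to 13:00 local."""
    awake = daytime_sites(now_utc)
    if awake:
        return awake[0]
    return min(SITES, key=lambda s: abs(local_hour(s, now_utc) - 13))


if __name__ == "__main__":
    now = datetime(2024, 1, 15, 3, 0, tzinfo=timezone.utc)
    print(primary_site(now))
```

With these assumed offsets, a short window around midnight UTC is not covered by any site's working hours, which is where the fallback applies. That mirrors the point above: multiple sites reduce off-hours pages but do not remove them.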
After a while, you’ll start to figure out which areas of your application cause more incidents and can set on-call schedules accordingly. Try to establish a balance and give your engineers adequate time to develop, detect and remediate issues, and also conduct thorough post-incident reviews. Structuring your team to hit this balance will keep everyone on their toes and prepared for anything that comes up while they’re on call.
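One simple way to see which areas generate the most incidents, and who is absorbing the pages, is to tally a log of past incidents. The log format, service names, and engineer names below are hypothetical; a real log would come from whatever incident tracker you use.

```python
from collections import Counter

# Hypothetical incident log: (service area, engineer who was paged).
INCIDENT_LOG = [
    ("checkout", "ana"),
    ("checkout", "ben"),
    ("search", "ana"),
    ("checkout", "ana"),
    ("auth", "cy"),
]


def incidents_by_area(log):
    """Service areas ordered from most to fewest incidents."""
    return Counter(area for area, _ in log).most_common()


def pages_by_engineer(log):
    """Number of pages each engineer received."""
    return Counter(engineer for _, engineer in log)


if __name__ == "__main__":
    print(incidents_by_area(INCIDENT_LOG))
    print(pages_by_engineer(INCIDENT_LOG))
```

A lopsided count in either tally is a hint that a rotation or an ownership boundary may need adjusting.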
Everyone across the organization has a responsibility to create reliable applications and systems. If you build something and don’t have confidence in its reliability, you shouldn’t have built it. Spreading SRE tasks across the entire team makes coordinating on-call schedules easier and allows everyone to spend less time on call, while ensuring a high level of reliability.
Define and implement clear escalation policies. Don’t escalate problems that don’t need escalation, and don’t loop in engineers who don’t need to be looped in. A lack of well-established escalation rules can result in operational inefficiencies and take time away from development.
Know when and how to escalate certain issues, and who needs to be added to the escalation path. Our knowledge base article goes into more detail about great tips and tricks for managing multiple escalation policies.
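To make the idea of an escalation path concrete, here is a minimal sketch of a policy expressed as data, with a function that decides who should have been paged so far. The step delays and responder names are illustrative assumptions and do not reflect any specific product's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes unacknowledged before this step fires
    notify: tuple       # hypothetical responder or rotation names


# Illustrative policy: page the primary first, then widen the circle.
POLICY = [
    EscalationStep(0, ("primary-on-call",)),
    EscalationStep(15, ("secondary-on-call",)),
    EscalationStep(30, ("engineering-manager",)),
]


def responders_to_page(policy, minutes_unacknowledged, acknowledged=False):
    """Everyone who should have been paged by now; nobody once the incident is acknowledged."""
    if acknowledged:
        return []
    paged = []
    for step in sorted(policy, key=lambda s: s.after_minutes):
        if minutes_unacknowledged >= step.after_minutes:
            paged.extend(step.notify)
    return paged


if __name__ == "__main__":
    print(responders_to_page(POLICY, 20))
```

Keeping the policy as plain data makes it easy to review who gets pulled in and when, which helps avoid looping in people who don't need to be involved.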
People often think of the individual pieces of incident management. But structuring the process as a whole, from end to end, is the most surefire way to make employees and customers happier.
What are the goals for your incident management? Reducing time to detection? Reducing time to incident resolution? Think about your end goals and work backward. What processes and tools can you implement to help SREs more quickly identify, diagnose, and remediate an incident? Putting procedures in place to organize on-call schedules and incident response will improve your overall service and employee productivity.
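If time to detection and time to resolution are the goals, it helps to measure them the same way every time. The sketch below computes both averages from a small list of incident records. The record fields (`started`, `detected`, `resolved`) and the sample timestamps are made up for illustration, and resolution time is measured here from the start of the incident, which is one of several reasonable conventions.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with made-up timestamps.
INCIDENTS = [
    {
        "started": datetime(2024, 1, 10, 10, 0),
        "detected": datetime(2024, 1, 10, 10, 4),
        "resolved": datetime(2024, 1, 10, 10, 40),
    },
    {
        "started": datetime(2024, 1, 12, 2, 0),
        "detected": datetime(2024, 1, 12, 2, 10),
        "resolved": datetime(2024, 1, 12, 3, 20),
    },
]


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across all incidents."""
    durations = [
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    ]
    return mean(durations)


if __name__ == "__main__":
    print("mean time to detect:", mean_minutes(INCIDENTS, "started", "detected"))
    print("mean time to resolve:", mean_minutes(INCIDENTS, "started", "resolved"))
```

Tracking these numbers before and after a schedule or process change gives you a way to check whether the change actually helped.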
VictorOps incident management can help you structure SRE teams, create organized on-call schedules, and establish actionable escalation policies. Sign up for a 14-day free trial to see how VictorOps improves the on-call lifestyle.