Efficient DevOps and IT teams are constantly getting better at maintaining a CI/CD pipeline and deploying new code quickly – sometimes thousands of times per day. But, without a focus on reliability, the speed at which you release new products and services doesn’t mean much of anything. With great delivery speed comes great on-call responsibility (Sorry – bad Spider-Man reference). In order to help you manage your rotations and facilitate a positive on-call culture, we created this on-call checklist for new teams.
If you’re not willing to take on-call duties and help maintain the services you build and operate, then you won’t be able to meet customer expectations. On-call shouldn’t be a hindrance to the speed required to deliver reliable services and value to customers; it should be complementary. With developers and operations professionals sharing on-call duties, everyone gets deeper exposure to systems in both staging and production – leading to more reliable applications and infrastructure.
Without further ado, let’s dive into the new team on-call checklist:
It’ll become apparent when you’ve built a system complex enough and established a large enough team that on-call rotations are necessary. At this point, you’ll want to start thinking conceptually about how on-call should look for your team. How often does everyone go on-call? Do you need 24/7 coverage? If so, what’s the best rotation to ensure proper on-call coverage without causing alert fatigue?
After asking yourself a few questions about your ideal on-call setup, you should do a quick assessment of your monitoring, alerting and communication processes and tools. What needs to be added or removed in order to make on-call work best for you? Should you look at purpose-built on-call software or build out some kind of homegrown system? For most teams, it’s more efficient and scalable to leverage an existing on-call solution than to lose development time and create a piecemeal alerting and scheduling system.
Once you’ve mind-mapped your on-call setup – from people to processes to tools – you can start to implement this logic in some kind of on-call tool. Because making on-call suck less is what we do, we’ll walk you through a simple on-call checklist, using a basic VictorOps setup as an example.
First things first – on-call administrators need to walk through a separate on-call checklist before anyone else. Setting up the basics of scheduling, alerting and team permissions will be essential to building an efficient system for on-call teammates. Let’s take a look at the process for setting up new on-call teams and some of the functionality that admins need to keep in mind when getting started:
If your company is using an IdP (Identity Provider) such as OneLogin, Okta, or Google, you should set this up with your on-call incident management tool first. This way, logging in becomes standardized across the organization, and admins can manage users and permissions through the SSO provider. Learn more about setting up SSO with VictorOps here.
Because you thought about this beforehand, this step should be pretty easy. You’ll just need to add users who will be taking on-call shifts into the system. Some teams will even add users who aren’t on-call to their incident management software (e.g. product managers, senior management, etc.) in order to improve visibility into system health and on-call operations. Once you have everyone in the system, you’ll be able to start setting up more of the alerting and collaboration logistics. Click here to learn more about adding and removing users from VictorOps.
In SRE, DevOps and IT, on-call alerting and escalation isn’t typically done based on individual users. So, on-call admins also need to go into the system and start creating teams and assigning users to those teams. This way, you can granularly route or escalate incidents to the proper teams or individual users. Users can be associated with multiple teams if necessary in order to achieve total on-call coverage.
Teams can be based upon specific products or services maintained, or the teams can be structured based on engineering disciplines (e.g. platform, data, front-end, etc.) – or a combination of both. Remember to build out your teams through the lens of being on-call and how these team structures can help optimize real-time collaboration. Learn more about creating teams in VictorOps here.
From here, the on-call admin(s) will need to assign user roles and permissions. In VictorOps, you can break down roles into users, alert admins, team admins and global admins. Global admins have no restrictions whereas users have the most limitations. It’s important to properly assign roles and user permissions so everyone has the level of autonomy they need to make the most of your on-call solution. In this knowledge base article, you can learn all about assigning user roles and permissions in an incident management tool.
Now you get into the fun stuff. Before you dove into building out your on-call system, you should’ve already thought about the way your specific on-call rotations and schedules would look. Now, you get to assign on-call rotations, which are recurring schedules with assigned team members. Here’s where you’ll customize who’s on-call and when, decide how often team members go on-call, and establish weekday and weekend shifts. To read more about on-call rotations, click here.
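To make the idea of a recurring rotation concrete, here’s a minimal sketch of a round-robin schedule. It is not how VictorOps computes rotations internally; the function name `on_call_for`, the member names and the seven-day shift length are all assumptions made for illustration.

```python
from datetime import date

def on_call_for(day: date, members: list[str], start: date, shift_days: int = 7) -> str:
    """Return who is on-call on `day` in a simple round-robin rotation.

    Hypothetical helper: `members` take turns in list order, each covering
    `shift_days` consecutive days, beginning on `start`.
    """
    if day < start:
        raise ValueError("day is before the rotation start")
    shift_index = (day - start).days // shift_days  # how many full shifts have passed
    return members[shift_index % len(members)]      # wrap around the team

# Illustrative team and start date (2024-01-01 is a Monday).
members = ["alice", "bob", "carol"]
start = date(2024, 1, 1)
print(on_call_for(date(2024, 1, 10), members, start))  # second week of the rotation
```

Sketching the rotation this way before you configure it makes it easy to check how often each person ends up on-call as the team grows or shrinks.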
It’s important to note that users in a rotation, in VictorOps, aren’t necessarily on-call unless the rotation is associated with an escalation policy. Escalation policies determine which incidents are routed, to whom they are routed, when they are routed and how they are escalated.
So, you’ll want to go in and create some basic escalation policies based on the rotations and on-call schedules you set up just before this. After building some basic escalations, you can add multiple escalation policies to create a more robust on-call structure. This way, you can build redundancies and help streamline your incident management processes as your team grows. Well-adjusted escalation policies can greatly reduce alert fatigue and lead to much faster incident response. Get some more information about setting up escalation policies here.
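Conceptually, an escalation policy is an ordered list of steps, each with a delay and a target. The sketch below models that idea in plain Python. The class, the function and the rotation names are hypothetical and only illustrate the logic, not the VictorOps configuration format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationStep:
    delay_minutes: int  # minutes the incident has gone unacknowledged
    target: str         # rotation or user to page at this step (illustrative names)

def targets_paged(steps: list[EscalationStep], minutes_unacked: int) -> list[str]:
    """Return every target that should have been paged by now, in escalation order."""
    ordered = sorted(steps, key=lambda step: step.delay_minutes)
    return [step.target for step in ordered if step.delay_minutes <= minutes_unacked]

# Assumed example policy: page primary immediately, then escalate twice.
policy = [
    EscalationStep(0, "primary-rotation"),
    EscalationStep(15, "secondary-rotation"),
    EscalationStep(30, "engineering-manager"),
]

print(targets_paged(policy, 20))
```

Writing the delays down like this is a quick way to sanity-check that nobody gets paged too early and that an unacknowledged incident always reaches someone.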
Getting even more into the weeds, you’ll now set up your alert routing keys. These routing keys can assign specific types of alerts to specific teams or users. These routing keys become imperative when you centralize numerous monitoring tools and alerts in one place but still need those alerts sent to the right people at the right time. You can fully customize routing keys and assign escalation policies to each of the routing keys to make sure the correct on-call responders are alerted each time.
As time goes on and the team grows, you can easily reconfigure escalation policies and routing keys – leading to greater on-call flexibility and agility. Find more details about setting up routing keys in VictorOps here.
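The routing logic described above boils down to a lookup from routing key to escalation policies, with a fallback for anything unmatched. Here’s a rough sketch under that assumption. The key names, the policy names and the `default` catch-all are invented for the example.

```python
def route(alert: dict, routes: dict[str, list[str]], default_key: str = "default") -> list[str]:
    """Return the escalation policies that should handle `alert`.

    Hypothetical helper: alerts carry a `routing_key` field, and unknown or
    missing keys fall back to the catch-all entry.
    """
    key = alert.get("routing_key", default_key)
    return routes.get(key, routes[default_key])

# Illustrative mapping of routing keys to escalation policy names.
routes = {
    "database": ["dba-primary"],
    "frontend": ["web-primary", "web-secondary"],
    "default": ["ops-catchall"],
}

print(route({"routing_key": "frontend", "message": "p95 latency high"}, routes))
```

The catch-all entry matters more than it looks: without it, a typo in a monitoring tool’s routing key could silently drop an alert.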
This is where on-call incident management software gets a real leg up on homegrown solutions. With 100+ integrations for log management, infrastructure monitoring, communication, ticketing, security and APM tools, not including our APIs, REST endpoint and customizable webhooks, VictorOps can connect with nearly any data source you could imagine.
Rather than trying to build out all of these connections manually and making sure alerts are properly associated with the data behind them, the software does it for you. Being able to pull monitoring data into a centralized timeline, alert the proper on-call responder(s) and collaborate in real-time – all in a streamlined workflow – makes on-call suck a whole lot less.
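If one of your data sources doesn’t have a prebuilt integration, posting JSON to a REST endpoint is the usual fallback. The sketch below only builds the request and never sends it. The endpoint URL is a placeholder and the payload field names (`message_type`, `entity_id`, `state_message`) are assumptions for illustration, so check the REST endpoint documentation for the exact URL and schema before relying on them.

```python
import json
import urllib.request

def build_alert_request(endpoint_url: str, entity_id: str,
                        message_type: str, state_message: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request describing one alert."""
    payload = {
        "message_type": message_type,    # assumed field: severity, e.g. "CRITICAL"
        "entity_id": entity_id,          # assumed field: stable ID so repeat alerts group together
        "state_message": state_message,  # assumed field: human-readable detail
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URL; a real endpoint would include your own API and routing keys.
request = build_alert_request(
    "https://alerts.example.com/integrations/rest/API_KEY/ROUTING_KEY",
    entity_id="db-primary/disk",
    message_type="CRITICAL",
    state_message="Disk usage above 90%",
)
print(request.get_method(), request.full_url)
```

To actually deliver the alert you would pass `request` to `urllib.request.urlopen`, ideally with a timeout and some retry handling around it.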
Now that you’ve set up your monitoring data sources and started to take advantage of most of the alerting and on-call functionality, we can look at automating the process even more. In VictorOps, you can use an intelligent rules engine called the transmogrifier to automatically serve on-call engineers with helpful information and resources, in-line with alerts.
The transmogrifier can automatically surface charts from your monitoring tools and provide links to wiki pages for runbooks or conference calls. Before your team even sees the alert, the rules engine, based on set conditions, can initiate actions and automatically serve on-call teams with the information they need when they need it. To learn more about using a sophisticated rules engine for on-call incident response, check out our knowledge base page.
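To show the general shape of a rules engine like this, here’s a toy version that matches a field on the incoming alert against a wildcard pattern and attaches an annotation when it matches. It is an illustration of the concept only, not the transmogrifier’s actual rule syntax. The rule format, the field names and the wiki URL are all made up.

```python
from fnmatch import fnmatchcase

def apply_rules(alert: dict, rules: list[dict]) -> dict:
    """Return a copy of `alert` with annotations added by every matching rule."""
    enriched = dict(alert)
    annotations = list(alert.get("annotations", []))
    for rule in rules:
        value = str(alert.get(rule["field"], ""))
        # Case-sensitive wildcard match, e.g. "db-*" matches "db-primary/disk".
        if fnmatchcase(value, rule["pattern"]):
            annotations.append(rule["annotation"])
    enriched["annotations"] = annotations
    return enriched

# Hypothetical rule: any database alert gets a link to the database runbook.
rules = [
    {
        "field": "entity_id",
        "pattern": "db-*",
        "annotation": "Runbook: https://wiki.example.com/runbooks/database",
    },
]

alert = {"entity_id": "db-primary/disk", "state_message": "Disk usage above 90%"}
print(apply_rules(alert, rules)["annotations"])
```

The point of enriching before notification is that the responder opens the alert and the runbook link is already there, rather than having to hunt for it at 3 AM.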
Without tracking the efficiency of incident management and on-call response, it’s hard to determine if you’re actually improving or not. Being able to report on uptime and downtime and the efficiency of on-call incident response can help you see the reliability of your service and give visibility to senior management and other stakeholders.
With a homegrown solution, it’s usually hard to track incident management KPIs over time. But, in VictorOps, there are a number of reports that can help teams continuously improve on-call operations. By configuring and tracking the MTTA/MTTR, incident frequency, on-call and post-incident review reports over time, the on-call team can analyze monitoring and alerting practices and improve upon them.
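For a sense of what sits behind an MTTA/MTTR report, here’s a small worked example. It assumes each incident records when it was triggered, acknowledged and resolved, and it measures both metrics from the trigger time. The record layout and the sample timestamps are invented for illustration.

```python
from datetime import datetime
from statistics import mean

def mtta_mttr_minutes(incidents: list[dict]) -> tuple[float, float]:
    """Return (mean time to acknowledge, mean time to resolve) in minutes."""
    ack_minutes = [
        (i["acknowledged"] - i["triggered"]).total_seconds() / 60 for i in incidents
    ]
    resolve_minutes = [
        (i["resolved"] - i["triggered"]).total_seconds() / 60 for i in incidents
    ]
    return mean(ack_minutes), mean(resolve_minutes)

# Two made-up incidents: one overnight, one during the workday.
incidents = [
    {
        "triggered": datetime(2024, 1, 3, 3, 0),
        "acknowledged": datetime(2024, 1, 3, 3, 4),
        "resolved": datetime(2024, 1, 3, 3, 40),
    },
    {
        "triggered": datetime(2024, 1, 3, 14, 0),
        "acknowledged": datetime(2024, 1, 3, 14, 2),
        "resolved": datetime(2024, 1, 3, 14, 20),
    },
]

mtta, mttr = mtta_mttr_minutes(incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 3.0 min, MTTR: 30.0 min
```

Averages hide outliers, so it’s worth looking at the slowest incidents individually as well as the trend over time.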
While much of the admin’s on-call checklist is about setting the system up, that setup greatly affects the day-to-day lives of the on-call users. So, let’s take a peek at the on-call checklist for users to learn more about specific functionality and settings that can make on-call suck a whole lot less.
From the outside, paging seems like an outdated term – but it’s still quite common in the IT and DevOps space. In VictorOps, we give users the option to determine their paging policies. You can choose whether you’d like to be alerted via SMS, email, phone call or mobile app push notification.
You can set personal paging policies and customize your paging policies based on time of the day and day of the week. This not only drives on-call efficiency but it gives on-call users more autonomy about how they handle on-call responsibilities. Learn more about setting up personal paging policies or find more information about custom, time-based paging policies here.
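A time-based paging policy is essentially a small decision table. Here’s a sketch of one possible policy, with quieter methods during weekday working hours and louder ones otherwise. The hours, the method names and the ordering are assumptions, not VictorOps defaults.

```python
from datetime import datetime, time

def contact_methods(now: datetime) -> list[str]:
    """Return the notification methods to try, in order, for the given moment."""
    is_weekend = now.weekday() >= 5  # Monday is 0, so 5 and 6 are Saturday and Sunday
    in_work_hours = time(9, 0) <= now.time() < time(18, 0)
    if in_work_hours and not is_weekend:
        # At a desk: a push notification and an email are usually enough.
        return ["push", "email"]
    # Nights and weekends: escalate to methods that will actually wake someone up.
    return ["push", "sms", "phone"]

print(contact_methods(datetime(2024, 1, 3, 10, 0)))  # a Wednesday morning
print(contact_methods(datetime(2024, 1, 6, 10, 0)))  # a Saturday morning
```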
To become highly efficient at navigating incidents as they come into VictorOps, users need to know their options. When an alert comes into the timeline, users can do one of four things – acknowledge the alert, snooze the alert, reroute the alert, or resolve it.
Because your administrator should have established thought-out teams, rotations, escalation policies and routing keys, rerouting an incident should be fairly straightforward. Snooze is a great feature because it allows on-call responders to acknowledge non-urgent issues and re-initiate an alert when it’s a better time to get to it. If it’s 3 AM and an alert wakes you up but doesn’t actually need to be fixed until tomorrow, you can just snooze it.
Creating a collaborative incident response plan and getting buy-in from the entire team is necessary for rapid incident remediation. But, VictorOps makes it even easier with an incident-specific timeline, customizable alert annotations and a payload of incident details and data from your monitoring tools. This information is centralized in the VictorOps timeline and allows multiple people to find the information they need quickly and collaborate in-line with the alert data.
Annotations can surface a dial-in to a conference call, instructions in the form of runbooks or other helpful logs or charts. When a major incident hits your system, multiple teams or people can swarm to the problem and work to put out the fire. Learn more about numerous incident response tools in VictorOps such as alert transformations, annotations and the incident pane by visiting our knowledge base.
Having a single-pane-of-glass view for on-call calendars and schedules improves visibility for everyone on the team. You can see when you’re on-call, you can see who else is on-call at the same time, you can see your team’s schedule and you can better understand the way that escalation policies are set up. You can even export your calendar into third-party calendar applications like Outlook, iCal or Google Calendar if you prefer to view it that way. Better visibility allows for more flexibility in on-call schedule changes and helps make sure nobody forgets their responsibilities.
All of the tools listed above are part of making on-call suck less. But, with scheduled overrides and manual take on-call functionality, you can make short-term schedule changes and make sure on-call is always covered while giving more flexibility to users. Scheduled overrides can be used to request coverage for planned absences and manual take on-call allows a user to take someone’s on-call shift in real-time. This way, you don’t have a loss in coverage or increased confusion due to a sick day or something of that nature.
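The logic behind a scheduled override is simple: if an override covers the scheduled person at the moment an alert fires, the person who took the shift gets paged instead. Here’s a minimal sketch of that check. The field names (`covers`, `taken_by`, `start`, `end`) and the people are hypothetical.

```python
from datetime import datetime

def effective_on_call(scheduled: str, overrides: list[dict], at: datetime) -> str:
    """Return who is really on-call at `at`, honoring any active override."""
    for override in overrides:
        covers_this_person = override["covers"] == scheduled
        is_active = override["start"] <= at < override["end"]
        if covers_this_person and is_active:
            return override["taken_by"]
    return scheduled  # no active override, so the regular schedule stands

# Illustrative override: bob covers alice for a planned day off.
overrides = [
    {
        "covers": "alice",
        "taken_by": "bob",
        "start": datetime(2024, 1, 4, 0, 0),
        "end": datetime(2024, 1, 5, 0, 0),
    },
]

print(effective_on_call("alice", overrides, datetime(2024, 1, 4, 12, 0)))
```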
Less specific to the system you’re using, but still very important, is making sure users understand the incident lifecycle and the overall hierarchy for on-call alerts and actions. The incident lifecycle always goes like this: Detection > Response > Remediation > Analysis > Readiness. You can dive deeper into each stage of the incident lifecycle in our Incident Management Handbook.
Even more specifically, you can see the common logic behind on-call alerting and dive into the process flow for VictorOps in the graphic below:
Hopefully, this new team on-call checklist will help you better understand everything that DevOps, SRE and IT teams need to think about when going on-call. The first on-call shift at a new company can be scary but it’s a badge you should wear with honor. The highest performing teams today are putting developers on-call and maintaining the services they build.
Sign up for our latest free webinar, How to Make On-Call Suck Less, to learn about small, easy steps you can take to reduce the anxiety and stress associated with being on-call. Or, if you’re more of a do-it-yourselfer, try your own 14-day free trial of VictorOps today.