VictorOps is now Splunk On-Call.
Today, software developers and sysadmins alike are waking up to critical incidents at 4 AM. They’re collaboratively taking on-call responsibilities for applications, infrastructure and networks – working together to maintain uptime and availability of service operations. However, frequent alerts can easily lead to employee burnout and actually hinder service stability. Establishing a humane on-call culture will lead to happier employees, happier customers, and IT Infrastructure Library (ITIL) service operations that don’t suck.
According to IT Process Maps, the service operations lifecycle in the ITIL model includes “the fulfilling of user requests, resolving service failures, fixing problems, as well as carrying out routine operational tasks.” So, it stands to reason that effective alerting and real-time incident response processes can lead to more resilient ITIL service operations. From an efficient release management pipeline to faster incident response and more contextual alerts, ITIL service operations benefit from a holistic approach to people, process and technology.
So, we created this article as a walkthrough for promoting a humane on-call culture while simultaneously improving alerting for ITIL service operations.
In ITIL service operations, IT teams are responsible for the implementation, maintenance and upkeep of business-critical applications and infrastructure. From small internal employee requests for new equipment to managing the deployment cadence of new customer-facing features, service operations handle everything. ITIL service operations include specific instructions for event, problem and incident management. Additionally, service operations teams are in charge of access management and other IT operations controls for employee applications and infrastructure.
In ITIL, the IT operations team is tasked with handling nearly every issue within production environments and the physical facilities and office spaces themselves. Service operations teams are made up of multiple people from different disciplines. Some people help with the configuration and deployment of new applications and infrastructure, some help with incident response and remediation in production, and some help set up monitors and keyboards for new employees. While these tasks are very different from one another, all of these responsibilities fall to the same team.
In smaller organizations, one IT person can be responsible for everything from IT security to setting up the physical hardware for a new hire. It’s unfair to customers and employees alike to force IT organizations to take accountability for everything while software developers sit idly by. So, organizations are leaning into the idea of DevOps – shortening feedback loops between developers and IT professionals and allowing both teams to share ownership of service reliability.
Putting more (knowledgeable) people on-call reduces the time it takes to respond to incidents and fix problems. It also frees up time for both developers and sysadmins so they can focus on delivering more strategic value. Alerts can go straight to the person or team responsible for the error or failure – helping businesses remediate incidents faster without slowing down development speed.
DevOps isn’t about forcing IT professionals to write code and developers to configure servers. A powerful culture of DevOps is about capitalizing on the strengths of IT operations and software developers in a more efficient way. Why should an IT service desk employee spend time rerouting tickets to a DevOps team for an infrastructure issue that could have easily been sent to the DevOps team in the first place? If this happens at 3 AM, two people are woken up instead of one.
Developers should share accountability with IT professionals for the services they build and maintain. This way, alerts aren’t sent to multiple people when they only need to go to one person. Making alerting better is only partially about restricting the number of alerts coming from your system; the other half is about how those alerts are routed. Teams need to know when a feature or service is encountering performance degradation or outright downtime, so you can’t simply choose not to be alerted. But what if you could make sure alerts are sent to the right person the first time?
In most cultures where burnout and alert fatigue are prevalent, the cause is too few people taking on-call shifts and responding to too many alerts. Thoughtful separation of service ownership and intelligent alert routing and escalation can lead to on-call that doesn’t suck. Are there numerous non-critical alerts waking employees up in the middle of the night? Why don’t you delay those notifications until the morning? Should there be two people on-call for a specific service, or at certain times of day when traffic frequently spikes?
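To make the "delay it until morning" idea concrete, here is a minimal sketch in Python. It is not how any particular on-call tool implements this; the quiet-hours window (`QUIET_START`, `QUIET_END`), the severity labels and the `notify_at` function are all assumptions made up for illustration.

```python
from datetime import datetime, time, timedelta

# Hypothetical quiet-hours window; adjust to your team's schedule.
QUIET_START = time(22, 0)  # 10 PM
QUIET_END = time(8, 0)     # 8 AM


def in_quiet_hours(now: datetime) -> bool:
    """True when `now` falls inside the overnight quiet window."""
    t = now.time()
    return t >= QUIET_START or t < QUIET_END


def notify_at(severity: str, now: datetime) -> datetime:
    """Return the time at which a human should be notified.

    Critical alerts always page immediately. Anything else that
    arrives during quiet hours is held until the window ends.
    """
    if severity == "critical" or not in_quiet_hours(now):
        return now
    morning = datetime.combine(now.date(), QUIET_END)
    if now.time() >= QUIET_START:
        # Alert arrived before midnight, so the window ends tomorrow.
        morning += timedelta(days=1)
    return morning
```

With these assumed settings, a warning raised at 3 AM would be held until 8 AM the same day, while a critical alert at 3 AM would still page right away.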
The phrase “work smart, not hard” comes to mind when you’re trying to limit alert fatigue and employee burnout. There’s no reason that one person should be responding to all application and infrastructure issues 24 hours a day, 7 days a week. In fact, if there’s only one person responding to all of these issues, it’s likely they’re not the right person to solve most of them. So, why even notify this person? The alert should go directly to the proper responder the first time, limiting the number of notifications going out to humans.
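One simple way to picture "the right person the first time" is a lookup from service to owning team, with a catch-all for anything unowned. The sketch below is only an illustration under that assumption; the service names, team names and the `route` function are hypothetical and do not come from any specific product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    severity: str
    message: str


# Hypothetical ownership map: each service points at the team that builds
# and maintains it, so alerts skip the manual rerouting step.
OWNERS = {
    "checkout-api": "payments-developers",
    "postgres-primary": "database-ops",
    "office-vpn": "it-service-desk",
}

# Hypothetical catch-all for services nobody has claimed yet.
FALLBACK_TEAM = "it-service-desk"


def route(alert: Alert) -> str:
    """Return the team whose on-call responder should be notified."""
    return OWNERS.get(alert.service, FALLBACK_TEAM)
```

Here an alert for `checkout-api` would notify only `payments-developers`, so a service desk employee is never woken up just to forward it. A growing list of alerts landing on the fallback team is a useful signal that service ownership needs to be spelled out.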
As a team scales, runbooks, logs, traces, charts and other useful context can be automatically attached to alerts, helping teams respond to incidents in services they may not have previously touched. After optimizing the way alerts are routed throughout your on-call rotations, you can start to append more information to alerts and increase the odds that a first responder can actually fix the issue.
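As a rough sketch of what attaching context could look like, the Python below adds a runbook link and a dashboard link to an alert before it reaches a responder. The lookup tables, the example URLs and the `enrich` function are assumptions for illustration only; a real setup would pull this context from your own wiki and monitoring tools.

```python
# Hypothetical lookup tables; the URLs are placeholders.
RUNBOOKS = {
    "checkout-api": "https://wiki.example.com/runbooks/checkout-api",
}
DASHBOARDS = {
    "checkout-api": "https://metrics.example.com/d/checkout-api",
    "postgres-primary": "https://metrics.example.com/d/postgres-primary",
}


def enrich(alert: dict) -> dict:
    """Return a copy of the alert with any known context attached."""
    service = alert.get("service")
    context = {}
    if service in RUNBOOKS:
        context["runbook"] = RUNBOOKS[service]
    if service in DASHBOARDS:
        context["dashboard"] = DASHBOARDS[service]
    enriched = dict(alert)  # leave the original alert untouched
    enriched["context"] = context
    return enriched
```

An alert for a service with no entries still goes out, just with an empty `context`, so missing documentation never blocks a notification.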
A combination of shared accountability, automation and intelligent alerting can lead to on-call incident management that doesn’t suck. Sometimes quieting alert noise is less about turning off alerts than it is about changing the time, location and method of those notifications.
Developers and IT teams that share on-call responsibilities and take mutual ownership for their services will create a more sustainable, humane culture. And, with intelligent alerting and a plan for collaborative incident response, teams are maintaining rapid software delivery without hindering service reliability. When ITIL meets DevOps, you’ve added your last piece for building a holistic on-call alerting process that makes incident management suck less.
Both ITIL service operations and DevOps-centric organizations are making incident management suck less through alert automation, intelligent escalations and integrated on-call schedules in one centralized tool.