Have you been on-call? If you have, you understand the anxiety, the frustration and the confusion involved with it. And, if you haven’t been on-call in the past, with the growth of DevOps and the focus on developers owning their code, it’s likely you will be soon. Sharing accountability for the reliability and performance of your applications, services and infrastructure is becoming the norm. That makes it more important than ever to create a collaborative DevOps environment for software delivery and incident response – the right on-call tools help you do this.
So, we put together this post to serve as your 2020 guide to on-call tools. We’ll go through everything a development and operations team needs in their on-call toolchain to reduce alert fatigue and burnout while simultaneously delivering better customer experiences. From technical support teams to IT infrastructure engineers, efficient on-call processes and tools are at the core of efficient incident response and management – giving your team more time to write code for new features and functionality.
So, without further ado, let’s dive into the way automation and thoughtful, collaborative workflows in your on-call tooling can lead to more resilient systems and teams.
If you’re a software developer, you’ve likely never been on-call. In fact, a lot of developers aren’t even familiar with the idea of being on-call. But, the rise in microservices, cloud-based architecture, containers and the DevOps mindset is giving developers more control over their infrastructure – allowing developers to take more responsibility for uptime and service performance. The lines between IT operations and software development are blurring and silos are breaking down. Developers need to care just as much as anyone else in the organization about how their code performs in production and affects customers.
So, if you take a new job and get put into an on-call rotation, take it as a badge of honor. If your team encourages a collaborative DevOps process throughout CI/CD, testing, QA and incident response, on-call won’t be awful. The organization’s culture and the maturity of their engineering efforts will determine their dedication to reliability and, ultimately, an on-call checklist focused on proactively fixing problems. The more reactive you are in incident response, the harder it is to maintain positive customer experiences and build a culture that doesn’t wear down on-call engineers and teams.
Being on-call is a requirement for an effective DevOps culture in 2020. Everyone in the organization shares accountability for the way customers experience their applications and services. IT managers and sysadmins can’t be solely responsible for the uptime and security of their infrastructure; it needs to be a full-team effort. Tightened on-call collaboration between developers and IT engineers, alongside the business teams, will lead to happier customers and success for the business.
As you’d expect, incident management and real-time response are changing as applications and infrastructure change. Apps and services are delivered far differently than they were in the 90s and early 2000s. The growth in Agile software delivery, DevOps and the application and infrastructure technology adopted by engineers is completely changing what’s possible for incident response. Automation, machine learning and AI are already offering tons of value for ITSM (IT Service Management) and DevOps-minded engineers, and we’re only scratching the surface.
Incident management in ITSM, historically using the ITIL (IT Infrastructure Library) principles, was very simplistic. As soon as support engineers, QA teams, customers or other IT professionals detected a problem, the team would submit a ticket. The ticket would often go to a NOC or a SOC (Network Operations Center or Security Operations Center), and the issue was routed or escalated to the appropriate team. But, this model depends heavily on manual handoffs and often leads to confusion or mis-routing of alerts in larger organizations. If the NOC or SOC can’t take care of the issue with level 1 support technicians, the issue can become totally convoluted.
And, not only is it harder to triage a problem and resolve it in real-time, tickets would often get stuck or lost in a queue somewhere. Today, automatic prioritization, on-call schedules and alert routing rules are allowing teams to quickly determine the severity of a problem and get the right engineers involved to fix it. Instead of first sending an issue to an on-call support engineer who’s never worked on a service they were notified for, the alert goes directly to the person who pushed the code or initiated the configuration change that caused the incident.
On-call tools and systems will need to be configured differently for different teams. But, in 2020, there are a few key components that an on-call tool needs for any DevOps or IT organization.
Your on-call tools need to allow for calendar visibility across all teams and schedules. Even better, if you can tie your on-call rotations directly to underlying services and applications in your architecture, you can automatically fill gaps and ensure someone will be notified whenever there’s an issue. Even if you’re the on-call engineer for the frontend web development team, incidents often affect other related services, so it’s good to be able to look at on-call schedules for other cross-functional DevOps and IT departments.
With visibility across all on-call rotations and schedules directly tied to technical services and infrastructure, you can easily switch shifts with teammates and ensure no drop in coverage. This allows for improved on-call quality of life – engineers can work with their managers and teammates to give them more flexibility with their on-call schedules. Now, if you need to take a sick day or if you have a big family event planned in the future, you don’t have to worry about your on-call responsibilities – someone can get it covered.
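To make the idea of tying schedules to services a bit more concrete, here is a minimal Python sketch of how a tool might detect coverage gaps for a single service. The `Shift` structure, the `coverage_gaps` function and the sample names are all hypothetical and only illustrate the idea; they are not how any particular on-call product models its schedules.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Shift:
    """One on-call shift for one service (hypothetical structure)."""
    engineer: str
    service: str
    start: datetime
    end: datetime  # treated as exclusive


def coverage_gaps(shifts, service, window_start, window_end):
    """Return (start, end) intervals inside the window where nobody covers `service`."""
    relevant = sorted(
        (s for s in shifts if s.service == service),
        key=lambda s: s.start,
    )
    gaps = []
    cursor = window_start  # everything before the cursor is already covered or reported
    for shift in relevant:
        if shift.end <= cursor:
            continue  # this shift adds no coverage beyond the cursor
        if shift.start >= window_end:
            break  # later shifts start after the window
        if shift.start > cursor:
            gaps.append((cursor, min(shift.start, window_end)))
        cursor = max(cursor, shift.end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Sample data: the frontend service has nobody on-call on January 7.
SHIFTS = [
    Shift("alice", "frontend", datetime(2020, 1, 6), datetime(2020, 1, 7)),
    Shift("bob", "frontend", datetime(2020, 1, 8), datetime(2020, 1, 9)),
    Shift("carol", "payments", datetime(2020, 1, 6), datetime(2020, 1, 9)),
]

frontend_gaps = coverage_gaps(SHIFTS, "frontend", datetime(2020, 1, 6), datetime(2020, 1, 9))
```

A check like this could run whenever someone swaps a shift, so a trade between teammates never silently leaves a service uncovered.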
No need to send alerts into a call center or NOC in order to just get those notifications routed to you in the long run anyway. An intelligent rules engine and alert automation should come standard in your 2020 on-call tools. Based on information in an alert payload or based on the monitoring system or service firing the alert, you can determine where that alert should go. As soon as your monitoring systems begin to throw errors, those errors can be routed in real-time through your on-call schedules and notify the right person (or people) needed to fix the issue. Alert automation and routing reduce the human error involved with on-call notifications and shorten your mean time to acknowledge and resolve (MTTA/MTTR) over time.
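As a rough illustration of payload-based routing, the Python sketch below walks an ordered list of rules and returns the first team whose rule matches the alert. The field names (`source`, `service`, `severity`), the team names and the `route` function are assumptions made up for this example; a real rules engine will have its own syntax and matching options.

```python
from dataclasses import dataclass
from typing import Callable

Alert = dict  # e.g. {"source": "nagios", "service": "web", "severity": "critical"}


@dataclass
class Rule:
    """A hypothetical routing rule: a predicate on the payload plus a target team."""
    name: str
    matches: Callable[[Alert], bool]
    team: str


# Rules are evaluated top to bottom; the first match wins.
RULES = [
    Rule("database alerts", lambda a: a.get("service") == "postgres", "data-platform"),
    Rule(
        "critical web alerts",
        lambda a: a.get("service") == "web" and a.get("severity") == "critical",
        "frontend",
    ),
    Rule("synthetic monitor", lambda a: a.get("source") == "synthetic-monitor", "sre"),
]

DEFAULT_TEAM = "ops-triage"  # fallback so no alert is ever dropped


def route(alert, rules=RULES, default=DEFAULT_TEAM):
    """Return the team that should be notified for this alert."""
    for rule in rules:
        if rule.matches(alert):
            return rule.team
    return default
```

For example, `route({"service": "web", "severity": "critical"})` would go to the hypothetical `frontend` team, while an alert that matches nothing falls back to `ops-triage` for a human to triage.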
And, if alert automation doesn’t always work (which is inevitable for your system’s unknown unknowns), you can easily escalate issues, either manually or automatically. And, when you escalate an incident, you can share the entire payload with an on-call responder and communicate with them about what has already happened or what needs to happen. If you’re dealing with a major incident, on-call tools need to have the ability to notify on-call engineers via multiple escalation paths and get numerous people/teams involved. In mere seconds, you’ve mobilized engineers from all sorts of disciplines to collaboratively work on fixing issues and restoring uptime.
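Automatic escalation can be pictured as a timed policy: the longer an incident sits unacknowledged, the more people get paged. The Python sketch below assumes a hypothetical three-step policy with made-up delays and target names, purely to show the mechanics.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    """Page `targets` once the incident has been unacknowledged for `after_minutes`."""
    after_minutes: int
    targets: tuple


# Hypothetical policy: delays and target names are illustrative only.
POLICY = [
    EscalationStep(0, ("primary-on-call",)),
    EscalationStep(10, ("secondary-on-call",)),
    EscalationStep(25, ("engineering-manager", "sre-team")),
]


def targets_to_notify(policy, minutes_unacknowledged, acknowledged=False):
    """Return everyone who should have been paged by now, in escalation order."""
    if acknowledged:
        return []  # an acknowledgement stops further escalation
    notified = []
    for step in sorted(policy, key=lambda s: s.after_minutes):
        if step.after_minutes <= minutes_unacknowledged:
            notified.extend(step.targets)
    return notified
```

With this policy, an incident that has gone 12 minutes without an acknowledgement would have paged both the primary and the secondary responder; a manual escalation could simply jump straight to a later step.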
On-call tools in 2020 must provide context with alert notifications. We don’t live in an era where you simply check a pager and see that something is wrong. Now, you can be notified on your mobile phone and your computer simultaneously with a complete alert payload. And, alongside the logs and metrics associated with the alert, you can use automation to embed visualizations, dashboards and other helpful alert annotations. The faster a responder gets the context surrounding an issue, the faster they can take action on it.
With rapid notification and the appropriate context, engineers now need to be able to do something with that information. Automation can append runbooks and playbooks to an alert and give on-call engineers instructions or a process for remediating an incident. You could even build playbooks through orchestration tools such as Splunk Phantom to automatically execute a workflow or script if some sort of monitoring threshold is met. Now, you’re starting to realize the full value of automation at every stage of the incident response lifecycle.
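One simple way to picture alert annotation is a lookup table that attaches runbook and dashboard links to any alert whose payload matches certain fields. The sketch below is a hedged Python illustration: the `ANNOTATIONS` table, the field names and the `example.com` URLs are placeholders invented for this example, not a real product configuration.

```python
# Each entry says: when the payload has these field values, attach these links.
# All URLs and field names below are placeholders for illustration.
ANNOTATIONS = [
    {
        "when": {"service": "checkout"},
        "add": {
            "runbook": "https://wiki.example.com/runbooks/checkout",
            "dashboard": "https://grafana.example.com/d/checkout",
        },
    },
    {
        "when": {"check": "disk_usage"},
        "add": {"runbook": "https://wiki.example.com/runbooks/disk-usage"},
    },
]


def annotate(alert, annotations=ANNOTATIONS):
    """Return a copy of the alert with matching runbook/dashboard links attached."""
    enriched = dict(alert)  # leave the original payload untouched
    links = {}
    for entry in annotations:
        if all(alert.get(field) == value for field, value in entry["when"].items()):
            for kind, url in entry["add"].items():
                links.setdefault(kind, url)  # the first matching entry wins per link type
    enriched["annotations"] = links
    return enriched
```

The responder then receives the payload and the relevant links in a single notification, instead of hunting for the right wiki page in the middle of an incident.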
Okay, so developers and IT operations teams who take on-call shifts now have the automation and knowledge at their fingertips to quickly notify responders and give them a way to take action. But, what if the action requires a collaborative effort across multiple teams and services? The on-call tools can’t simply send alerts to engineers without giving them a place to work through problems.
Native chat, SMS, email capabilities and tight integrations with video conference software and chat tools should give your on-call solution a way to mobilize responders and communicate in real-time. No matter where engineers decide to communicate, the on-call tool can create a central repository of your incident response history – from chat to rollbacks executed, etc. Then, integrated with incident management and ticketing services like ServiceNow or Jira, you can automatically update documentation without taking your focus off the firefight at hand.
On-call tools start simply with schedules and alert automation but only truly finish the job when they offer a platform for real-time, collaborative incident response.
With all the incident documentation at your disposal, you can use detailed reports and historical data to improve the efficiency of on-call teams over time. Post-incident reviews can be conducted around real events that occurred during incident response and action items can be assigned to make on-call suck less the next time around.
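As a small example of the kind of report this history makes possible, the Python sketch below computes mean time to acknowledge and mean time to resolve from a list of incident records. The record layout (`triggered`, `acknowledged`, `resolved` timestamps) and the sample data are assumptions for illustration; incidents that were never acknowledged or resolved are simply skipped in the corresponding average.

```python
from datetime import datetime, timedelta


def mean_duration(incidents, start_key, end_key):
    """Average time between two timestamps, ignoring incidents missing the end one."""
    durations = [
        incident[end_key] - incident[start_key]
        for incident in incidents
        if incident.get(end_key) is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident history for one team.
t0 = datetime(2020, 3, 2, 9, 0)
INCIDENTS = [
    {"triggered": t0, "acknowledged": t0 + timedelta(minutes=2), "resolved": t0 + timedelta(minutes=30)},
    {"triggered": t0, "acknowledged": t0 + timedelta(minutes=4), "resolved": t0 + timedelta(minutes=60)},
    {"triggered": t0, "acknowledged": None, "resolved": None},  # still open
]

mtta = mean_duration(INCIDENTS, "triggered", "acknowledged")
mttr = mean_duration(INCIDENTS, "triggered", "resolved")
```

Tracking these two numbers per team or per service over time gives post-incident reviews a concrete baseline to improve against.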
In 2020, your on-call tool should allow you to quiet unactionable alerts while consistently surfacing the ones that need attention. This balances the needs of customers with a development and IT culture that avoids burnout and alert fatigue. For CIOs and CTOs, this means you can attract top talent and ensure employee morale without worrying about lost revenue or costs of downtime due to unreliable systems. For engineers, this means everyone can share accountability for the performance and uptime of their systems without living in anxiety or fear of that next on-call notification. For customers, this means you can depend on the application or service and will likely receive faster response when bugs or incidents do pop up.
VictorOps is the ultimate tool for on-call scheduling, alert automation and real-time incident response. Try out a free trial today to learn how our collaborative approach to on-call notifications and incident response makes on-call suck less.