Complex infrastructure, distributed systems, CI/CD, and Agile development practices are changing the way we build and maintain services. Teams are building more in less time than ever before. Naturally, this creates an even stronger need for engineers and IT professionals to share accountability for code and collectively handle on-call responsibilities.
We’re at a crossroads: some organizations are adopting DevOps practices while others continue to rely on traditional ITSM (IT Service Management) and ITIL (IT Infrastructure Library) principles. But no matter your structure or maturity in IT operations and software development, you’ll need people on-call. On-call tools can help your team manage and speed up incident response, leading to less downtime and more reliable services.
Being on-call isn’t easy. But effective on-call management and the right tools can certainly make on-call suck less. By working through your on-call process and addressing specific pain points in workflows, you can make the on-call experience better. To start, let’s list a few common pains of being on-call and dive into ways on-call tools can help you manage them.
On-call tools need to be intelligent enough to strike the balance between over-alerting and under-alerting. If monitoring and alerting tools aren’t properly notifying on-call responders and prioritizing alerts based on severity, the result is alert fatigue. Alert fatigue leads to tired on-call responders, confusion, and slower incident response, and it ultimately hurts the overall reliability of your system.
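To make the over-alerting versus under-alerting balance a bit more concrete, here is a minimal sketch of severity-based triage. Everything in it is an assumption for illustration: the `Alert` fields, the three severity names, and the `page`/`notify`/`log` actions are hypothetical and not taken from any particular on-call product.

```python
from dataclasses import dataclass

# Hypothetical mapping from severity to what the on-call tool should do.
SEVERITY_ACTIONS = {
    "critical": "page",    # interrupt the on-call responder
    "warning": "notify",   # send to chat or email, no page
    "info": "log",         # record only
}


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str


def triage(alerts):
    """Drop exact duplicate alerts and group the rest by action."""
    seen = set()
    actions = {"page": [], "notify": [], "log": []}
    for alert in alerts:
        if alert in seen:
            continue  # identical alert already handled; suppress the repeat
        seen.add(alert)
        # Unknown severities fall back to "notify" so nothing is silently lost.
        action = SEVERITY_ACTIONS.get(alert.severity, "notify")
        actions[action].append(alert)
    return actions


incoming = [
    Alert("api", "latency", "critical"),
    Alert("api", "latency", "critical"),  # duplicate, suppressed
    Alert("db", "disk", "warning"),
    Alert("web", "deploy", "info"),
]
result = triage(incoming)
print([a.service for a in result["page"]])
```

Only the critical alert pages someone here, and the repeated copy is dropped. A real tool would likely deduplicate within a time window rather than forever, but the idea is the same: fewer, better-prioritized interruptions.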
Both developers and IT professionals will complain about a lack of alert context when they’re put on-call. Alerts should be delivered with relevant information in order to help on-call responders immediately take charge and work toward a solution. It’s especially nice when alerts are served in a centralized location for both new deployments and incidents currently in production. This way, DevOps teams can easily share context, escalate issues, and collaborate to remediate the incident.
This pain is especially common when teams try to build their own on-call management tools. Typically, these homegrown on-call systems are simple, lack scheduling visibility, and don’t offer many options for customization. Most, if not all, of the tasks and changes to on-call schedules are managed by administrators, which means a large time commitment for managers. And even then, there’s little visibility across the team as to who’s on-call and when. This can cause gaps in on-call coverage, missed alerts, and slower incident response.
This isn’t a pain point for on-call alone, but siloed teams contribute to a much more difficult on-call process. By integrating developers and IT operations into a DevOps-focused culture, collaboration deepens and everyone across the organization gains more exposure to systems in production. From the development of new features to incident management and resolution, DevOps makes on-call suck less by spreading a more holistic understanding of your system across more people. On-call management tools should help break down silos and build tighter relationships across your company.
You can’t avoid some of the difficulties that come with on-call responsibilities. But an intelligent process for on-call management combined with people-centric on-call tools can certainly make on-call easier. Continuously iterating on alert rules, monitoring thresholds, and on-call workflows will lead to more efficient incident response and remediation.
At the core of any team, DevOps or otherwise, managing on-call tools and processes needs to be people-centric. Everything you build into your on-call response and remediation strategy and tooling should be there to improve the lives of the people on your team. Making on-call suck less starts by allowing teammates to access the information they need when they need it, receive alert notifications through their preferred channels (SMS, phone, email, etc.), and communicate easily with others.
In both ITSM and DevOps, people are the key to operational efficiency and business success. On-call tools are supplemental and should only be used when they increase the productivity of the people behind them. So, we wanted to lay out a few helpful tools and capabilities for managing on-call teams.
Your on-call tools should offer the ability to see, update, and adjust on-call schedules and calendars. You should be able to create a number of different on-call teams, set rotations, and assign associated escalation policies. This gives you granular control over on-call scheduling and offers teamwide visibility into on-call calendars.
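If rotations and escalation policies feel abstract, the sketch below shows one simple way to model them. It assumes a fixed-length weekly rotation and an escalation policy expressed as `(delay_minutes, who)` pairs sorted by delay. The function names, the people, and the delays are all made up for illustration.

```python
from datetime import date


def on_call_for(rotation, start, day, shift_days=7):
    """Return who is on call on `day` for a rotation that began on `start`."""
    shifts_elapsed = (day - start).days // shift_days
    return rotation[shifts_elapsed % len(rotation)]


def escalation_target(policy, minutes_unacknowledged):
    """Return the furthest escalation step whose delay has already elapsed.

    `policy` is a list of (delay_minutes, who) pairs sorted by delay.
    """
    target = None
    for delay_minutes, who in policy:
        if minutes_unacknowledged >= delay_minutes:
            target = who
    return target


# Hypothetical team and policy.
rotation = ["ana", "ben", "chi"]
rotation_start = date(2024, 1, 1)
policy = [(0, "primary"), (15, "secondary"), (30, "manager")]

print(on_call_for(rotation, rotation_start, date(2024, 1, 8)))
print(escalation_target(policy, 20))
```

Because the schedule is just data, anyone on the team can answer "who’s on-call next week?" without asking an administrator, which is the kind of visibility you want.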
On-call incident response is all about collaboration. Providing multiple avenues for communication and alert notification is essential to speedy incident response. Let your team communicate where they’re already working, whether that’s through video chat, phone calls, SMS, emails, or chat apps (Slack, HipChat, Microsoft Teams, etc.). Ease of communication helps on-call responders get help quickly and feel like they’re not stranded on an island.
Alerts with context allow on-call responders to know exactly what’s happening as soon as an alert comes in. On-call tools should be able to automatically attach applicable runbooks, annotations, log data, or other related incident details as soon as an alert comes through. This way, whoever receives the alert can immediately run through instructions and find the information they need to start resolving an incident.
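Here is a rough sketch of what attaching context to an alert could look like. It assumes alerts arrive as plain dictionaries with `service` and `check` keys and that runbooks are looked up from a simple table. The `RUNBOOKS` table, the `example.com` URL, and the `enrich` function are hypothetical placeholders, not a real tool’s API.

```python
# Hypothetical runbook lookup keyed by (service, check).
RUNBOOKS = {
    ("api", "latency"): "https://wiki.example.com/runbooks/api-latency",
}


def enrich(alert, runbooks=RUNBOOKS, recent_logs=None, max_log_lines=5):
    """Return a copy of the alert with a runbook link and recent log lines."""
    enriched = dict(alert)  # copy so the original alert is left untouched
    key = (alert["service"], alert["check"])
    enriched["runbook"] = runbooks.get(key)  # None when no runbook exists yet
    enriched["recent_logs"] = list(recent_logs or [])[-max_log_lines:]
    return enriched


alert = {"service": "api", "check": "latency", "severity": "critical"}
logs = [f"log line {i}" for i in range(10)]
enriched = enrich(alert, recent_logs=logs)
print(enriched["runbook"])
print(enriched["recent_logs"])
```

The responder gets the runbook link and the last few log lines in the same payload as the alert, instead of hunting for them after being paged.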
Now, we won’t dive into too much detail about these specific tools in this post, but you should check them out if you’re in the market for helpful on-call tools. Using a combination of the tools listed below can help shorten the incident lifecycle and drive operational efficiency. Learn more about these tools below:
Now we may be biased, but VictorOps helps centralize all the tools listed above with your on-call scheduling, alert routing, and escalation functionality. Sign up for a 14-day free trial to try it for yourself and start making on-call suck less.