VictorOps is now Splunk On-Call!
Being on-call in DevOps and IT means being available to both customers and employees in a number of different ways. You need to be available for real-time incident remediation and open to contact from nearly any channel (e.g. Slack, SMS, phone calls, email, etc.). If a server crashes and customers experience downtime, every second costs money and results in lost business.
Incidents are inevitable in today’s world of integrated software development and IT operations, DevOps, and the faster delivery of more complex systems. In order to maintain resilient services, DevOps and IT teams need rapid incident notifications and appropriate context to quickly diagnose problems and fix issues. Automated call routing is one surefire way to keep customers, help desks and engineering teams connected and informed.
Automation can help end-users, whether internal or external, to reach the proper team and allows teammates to collaborate faster. From initial notification to the real-time firefighting conference bridge, automation fits hand in hand with call routing to drastically reduce MTTA and MTTR over time. Let’s quickly look at what it takes to build an incident response process and how DevOps and IT teams are using automated call routing to make it even more efficient.
The incident management lifecycle falls into five stages – detection, response, remediation, analysis and preparation. But the largest impact on customers and employees alike comes during the incident response phase. According to our State of On-Call Report, 73% of the time during an incident’s lifespan is spent in the response stage. And 32% of people questioned said that triage (gaining situational awareness and getting others involved) presents the most challenges, while 19% thought real-time investigation was the most difficult part of the incident lifecycle.
So, it stands to reason that faster notifications with deeper context and better collaboration across teams will make the largest impact on major incident response. How can you inform on-call responders about a specific issue faster? Can you give them the information they need so they can escalate incidents to the right person or team and get incidents resolved faster? And, once the proper context is served to the right people, do you have processes and tools in place to improve real-time collaboration during a firefight?
Major incident response requires a centralized source for alert context and collaboration, which reduces confusion and increases visibility during a firefight. Even with disparate alerts across multiple applications and infrastructure, the team can then diagnose problems and act on the real root cause of an incident.
Jumping on live conference calls can help distributed teams work faster during incident response while also keeping records of all communication that occurred. So, we’ll dive into the details of call routing during incident response and how automation is being used to make the process more efficient.
In DevOps, automation is a core principle. Automating workflows reduces human error and improves the way people, processes and technology interact with each other. Combined, automated call routing and alerting will not only improve incident response speed but also create more positive experiences for end-users. Let’s take a look at some of the ways automated call routing can improve collaboration during a firefight and bolster incident visibility.
A live call routing system can be used to get problems to the right person faster. If a customer or internal stakeholder calls in to report an incident, they can simply press 1, 2 or 3 to escalate the incident to different teams (NOC, development, tier 2 support, etc.). Incidents reported by external customers will probably get fewer options for escalation, but the general process still works the same way.
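As a rough illustration of the keypad routing described above, here is a minimal Python sketch. It is not how VictorOps or any particular phone system implements routing; the menu contents, the `DEFAULT_TEAM` fallback and the `route_call` function are all assumptions made for the example, with team names borrowed from the paragraph above.

```python
# Hypothetical keypad menus. Team names come from the examples in the text;
# the digit assignments are assumptions for illustration only.
INTERNAL_MENU = {"1": "NOC", "2": "development", "3": "tier 2 support"}

# In this sketch, external callers get fewer escalation options.
EXTERNAL_MENU = {"1": "tier 2 support"}

# Assumed fallback when the caller presses a digit that is not on their menu.
DEFAULT_TEAM = "help desk"


def route_call(digit: str, caller_is_internal: bool) -> str:
    """Return the team that should receive the call for a given keypress."""
    menu = INTERNAL_MENU if caller_is_internal else EXTERNAL_MENU
    return menu.get(digit, DEFAULT_TEAM)
```

With these menus, an internal caller who presses 2 is routed to development, while an external caller who presses 2 falls back to the help desk because that option is not on the external menu.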
Integrated with an incident response and management tool like VictorOps, teams can easily create incidents and reroute alerts to the proper team. Then, they can easily add callers to a conference bridge and indirectly, yet almost instantaneously, connect customers to developers who need to fix an issue. Automated call routing gives customers a voice and gives help desks an easy way to prioritize issues and escalate them appropriately in real time.
Once the incident has been created and the right responders are on a call together, they can start triaging the issue. Initial responders should attach as much context as possible to escalated alerts and explain the situation thoroughly. That way, once the conference call is going and developers and IT professionals from disparate services are collaborating, they have the information they need right away.
The initial reporter of the incident feels appeased, on-call responders have the context they need and everyone knows the issue is being looked at. Instructions in the form of runbooks and monitoring data can be attached to alerts to show exactly what’s going wrong and how to fix it. So, the fix can be as easy as executing a command or two, or rolling back a deployment. And, once the issue is resolved, you can reach back out to the customer or end-user who initially noticed the problem and give them the good news.
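The hand-off described above, where context and runbook steps travel with the incident and the reporter hears back once it is fixed, can be sketched as a small data structure. This is only a sketch under stated assumptions: the `Incident` class, its fields and its methods are hypothetical names chosen for illustration and do not correspond to any VictorOps API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record that carries context to responders."""

    summary: str
    routed_to: str   # team chosen during call routing
    reporter: str    # customer or stakeholder who called in
    notes: List[str] = field(default_factory=list)          # situational context
    runbook_steps: List[str] = field(default_factory=list)  # how to fix it
    resolved: bool = False

    def add_context(self, note: str) -> None:
        """Attach a note so later responders see it right away."""
        self.notes.append(note)

    def resolve(self) -> str:
        """Mark the incident resolved and return a follow-up message."""
        self.resolved = True
        return f"Follow up with {self.reporter}: '{self.summary}' is resolved."
```

A help desk agent could create an `Incident`, call `add_context` with what the caller reported, add a runbook step such as rolling back a deployment, and use the message returned by `resolve` to close the loop with the person who first noticed the problem.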
Automated call routing creates a seamless integration between end-users, help desks, engineering teams and the technical applications and infrastructure they support.
But automated call routing is only one small part of incident management and incident response. While it helps connect all of the processes involved in major incident response and on-call notifications, call routing is meaningless without highly observable services and collaborative workflows. Once responders are equipped with better information and a way to act upon that information, you’ve built a fully integrated on-call incident management process.
It’s not enough to alert on-call responders to a problem. You have to help them quickly understand what the problem is. Automated call routing helps teams report issues faster, surface situational context and actively communicate about issues. If you record conference calls during a firefight, you also have more fodder for post-incident reviews – helping you improve collaboration in the future.
Learn more about using a live call routing system to make on-call incident management suck less. Sign up for a 14-day free trial or register for a free personalized demo to see how your own team can start using automated call routing and contextual alerting to lower MTTA and MTTR.