Automated ChatOps in Incident Response

Dan Holloran December 12, 2018

The pace of CI/CD has increased significantly since Agile and DevOps became mainstream. Today, development teams thrive on collaboration and conversation tools that allow them to work together and produce better software.

Despite the evolution of collaboration practices and tools, teams still face familiar communication challenges, such as those arising from differences in culture, language, and time zone.

Incident Management Challenges

It’s not possible to overstate the need for seamless, highly available, and effective two-way communication in the incident management lifecycle. Organizations often struggle to find and adopt a common collaboration platform that keeps everyone on the same page during incident detection, response, and remediation. Though many tools used by DevOps teams are automated, they still require human intervention.

Take a service failure as an example. The monitoring tool reports an incident to the operations team. Then, to follow up on this alert, the team has to:

  • Create a ticket in the tool the engineering team uses, with all the relevant information needed to diagnose the issue
  • Escalate the ticket so other teams don’t miss it and someone starts looking into it right away
  • Keep everyone in sync as the status of the incident changes

To automate this workflow, the monitoring tool should automatically detect irregularities (e.g. error rates surpassing a threshold or a critical failure), send an alert to the DevOps team, generate a ticket with the relevant information in a tool such as Jira or ServiceNow, and escalate it to the right person on the engineering team.
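The detect, alert, ticket, and escalate steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 5% threshold, the `IncidentPipeline` and `Ticket` classes, and the on-call map are hypothetical, and the in-memory lists stand in for real paging and ticketing API calls.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ERROR_RATE_THRESHOLD = 0.05  # hypothetical: alert when more than 5% of requests fail


@dataclass
class Ticket:
    service: str
    summary: str
    assignee: str
    status: str = "open"


@dataclass
class IncidentPipeline:
    # Hypothetical on-call map: service name -> engineer to escalate to.
    on_call: Dict[str, str]
    alerts: List[str] = field(default_factory=list)
    tickets: List[Ticket] = field(default_factory=list)

    def check(self, service: str, error_rate: float) -> Optional[Ticket]:
        """Run one monitoring sample through detect -> alert -> ticket -> escalate."""
        if error_rate <= ERROR_RATE_THRESHOLD:
            return None  # within the normal range, nothing to do
        summary = (
            f"{service} error rate {error_rate:.1%} "
            f"exceeds {ERROR_RATE_THRESHOLD:.0%}"
        )
        self.alerts.append(summary)  # stand-in for alerting the DevOps team
        assignee = self.on_call.get(service, "unassigned")  # escalation target
        ticket = Ticket(service, summary, assignee)  # stand-in for a ticketing API call
        self.tickets.append(ticket)
        return ticket


pipeline = IncidentPipeline(on_call={"checkout": "alice"})
print(pipeline.check("checkout", 0.02))  # healthy sample, no ticket
print(pipeline.check("checkout", 0.12))  # unhealthy sample, ticket assigned to on-call
```

In a real setup each stand-in would be replaced by the integration your tools provide; the point is only that the whole chain runs without a human copying data between systems.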

As an incident responder works through an issue, ChatOps can keep everyone informed and automatically update tickets as the team moves through incident workflows. However, this is easier said than done; studies indicate that shifting from one application to another results in context switching, reducing the efficiency and productivity of the teams.

ChatOps Offers a Way Forward

ChatOps is gaining popularity as a means to make incident management more agile and less taxing for the teams involved. In fact, in a recent survey we conducted, incident response emerged as a primary use case for ChatOps, taking precedence over ticket tracking, running commands in-line with chat, and human collaboration.

ChatOps brings your applications, collaboration tools, people, processes, and automation together into a single transparent workflow. It moves both the communication around software development and operational tasks and their execution onto a common platform.

With the help of ChatOps, you can bring service owners, SREs and on-call engineers together to:

  • Monitor, detect, and act upon incidents without switching between platforms, building a smooth incident management workflow
  • Get sufficient incident data so ITOps and DevOps teams can acknowledge and resolve incidents directly
  • Keep your teams updated on the status of incidents, minimizing alert noise
  • Generate incident history for learning and post-incident reviews
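As a rough illustration of the last point, chat messages captured during an incident can be turned into a timeline for the post-incident review. The sketch below assumes a hypothetical `ChatEvent` record and made-up sample messages; a real chat tool would supply its own export format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class ChatEvent:
    at: datetime
    author: str
    text: str


def build_timeline(events: List[ChatEvent]) -> str:
    """Render chat events in chronological order as a plain-text timeline."""
    lines = []
    for event in sorted(events, key=lambda e: e.at):
        stamp = event.at.strftime("%H:%M:%S")
        lines.append(f"{stamp} {event.author}: {event.text}")
    return "\n".join(lines)


# Made-up sample data, deliberately out of order.
events = [
    ChatEvent(datetime(2018, 12, 12, 9, 5, 0, tzinfo=timezone.utc),
              "alice", "acknowledged"),
    ChatEvent(datetime(2018, 12, 12, 9, 1, 30, tzinfo=timezone.utc),
              "monitor-bot", "checkout error rate high"),
    ChatEvent(datetime(2018, 12, 12, 9, 20, 0, tzinfo=timezone.utc),
              "alice", "resolved after rollback"),
]
print(build_timeline(events))
```

Because the conversation and the actions happen in the same channel, the timeline comes almost for free instead of being reconstructed from memory afterward.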

Automated ChatOps in Incident Response

You can consider using automated ChatOps tools to further accelerate your incident response. For this purpose, teams have already started integrating chatbots that can automate conversations, call an API, reset a server, and trigger processes both internally and externally.

One of the most common chatbots used in this area is Hubot, which was originally developed by GitHub. It has one of the most comprehensive sets of scripts for managing interactions with third-party services. Other common examples include Lita, a bot written in Ruby; Errbot, an open-source Python project; Cog, which is extensible in any language; and YetiBot, which is written in Clojure.

At present, all of the chatbots above require a very specific syntax to execute commands, which means there is a learning curve involved. However, some teams inspired by JARVIS are working to integrate NLP (natural language processing) capabilities into these bots. With this capability, you would be able to say, “show me the time graph for XYZ system,” or “let me see the last 10 lines in ABC system log.”
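To make the "very specific syntax" point concrete, here is a minimal sketch of a strict command parser. It does not reproduce the syntax of Hubot or any other bot named above: the `!<command> <system> [count]` grammar, the handler names, and the default of 10 lines are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical grammar: "!<command> <system> [count]"
COMMAND_PATTERN = re.compile(
    r"^!(?P<name>\w+)\s+(?P<system>[\w-]+)(?:\s+(?P<count>\d+))?$"
)


def tail_log(system: str, count: int) -> str:
    # Stand-in for fetching log lines from a real system.
    return f"last {count} lines of {system} log"


def show_graph(system: str, count: int) -> str:
    # Stand-in for rendering a graph; the count argument is ignored here.
    return f"time graph for {system}"


HANDLERS: Dict[str, Callable[[str, int], str]] = {
    "tail": tail_log,
    "graph": show_graph,
}


def handle_message(text: str) -> Optional[str]:
    """Return a reply for a well-formed command, or None for ordinary chat."""
    match = COMMAND_PATTERN.match(text.strip())
    if match is None:
        return None  # not a command, so the bot stays silent
    handler = HANDLERS.get(match.group("name"))
    if handler is None:
        return f"unknown command: {match.group('name')}"
    count = int(match.group("count") or 10)  # assumed default of 10
    return handler(match.group("system"), count)


print(handle_message("!tail ABC 10"))
print(handle_message("let me see the last 10 lines in ABC system log"))
```

The second message is perfectly clear to a human but is ignored by the parser, which is the gap an NLP layer would aim to close.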

In the age of Alexa and Siri, such bots are likely to become common in development and operations soon. By automating incident management workflows, teams can reduce their MTTA/MTTR and drastically cut the cost of an outage.

Leverage the full power of your automation, monitoring, alerting and collaboration tools within a centralized incident management tool. Try a 14-day free trial of VictorOps to start making on-call suck less across your entire organization.
