Making the Most of Machine Learning for Incident Management

Dan Holloran September 10, 2019


With today’s level of automation in software development and delivery, along with continuous improvement to CI/CD pipelines, keeping up with production needs becomes harder and harder. So, how can you improve release management and deploy new features and services at breakneck speeds without hindering service reliability? To keep up with rapid development speeds, DevOps and IT operations teams need to learn how they can leverage machine learning capabilities for the upkeep of production environments.

Machine learning is influencing the way DevOps and IT operations teams are managing their applications and infrastructure. The rise in machine learning (ML) and artificial intelligence (AI) is allowing teams to create more reliable systems with predictive analytics and historical data. Through both supervised and unsupervised learning, operations teams are learning to use machine learning and artificial intelligence to create incident management systems for MLOps and AIOps.

In this post, we’ll focus on machine learning and AI in DevOps and IT operations. By the end of the article, you should better understand how DevOps and IT operations teams can use machine learning to make on-call suck less and drive rapid incident remediation.

Incident management basics

Before we can look at how MLOps is being applied to incident management, we need to first break down the incident management lifecycle. From initial on-call notification to the post-incident review and analysis, machine learning can create highly efficient processes that improve the speed, transparency and collaboration of incident management.

Stage 1) Detection

How do you identify an incident in production? Incident detection depends on a combination of different monitoring practices. Monitoring logs, metrics and traces over time can help you determine what a “healthy” system looks like. Once you’ve established a baseline and set benchmarks for health, you can build alert rules on that data and improve the speed at which you detect problems in your applications and infrastructure.
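
As a rough illustration of turning a healthy baseline into an alert rule, the sketch below flags a metric value that strays too far from its historical mean. The sample values, the function names and the three-sigma cutoff are all assumptions made for this example, not settings taken from any particular monitoring tool.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize 'healthy' metric samples as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, max_sigma=3.0):
    """True when value is more than max_sigma standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > max_sigma


# Hypothetical latency samples (ms) collected while the service was healthy.
healthy_latency_ms = [102, 98, 101, 97, 103, 99, 100, 100]
baseline = build_baseline(healthy_latency_ms)

for observed in (101, 140):
    if is_anomalous(observed, baseline):
        print(f"ALERT: latency of {observed} ms is outside the healthy baseline")
```

A fixed sigma cutoff is the simplest possible rule. Real systems usually also need to account for seasonality and slow drift in what counts as healthy.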

Stage 2) Response

Once the incident has been detected, the team needs to figure out exactly what’s going on and how they can respond to the issue. The response phase is all about triaging the alerts coming into the system and routing those incidents to the proper responders. The better you can connect people and technical systems through processes and tools, the faster you can respond to issues. An integrated solution for on-call scheduling, intelligent alert routing and real-time communication can lead to faster incident response and help teams maintain more reliable applications and infrastructure.
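
To make the routing idea concrete, here is a minimal sketch of a rule table that maps an alert to an on-call rotation. The service names, team names and the `Alert` fields are hypothetical, and the page-versus-notify rule is only one possible policy.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    severity: str  # "critical", "warning" or "info"
    message: str


# Hypothetical routing table: service name -> on-call rotation.
ROUTES = {
    "checkout-api": "payments-oncall",
    "etl-pipeline": "data-oncall",
}
DEFAULT_ROUTE = "platform-oncall"


def route(alert):
    """Return (team, action); only critical alerts page a responder."""
    team = ROUTES.get(alert.service, DEFAULT_ROUTE)
    action = "page" if alert.severity == "critical" else "notify"
    return team, action


print(route(Alert("checkout-api", "critical", "5xx rate spike")))
```

Keeping a default route means an alert from an unknown service still reaches a person instead of being dropped.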

Stage 3) Remediation

Detecting incidents quickly and navigating the response process are typically the most time-consuming parts of the incident management lifecycle. With the right tools and context, on-call responders can almost immediately begin taking action to remediate the problem. With ChatOps tools like Hubot, DevOps and IT professionals can even run commands or execute rollbacks through integrated chat tools. That way, teams can stay transparent about what’s happening while they simultaneously take action to restore services.

Stage 4) Analysis

Once service is restored and customers are happy, the team can start to assess what happened. The applicable engineering and IT teams can conduct post-incident reviews and learn from past incidents to improve the first three steps of the incident lifecycle. For powerful incident analysis, DevOps and IT teams need to do a good job of tracking communication and actions taken during a firefight – helping them paint a real picture of their incident response and remediation process.

Stage 5) Preparation

Now, the team can prepare for future incidents. The team has learned about how they work together, how the system functions and how they can improve. With knowledge and historical incident context at your fingertips, you can better prepare for future incidents and ensure the team has the resources they need to make on-call suck less. The rise of machine learning use cases in DevOps and IT is leading to more prepared teams and better processes for incident management.

The rise of DevOps, MLOps and AIOps

Over the last decade or so, developers and IT teams have completely changed the way they work together. DevOps gave rise to shorter feedback loops, earlier testing in the development lifecycle and better collaboration between engineering and IT operations. People have found better ways to work together and deliver consistently reliable software at breakneck speeds. All the while, machine learning and artificial intelligence technology has been continuously refined and applied to a growing range of problems.

So, IT operations and DevOps teams have also started to take advantage of machine learning and AI, giving rise to the terms MLOps and AIOps. Machine learning through supervised and unsupervised methods can help servers, networks and applications learn from their own historical behavior. With the right implementation of AI and machine learning, the system can learn how to self-correct problems, identify common recurring issues and improve the overall incident management process.

MLOps and AIOps are leading to more reliable systems and better customer experiences – all while making on-call suck less for employees. So, let’s dive into some more specific use cases and examples of machine learning already being implemented by DevOps and IT teams.

Use cases and examples of machine learning in DevOps and IT operations

  • Similar incidents and suggested responders

Do you have that one engineer on your team who seems to have already handled every issue in the book? Often, this person has been in the organization for a long time and basically built the application or service from the ground up. Think about this person, but backed by the powerful computing capabilities of today’s technology. Over time, continuous monitoring and alerting of your applications and infrastructure can lead to more data and, ultimately, deeper system intelligence.

With more knowledge, the system learns more about itself and can proactively take action to improve service transparency and overall reliability. Some examples of machine learning in DevOps and IT manifest themselves in our own product, VictorOps. As your team receives more alerts into VictorOps, the system learns how your team commonly responds to issues and can actually suggest responders based on similar incidents from the past. Every time you respond to an incident and remediate the problem, you become more efficient.
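
One simple way to picture responder suggestions is to compare a new alert's text against past alerts and rank the people who resolved the closest matches. The sketch below uses Jaccard similarity over word sets. It is an illustrative stand-in, not a description of how VictorOps implements the feature, and the alert texts, responder names and thresholds are invented for the example.

```python
from collections import Counter


def tokens(text):
    """Lowercase, whitespace-split word set for an alert message."""
    return set(text.lower().split())


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_responders(new_alert, history, top_k=2, min_similarity=0.2):
    """Rank responders by summed similarity of their past alerts to new_alert."""
    new_tokens = tokens(new_alert)
    scores = Counter()
    for past_alert, responder in history:
        similarity = jaccard(new_tokens, tokens(past_alert))
        if similarity >= min_similarity:
            scores[responder] += similarity
    return [name for name, _ in scores.most_common(top_k)]


# Hypothetical incident history: (alert text, who resolved it).
history = [
    ("database connection pool exhausted on orders-db", "alice"),
    ("orders-db replication lag high", "alice"),
    ("checkout-api latency above threshold", "bob"),
    ("disk usage high on log-server", "carol"),
]

print(suggest_responders("orders-db connection pool exhausted", history))
```

Word overlap is crude, so a production system would more likely use richer features such as the affected service, alert source and time of day.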

  • Event correlation

Services like Splunk’s IT Service Intelligence (ITSI) allow you to monitor nearly any source of IT data, from metrics to logs to traces, and use AI backed by machine learning to proactively identify application or infrastructure problems. By pushing all of your IT data into a solution like Splunk ITSI, you can leverage machine learning to correlate related alerts into single events. This way, you can rapidly detect the root cause of a major issue instead of responding to each individual alert as a separate issue.

During a major incident in production, a single failure can trigger many alerts because of highly integrated dependencies and third-party applications. Machine learning and AI help DevOps and IT teams see through that noise and work faster toward incident resolution.
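
The grouping step can be sketched with a plain time-window heuristic: alerts that arrive close together are folded into one event. This is a deliberately simple stand-in for the learned correlation described above, not Splunk ITSI's method, and the 120-second window and alert messages are assumptions.

```python
def correlate(alerts, window_seconds=120):
    """Group (timestamp, message) alerts into events.

    An alert joins the current event when it arrives within window_seconds
    of the previous alert in that event; otherwise it starts a new event.
    """
    events = []
    for timestamp, message in sorted(alerts):
        if events and timestamp - events[-1][-1][0] <= window_seconds:
            events[-1].append((timestamp, message))
        else:
            events.append([(timestamp, message)])
    return events


# Hypothetical alert stream: seconds since some start time, plus a message.
alerts = [
    (0, "orders-db: connection refused"),
    (60, "checkout-api: 5xx rate high"),
    (150, "payments-gateway: timeout calling checkout-api"),
    (1000, "log-server: disk usage high"),
]

for event in correlate(alerts):
    print(f"event with {len(event)} alert(s), first: {event[0][1]}")
```

Here the first three alerts collapse into one event, so a responder sees a single incident rooted near the earliest alert rather than three separate pages.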

  • Proactive incident detection, prediction and prevention

Also with monitoring and data analysis services like Splunk ITSI, you can use predictive analytics and machine learning to preemptively trigger alerts into an incident response tool like VictorOps. It’s as if the application or service can tell you when it’s feeling cold symptoms before it becomes a full-blown sickness. When the system identifies something wrong with itself (e.g. ETL lag that has signaled major incidents in the past), it will proactively send out an alert so engineers or sysadmins can work toward a resolution before the issue affects customers.
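
A basic form of this kind of prediction is to fit a trend line to recent readings and estimate when it will cross a known danger threshold. The sketch below does that with ordinary least squares. The ETL lag readings and the 20-minute threshold are made-up values, and a linear trend is an assumption that real workloads often violate.

```python
def fit_line(ys):
    """Least-squares (slope, intercept) for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def steps_until_breach(ys, threshold):
    """Estimated sampling intervals until the trend reaches threshold.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line has already reached the threshold.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    crossing_x = (threshold - intercept) / slope
    return max(0.0, crossing_x - (len(ys) - 1))


# Hypothetical ETL lag readings in minutes, one per sampling interval.
etl_lag_minutes = [2, 4, 6, 8, 10]
remaining = steps_until_breach(etl_lag_minutes, threshold=20)
if remaining is not None and remaining <= 5:
    print(f"WARNING: ETL lag projected to reach 20 min in ~{remaining:.0f} intervals")
```

Alerting on the projected breach rather than the breach itself is what gives responders time to act before customers are affected.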

  • IT service desk asset management

As you get older, your body starts to break down. For instance, you’ll often know you’ll need to do something about minor knee pain before it becomes a full-blown knee replacement. The same kind of principle and thought process can apply to IT asset and service management (ITSM). IT service desks and sysadmins can use machine learning to identify assets that may need replacement before they completely fail. With machine learning, you can identify service degradation and take proactive action to replace the asset before it causes a major outage.

The future of incident management is now

You don’t need to wait 20 years to start taking advantage of MLOps and AIOps; the technology exists today. NOC, SRE, IT, DevOps and security managers can already start using machine learning in their daily processes to improve overall reliability and streamline incident management workflows. DevOps principles moved the industry toward proactive processes that no longer trade speed for reliability, and machine learning and AI take that one step further. The sooner you implement machine learning in your incident management workflows, the sooner the system starts learning from past incidents and the more resilient your applications and infrastructure become.

See how machine learning and highly collaborative incident management workflows consistently make on-call suck less. Sign up for a 14-day free trial of VictorOps or request a personalized demo to see it in action while also learning more about integrating predictive analytics into an alerting solution with our Splunk ITSI integration.
