
MTTR Zero: One Weird Trick Solves All The Problems

Matthew Boeckman April 01, 2017


We all agree the MTTR (mean time to repair/resolve) metric is core to any incident management practice. Today, we’re pleased to announce a new solution within the VictorOps toolset: automated incident resolution. This approach to resolving incidents will forever change the landscape of incident management for your team.

The basic problem, as has been explored in the past, is people. They have to be notified, they have to wake up, get to a computer, investigate, diagnose, repair, and ultimately resolve the initial alert or ticket. Clearly, this is a problem that begs for an automated solution, and that’s exactly what we’re excited to roll out.

A New Focus

By leveraging the incident automation engine (or Transmogrifier, for short), you can completely sidestep all those intermediate steps and just resolve an incident at the moment of its creation. This new way of thinking about MTTR significantly deprioritizes the pesky repair step of incident management. Ultimately, this is all about remediating issues and closing tickets, and the historical focus on resolving underlying problems has led us down a path with no happy destination.

Now, many will rightly point out that underlying problems will persist, and new incidents will inevitably be created. This is a reasonable first response to this proposed change, but as we’ll see, the Transmogrifier approach is web-scale, and can efficiently continue to resolve incidents when the root cause has not magically self-healed.

Automation in Action

Let’s dig into the Transmogrifier for a moment. As you may already be aware, the Transmogrifier allows you to modify fields within an incident before it is processed by routing or escalation rules. This is exciting, and has a variety of interesting use cases. The real breakthrough, though, came when I realized you can use the Transmogrifier to simply change the state of an alert to “OK”, “Recovered”, or whatever state is meaningful to resolving the incident (not the underlying cause).

[Image: mttr zero]

There you have it. Any incident coming in as CRITICAL is instantly re-assigned as OK. Relax! It’s all going to be OK!
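For the curious, here is roughly what a rule like that amounts to, as a minimal Python sketch. It is not the VictorOps rules engine or its API: the `Alert` and `Rule` types, the `state` field name, and `apply_rules` are all made up for illustration. It only shows the idea of rewriting a field on an alert before anything downstream (routing, escalation) gets to see it.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert payload; field names are assumptions."""
    entity_id: str
    state: str
    message: str


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: when match_field equals match_value, set set_field to set_value."""
    match_field: str
    match_value: str
    set_field: str
    set_value: str


def apply_rules(alert: Alert, rules: list[Rule]) -> Alert:
    """Return a copy of the alert with every matching rule applied, in order."""
    for rule in rules:
        if getattr(alert, rule.match_field) == rule.match_value:
            alert = replace(alert, **{rule.set_field: rule.set_value})
    return alert


# The "MTTR zero" rule described above: CRITICAL becomes OK.
MTTR_ZERO = Rule("state", "CRITICAL", "state", "OK")

incoming = Alert("db-01/disk", "CRITICAL", "disk 98% full")
rewritten = apply_rules(incoming, [MTTR_ZERO])

print(rewritten.state)    # the state has been rewritten
print(rewritten.message)  # the disk is still 98% full
```

Note that only the `state` field changes; the message, and whatever caused it, is carried through untouched.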

Let’s look at two different alerts, one with this revolutionary technique, the other still struggling along in the old, human-constrained, manual resolution way.

[Image: initial alert]

Here we have an alert, paging a person. Note that start time.


And here an acknowledgement, and a recovery. All of which are mediated by a silly person doing things, and contributing to the decline of our core metric, time to resolve (TTR). This incident took three minutes to resolve. This is 2017! Wake up, sheeple! We can do better.

This time, with automated incident management:

[Image: Automated Incident Management]

Bam! Time to Resolve? ZERO.

Take a minute and let that sink in.

[Image: Time to Resolve? ZERO]

With automation, we have finally slain the dragon of time to resolve. This incident has been resolved in the same second it was created. There are lots of things you can do to impact MTTR, but nothing is going to be as impactful as this.
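To put numbers on the comparison, here is a small Python sketch of the arithmetic. The timestamps are made up for illustration (only the three-minute and zero-second durations come from the two incidents above), and `time_to_resolve` and `mttr` are hypothetical helpers, not part of any VictorOps tooling.

```python
from datetime import datetime, timedelta
from statistics import mean


def time_to_resolve(created: datetime, resolved: datetime) -> timedelta:
    """Time to resolve is simply the gap between creation and resolution."""
    return resolved - created


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to resolve over a list of (created, resolved) pairs."""
    seconds = [time_to_resolve(c, r).total_seconds() for c, r in incidents]
    return timedelta(seconds=mean(seconds))


# Illustrative start time; the real timestamps are not given in the post.
start = datetime(2017, 4, 1, 9, 0, 0)

manual = (start, start + timedelta(minutes=3))  # a person acknowledged and recovered
automated = (start, start)                      # resolved in the same second it was created

print(mttr([manual]))             # 0:03:00
print(mttr([automated]))          # 0:00:00
print(mttr([manual, automated]))  # 0:01:30
```

The metric only looks at the two timestamps, so it has no way of knowing whether the underlying problem went anywhere.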

It’s About The People

We have all suffered the slings and arrows of poorly tuned thresholds, unactionable alerts, and nonsensical incident language. These incidents create fatigue, stress, and ultimately distract us from what we’re here to do, which is probably something to do with writing code.

Automated incident response is not a panacea by any means. You may find yourself leaning heavily on the CI/CD pipelines to fix some of the unresolved incidents, or at least to create new problems that deflect focus from the fact that the team has stopped fixing things. Either way, your phone, Slack channel, and inbox will be quiet.

By removing the painful distraction of incident alerting, diagnosis, and repair, we empower our teams to be happier, healthier, and less concerned with the complexity of operating a modern technology stack.

(** This is an April Fools’ Day joke, as I hope you have already ascertained. Do not do this in your production environment. I repeat, do not do this in your production environment. I can assure you, doing so will definitely impact your team.)

Want some time to check out the Transmogrifier for yourself? Sign up for your own 14-day free trial of VictorOps to apply automation to your incident management process. Try out our Transmogrifier, 100+ integrations, on-call scheduling, alert routing, and native chat, all in one place.

Let us help you make on-call suck less.
