Matthew Boeckman - April 01, 2017
We all agree the MTTR (mean time to repair/resolve) metric is core to any Incident Management practice. Today, we’re pleased to announce a new solution within the VictorOps toolset: Automated Incident Resolution. This approach to resolving Incidents will, we think, forever change the landscape of Incident Management for your team.
The basic problem, as has been explored in the past, is people. They have to be notified, they have to wake up, get a computer, investigate, diagnose, repair, and ultimately resolve the initial alert or ticket. Clearly this is a problem that begs an automated solution, and that’s exactly what we’re excited to roll out.
By leveraging the Incident Automation Engine (or Transmogrifier, for short) you can completely sidestep all those intermediate steps and just resolve an incident at the moment of its creation. This new way of thinking about MTTR significantly deprioritizes focus on the pesky repair step in Incident Management. Ultimately, this is all about closing tickets, and the historical focus on resolving underlying problems has led us down a path with no happy destination.
Now, many will rightly point out that underlying problems will persist, and new Incidents will inevitably be created. This is a reasonable first response to this proposed change, but as we will see, the Automated Incident Resolution approach is web-scale, and can efficiently continue to resolve Incidents whose root cause has not magically self-healed.
Let’s dig into the Transmogrifier for a moment. As you may already be aware, Transmogrifier allows you to modify fields within an incident before it is subsequently processed by routing or escalation rules. This is exciting, and has a variety of interesting use cases. The real breakthrough, though, came when I realized you can use Transmogrifier to simply change the state of an alert to “OK”, “Recovered” or whatever is meaningful to Resolve the Incident (not the underlying cause).
There you have it. Any Incident coming in as CRITICAL is instantly re-assigned as OK. Relax! It’s all going to be OK!
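For readers who prefer code to screenshots, the idea can be sketched in a few lines of Python. This is not the actual Transmogrifier rule syntax; the `transmogrify` function and the `state` and `message` field names are hypothetical, chosen only to illustrate rewriting a field before routing sees it.

```python
from typing import Any, Dict


def transmogrify(alert: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the alert with a CRITICAL state rewritten to OK.

    Hypothetical sketch: the field name "state" is an assumption,
    not the real alert schema.
    """
    rewritten = dict(alert)  # copy, so the incoming alert is left untouched
    if rewritten.get("state") == "CRITICAL":
        rewritten["state"] = "OK"  # resolves the incident, not the cause
    return rewritten


# Hypothetical incoming alert, rewritten before any routing rule runs.
incoming = {"state": "CRITICAL", "message": "disk full on db-1"}
outgoing = transmogrify(incoming)
```

Alerts in any other state pass through unchanged, and the disk in this made-up example is, of course, still full.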
Let’s look at two different alerts, one with this revolutionary technique, the other still struggling along in the old, human-constrained, manual resolution way.
Here we have an alert, paging a person. Note that start time.
And here an acknowledgement, and a recovery. All mediated by a silly person doing things, and dragging down our core TTR metric. This incident took three minutes to resolve. This is 2017! Wake up, sheeple! We can do better.
This time, with Automated Incident Resolution:
Bam! Time to Resolve? ZERO.
Take a minute and let that sink in.
With automation we have finally slain the dragon of time to resolve. This incident has been resolved in the same second it was created. There are lots of things you can do to impact MTTR, but nothing is going to be as impactful as this.
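The arithmetic behind that claim is easy to check. Here is a minimal sketch, assuming each incident is recorded as a (created, resolved) pair of timestamps; the `mean_time_to_resolve` helper and the timestamps themselves are made up for illustration, and only the three-minute and zero-second durations come from the two alerts above.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_resolve(incidents):
    """Average seconds between creation and resolution.

    `incidents` is a non-empty list of (created, resolved) datetime pairs.
    """
    durations = [
        (resolved - created).total_seconds()
        for created, resolved in incidents
    ]
    return mean(durations)


# Hypothetical start time; only the durations matter here.
start = datetime(2017, 4, 1, 9, 0, 0)

manual = [(start, start + timedelta(minutes=3))]  # a person did things
automated = [(start, start)]                      # resolved on creation

manual_mttr = mean_time_to_resolve(manual)
automated_mttr = mean_time_to_resolve(automated)
```

Note that the metric only sees the two timestamps. Nothing in it records whether the underlying problem was ever repaired.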
We have all suffered the slings and arrows of poorly tuned thresholds, unactionable alerts, and nonsensical incident language. These Incidents create fatigue, stress, and ultimately distract us from what we’re here to do! Which is probably something to do with writing code.
Automated Incident Resolution is not a panacea by any means. You may find yourself leaning heavily on the CI/CD pipelines to fix some of the unresolved incidents, or at least to create new problems that deflect focus from the fact that the team has stopped fixing things. Either way, your phone, Slack channel, and inbox will be quiet.
By removing the painful distraction of Incident alerting, diagnosis, and repair, we empower our teams to be happier, healthier, and less concerned with the complexity of operating a modern technology stack.
(** This is an April Fools’ Day joke, as I hope you have already ascertained. Do not do this in your production environment. I repeat, do not do this in your production environment. I can assure you doing so will definitely impact your team.)