Tara Calihman May 14, 2015 Monitoring & Alerting
I had the pleasure of hearing Ryan Frantz speak (and play the harmonica) at DevOpsDays Rockies a few weeks ago. His talk - “It’s 3am…do you know why you got paged?” - struck a chord and has stuck with me since then. Not just because it’s the exact problem that we have been working to solve at VictorOps but also because everything he said made complete sense.
The entire talk centered on why adding context to your alerts is a good thing. As everyone knows, there are no “good” alerts. We all expect that when we get an alert, something will be wrong. Add to that anxiety the heavy cognitive load of having to switch context. If that alert wakes you up, it’s going to take a lot of mental energy to figure out what’s going on…especially if that alert is not actionable.
Ryan also touched upon alert fatigue in his talk. We know from our 2014 State of On-call Report that almost 63% of all those who responded think alert fatigue is a real issue and have suffered from it. A few symptoms of alert fatigue include sleep loss, heightened anxiety, lack of context and a high volume of alerts providing low value. Being on-call is stressful enough without having to worry about alert fatigue.
So what makes for good context? A few things that Ryan mentioned: alert severity, when the last deployment occurred, what the customer impact was, the current state vs. threshold, downstream effects of the alert, author’s intentions and including a link to a runbook.
This was one of the main takeaways from Ryan’s talk and one that I loved. Here at VictorOps, with the introduction of our Transmogrifier feature, you can now include runbooks, graphs and links to other documentation right with the alert. The on-call person has the resources they need to solve the problem right at their fingertips…a perfect example of making your incident management tool work for you.
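To make the list of context fields concrete, here is a minimal sketch in Python of what a context-rich alert might carry. The `AlertContext` class, the `render_alert` function, every field name, and the example values (including the runbook URL) are hypothetical and purely for illustration. This is not the VictorOps or nagios-herald API, just one way the items Ryan mentioned could travel together with an alert.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertContext:
    """Hypothetical container for the context fields mentioned in the talk."""
    summary: str
    severity: str
    last_deployment: str
    customer_impact: str
    current_value: float
    threshold: float
    runbook_url: str
    author_intent: str = ""
    downstream_effects: List[str] = field(default_factory=list)


def render_alert(ctx: AlertContext) -> str:
    """Format the context as a plain-text alert body."""
    lines = [
        f"[{ctx.severity.upper()}] {ctx.summary}",
        f"Current value: {ctx.current_value} (threshold: {ctx.threshold})",
        f"Customer impact: {ctx.customer_impact}",
        f"Last deployment: {ctx.last_deployment}",
    ]
    # Optional fields are only shown when the alert author filled them in.
    if ctx.downstream_effects:
        lines.append("Downstream effects: " + ", ".join(ctx.downstream_effects))
    if ctx.author_intent:
        lines.append(f"Why this alert exists: {ctx.author_intent}")
    lines.append(f"Runbook: {ctx.runbook_url}")
    return "\n".join(lines)


# Made-up example values; the URL is a placeholder, not a real runbook.
example = AlertContext(
    summary="Disk usage high on db01",
    severity="critical",
    last_deployment="2 hours ago",
    customer_impact="Checkout may slow down",
    current_value=92.5,
    threshold=90.0,
    runbook_url="https://example.com/runbooks/disk-usage",
    author_intent="Catch a full disk before the database stops writing",
    downstream_effects=["replication lag", "failed backups"],
)

print(render_alert(example))
```

The point of the sketch is that the person who gets paged at 3am sees the severity, the current state against the threshold, and the runbook link in one place, without having to go hunting for any of it.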
My ears perked up when Ryan mentioned the OODA loop. If you don’t know, OODA stands for: Observation, Orientation, Decision, Action. Most monitoring systems live between the observation and orientation steps. According to Ryan, orientation is the most important part of the loop because this is the phase where we get to augment what we observed with context, and the one that shapes the way we observe.
Ryan ended his talk by mentioning the tool he built to make Nagios alerts more effective. It’s called nagios-herald, it’s awesome and it’s exactly what we’re trying to do for alerts of all kinds.
If you haven’t had a chance to see Ryan give this presentation on contextual alerts, I highly recommend it. With or without the harmonica opener, it’s spot on. And if you just want to hear more about Ryan’s thoughts on DevOps, be sure to check out our interview with him on that very topic.