Today, I’d like to build on my earlier post, and expand on the third fear that faces folks joining an on-call rotation. In that post I outlined some of the basics around monitoring and alerting practices for developers or other non-traditional ops folks. This topic begs its own post, as the idea has transformed how Incident Management works.
I remember the first time I read the phrase “Actionable Alerts”, from this post at Ted Dzuiba’s blog way back in March of 2011. If you haven’t already, go read it. Someday I hope to memorialize it in the DevOps Hall of Fame. The post made its way around my circles, as we all boggled at this crazy idea that we might say no to some alerts. In retrospect, and particularly given the wide adoption of the idea, it seems absurdly obvious. So why is something so obvious still a discussion?
The simple answer is that it’s easier to just wake up everyone than to devote time and iteration to tuning and refining these systems. The issue of actionability is inextricably tied to the team, the culture, and the environment.
- Fragile environments tend to create a culture of paranoia.
- Blameful culture breeds un-actionable wakeups.
- Stretched teams err on the side of over-alerting.
All of these examples share a common root: Fear.
Without getting deep and psychological about Fear, I want to talk about it in this context of actionable alerts. In my examples above, the real root of over-chirpy alerting systems is fear. Particularly, I think, Fear Of Missing Out.
A blameful culture will inevitably skew to criticisms of how quickly others responded to an alert, and how effectively they remedied the problem. As such, the tendency to alert early and often will sink in – leading to a ton of alert fatigue, and meaningless interruptions to the day and night.
Poor design, overstretched infrastructure, or few iterations spent on bug fixing can all contribute to a fragile environment. A light PTSD sinks in in those situations – you live alert to alert in a kind of reactionary haze. But you can’t find a way out of it, and the solution is to “get ahead of the problem” with.. You guessed it: more alerts! Spoiler warning: un-actionable alerts is not what you should focus on.
Stretched teams may be the hardest to solve for, because really, what team isn’t stretched? For many teams a belief that incidents will take over-long to remedy encourages early alerting. I once talked to a guy who had 24×7 wakeup on 75% disk utilization on his servers. Because once a cascading series of failures triggered by a disk full condition caused a 3 hour outage. Reactionary over-alerting will not make anything better.
For developers, or anyone joining on-call (and probably those on-call already) our next big fear motivator is Fear Of Losing Sleep. Particularly, fear of losing sleep over trivial or unactionable things. If I summarized the dozens of conversations with developers I’ve had over the years, I’d say the #1 concern is that they are going to be expected to do things on-call that they don’t know how to do. Several comments from the 2017 State of On-Call survey further illustrate this concern:
“We have few pages during a rotation week. We are only routers though, and need to connect with other teams to actually do anything the needs to be fixed.”
“I get paged for issues that I can’t resolve; most of my time is either researching a problem that is transient or non reproducible, or contacting a vendor.”
“I’m on call for everything from infrastructure issues (that is ok since I’m the owner of that) to, much more often, application issues for software I don’t write and I don’t own (and most of the times our suggestions for improvements are ignored)”
Unpacking the problem, we can see that an alert’s actionability has to exist on two dimensions. First, the alert must require an action, differentiated from something that’s merely informational. Second, the alert must route to someone who has the access, permission, and skills to be able to adequately perform said action.
If either of those things are untrue, you’re going to have a bad time.
I think one of the most common examples of this being done badly is setups where the Ops team receives all incidents, and is asked to “just escalate” when the source of the alert is outside their domain. CPU load is high on a host, the offending process is the application itself, and Ops needs Devs to triage and repair. Why did Ops get involved at all?
A different example illustrating common frustration running the other way: Dev team receives a page indicating application health checks are failing. Ops has restricted Dev from tools or systems necessary to action the alert (root, shell access, access to Splunk, etc). Great! Guess we’ll wake an admin to fix the problem.
Depending on where your team specifically falls on this prickly issue, there are some simple and easy to implement things you can do to get better.
First and foremost – ensure you have some implementation of “teams” configured in your Monitoring or Incident Management systems. Teams alone is not enough though, you need to ensure that alerts are properly routed to those teams.
I touched on this briefly in the ways and means post, but let’s dive in a bit deeper. Let’s assume we’ve got four functional teams, and an escalation team comprised of architects and leads from each of the functionals (and the default “everyone” team).
Here’s our teams, and our Routing Keys as configured in VictorOps:
Then, in Nagios, we define new contact groups that match the route keys above:
alias Systems Team
alias Network Team
And finally, tie a host or service check to the proper contact group/route key/team to ensure that the correct team gets an alert for a specific, actionable incident related to their skills or roles.
Keep in mind this is a simple example to demonstrate a deeply meaningful approach in Monitoring and Incident Management: functional grouping and routing. Whatever your Monitoring and Incident Management systems actually look like, aligning systems and teams is key to keeping Alerts Actionable.
The Process is the Way
If you accept the myriad of things in-play, you must accept that there is no single, static answer to the problem. Similarly, you have to recognize that where your team nests is only a point-in-time. Every Postmortem should include some discussion of the alerting and routing associated with the incident:
- Was it Actionable? or just Informational?
- Did it get to the right team?
- Did they have tools, access, and skills necessary to action a remedy?
Inevitably that answer will occasionally come up “no”, and that’s where you, the on-call team, must embrace personal action to improve the system. Continuous improvement only happens when we communicate in the clear, and devote time to iterations on tuning and refining these systems.