Will La - September 27, 2016
The more I speak with people who deal with a flood of incoming alerts, the more I see why the traditional on-call role has such a high rate of burnout. People in the operations role are expected to monitor systems and maintain nearly 100% uptime. 99.9999% at the very least.
If each monitoring system has its own fancy version of simple alerting, then in the spirit of not wanting to miss a beat, the person watching the systems receives simple alerts from a multitude of solutions.
Now, the person watching the systems has to decipher the importance of these incessant alerts. Which are critical? Which are less urgent? Which are related to each other? Enter ridiculous noise levels and alert fatigue.
Incident handlers need the ability to prioritize which of these out-of-context alerts actually means something. Their job depends on these mental gymnastics, which often leads straight to fatigue and burnout.
When someone flees to a new job to alleviate the pain, they often discover that the grass is not greener on the other side. Sometimes the old, terrible situation was an annoying pet parrot compared to the firestorm of alert bombs they get in their new position. Sub-alert shrapnel floods their inbox, coming from tools they have never heard of.
Why do we allow this?
I’m writing this blog post as a public service to reduce alert noise. The goal is to make alert noise reduction an ongoing effort. When you allow alerts to continuously make noise, you numb the team and distract them from taking the signal seriously and focusing on the issue.
Lessening the noise benefits your team’s health, your organization’s costs, and your own education, since it lets you see what’s really going on in your technology stack. The remedy: bucketing associated alerts into a single incident via the concept of alert aggregation.
Start by looking at alert data, and begin to think about how you can group alerts to make them more manageable. Review the payload information and the alert field values to find trends, matches, or similarities in the strings.
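As a rough illustration of that first review pass, here is a minimal Python sketch that counts how often each field value shows up across a batch of alerts. The alert payloads and the field names (`source`, `host`, `check`) are made up for the example; your tools will use their own.

```python
from collections import Counter

# Hypothetical alert payloads; the field names are illustrative only.
alerts = [
    {"source": "nagios", "host": "10.0.0.101", "check": "disk"},
    {"source": "nagios", "host": "10.0.0.102", "check": "disk"},
    {"source": "monitor-a", "host": "10.0.0.101", "check": "cpu"},
    {"source": "nagios", "host": "10.0.0.101", "check": "disk"},
]


def field_frequencies(alerts, field):
    """Count how often each value of `field` appears across the alerts."""
    return Counter(alert.get(field, "<missing>") for alert in alerts)


by_source = field_frequencies(alerts, "source")
by_check = field_frequencies(alerts, "check")

# Values that dominate the counts are good candidates for grouping.
print(by_source.most_common())
print(by_check.most_common())
```

Fields whose values repeat heavily are usually the first place to look for a grouping key.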
From there, begin thinking about how to aggregate alerts into the smallest instances where alerts are related. Sometimes monitoring tools can handle this work with some fine tuning. Other times, you need to look across tools and the frequency of different alerts to tell a separate part of the incident’s story.
You may realize that the individual monitoring tools are not the right place to tune alert information. You may also conclude that aggregating alerts into a single incident would heavily reduce alert noise. If so, you are ready to begin the aggregation process.
Some people have found ways to write and maintain custom Python scripts that use an API to ingest alerts and include logic that triggers specific actions based on identified fields. Other people have simply deployed a system like VictorOps.
VictorOps automatically aggregates alerts into a single incident based on a field that we call the Entity ID. This Entity ID is automatically generated based on a multitude of factors. We key off of this information to aggregate the alert.
If an alert comes in with an Entity ID that does not match any open incidents (those that are triggered or acknowledged, not resolved), then VictorOps creates a new incident with this Entity ID. If another alert with the same Entity ID is created, we aggregate the alert with the existing open incident to reduce noise. And if the alert comes from a very noisy service, we continue to aggregate alerts as long as the incident is in an open state.
Through this aggregation process, I’ve seen teams quickly snap to attention at the first alert of hundreds, simply because they were all associated with a single incident. Imagine how much easier it would be if your alerts were aggregated like this.
Once the incident is resolved, the Entity ID is now free and available. The next time an alert with that Entity ID arrives, it will appear as a new incident. Simple as that.
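To make that lifecycle concrete, here is a small Python sketch of the idea. This is not VictorOps code; the `IncidentAggregator` class, its methods, and the `entity_id` payload key are hypothetical names that only mirror the behavior described above: a new Entity ID opens an incident, a repeat joins the open one, and resolving frees the ID.

```python
class IncidentAggregator:
    """Toy model of aggregating alerts by Entity ID (illustrative only)."""

    def __init__(self):
        # Maps entity_id -> list of alerts for incidents that are still open
        # (triggered or acknowledged, not resolved).
        self.open_incidents = {}
        self.created = 0

    def ingest(self, alert):
        """Add an alert; return True if it opened a new incident."""
        entity_id = alert["entity_id"]
        if entity_id in self.open_incidents:
            # Same Entity ID as an open incident: aggregate, no new noise.
            self.open_incidents[entity_id].append(alert)
            return False
        self.open_incidents[entity_id] = [alert]
        self.created += 1
        return True

    def resolve(self, entity_id):
        """Close the incident, freeing its Entity ID for future alerts."""
        return self.open_incidents.pop(entity_id, None)


agg = IncidentAggregator()
results = [
    agg.ingest({"entity_id": "db01/disk", "message": f"disk full #{i}"})
    for i in range(3)
]
resolved = agg.resolve("db01/disk")
reopened = agg.ingest({"entity_id": "db01/disk", "message": "disk full again"})
```

In this run, three alerts with the same Entity ID produce one incident, and the alert that arrives after the resolve opens a second one.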
Many VictorOps customers leverage the transmogrifier within the Incident Automation Engine to transform multiple alerts with different Entity ID values into a new value. They use the new value to match and aggregate alerts across different sources or types.
Here are some examples:
Hosts .101 through .104 are part of a virtualized cluster, so we want to aggregate them into a single incident. Each host’s alert still arrives individually, so we can see its source within the payload, but we manage them all as a single incident.
Each time Port 443 acts up, it generates a new alert because the Entity ID keeps changing due to a variable timestamp value. We use the Incident Automation Engine’s transmogrifier to transform these Entity IDs into a consistent value via wildcard matching on the variable timestamp in order to remove it.
We use Nagios for one very specific service, so we aggregate all alerts from Nagios into a single incident that the dedicated DevOps team can own.
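The three examples above boil down to rewriting the Entity ID before aggregation happens. The Python sketch below shows that idea with plain regular expressions. It is not the transmogrifier's actual rule syntax, and the function name, payload fields, and replacement values (`nagios/all`, `virtual-cluster`) are assumptions chosen for illustration.

```python
import re


def normalize_entity_id(alert):
    """Return a consistent Entity ID for an alert (illustrative rules only)."""
    # Everything from Nagios is owned by one team, so use one shared ID.
    if alert.get("source") == "nagios":
        return "nagios/all"

    # Hosts ending in .101 through .104 belong to the same virtualized cluster.
    if re.search(r"\.10[1-4]$", alert.get("host", "")):
        return "virtual-cluster"

    # Strip a trailing numeric timestamp (10 or more digits) so that repeat
    # alerts for the same problem share one Entity ID.
    return re.sub(r"[-_/]\d{10,}$", "", alert["entity_id"])


cluster_id = normalize_entity_id(
    {"entity_id": "10.0.0.102/cpu", "host": "10.0.0.102", "source": "monitor-a"}
)
port_id = normalize_entity_id(
    {"entity_id": "port443-1474934400", "host": "10.0.0.7", "source": "monitor-b"}
)
nagios_id = normalize_entity_id(
    {"entity_id": "web01/http", "host": "10.0.0.9", "source": "nagios"}
)
```

The original host and timestamp stay in the payload, so nothing is lost; only the key used for matching changes.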
I still stumble across VictorOps users who didn’t know about this functionality at first. They initially used VictorOps for on-call routing, but they noticed the noise reduction from this built-in feature right away. Others are surprised when I tell them about the aggregation, and they wonder why we don’t put this on our front page.
Well, it exists. It was built to help people have a better experience when dealing with alert noise. Be sure to dive into how this works. Most important, get the word out that alert fatigue can be addressed! People don’t have to live this way!
Follow Will La on Twitter at @WillLaThoughts.