Will La | March 30, 2017 | Monitoring & Alerting
Monitoring tools are great. Here at VictorOps, we are constantly rolling out new integrations with monitoring tools, and without them VictorOps wouldn’t have much to work with. They enable you to check system health every few minutes, and they often alert you in the same way: by sending an email or notification every time a check finds a failure.
If you haven’t set up alert dependencies in your monitoring systems, this can become noisy. If your monitoring system checks health every five minutes and nobody acknowledges the alert, a 25-minute “outage” produces five separate alerts. Multiply that by related configurations, monitoring rules, and services, then add erroneous alerts, and the noise grows quickly. Oh, and that is for only one monitoring tool; companies tend to have multiple monitoring tools in place.
Using the logic above, you can find yourself receiving 1000 alerts a day. In some organizations, people watch a shared alert-email inbox as eye-numbing subject lines flow in constantly. It sounds excessive, but I have seen worse scenarios with my own eyes. VictorOps helps customers who have this growing problem.
If you’re not a VictorOps user, the concepts I share here are still applicable. But know that this functionality is baked into VictorOps right out of the box.
In our example, we are going to use our initial 1000 alerts for easier math. We’ll work our way down to only ten alerts.
Alert aggregation introduces the concept of an alert bucket that aggregates multiple alerts into a related event. Think about it. An outage is a single event with a start and a finish. You may find out about the event via alerts from your monitoring tools, which are designed to ask whether the system is healthy, report back if the response is undesirable, and then repeat the check after x minutes. Those repeated alerts may refer to the same original issue.
So we aggregate these alerts into “buckets” that we call incidents. When the first alert hits our system, it opens a single incident, and future related alerts simply aggregate into that incident. The alerts will continue to aggregate into the incident, which enables you to focus on managing the incident rather than being distracted by the alerts themselves.
VictorOps only pages out at the incident level (upon being opened or rerouted) and not just when an alert is being yappy. This default behavior in VictorOps greatly reduces alert noise.
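To make the bucket idea concrete, here is a minimal sketch in Python. It is not how VictorOps implements aggregation; the `Aggregator` class, the `entity_id` key used to decide that two alerts are related, and the `pages_sent` counter are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """A bucket that collects related alerts (hypothetical structure)."""
    entity_id: str
    alerts: list = field(default_factory=list)


class Aggregator:
    """Opens one incident per entity and pages only when it is opened."""

    def __init__(self):
        self.open_incidents = {}  # entity_id -> Incident
        self.pages_sent = 0

    def receive(self, entity_id, message):
        """Record an alert; return True only if it opened a new incident."""
        incident = self.open_incidents.get(entity_id)
        opened = incident is None
        if opened:
            incident = Incident(entity_id)
            self.open_incidents[entity_id] = incident
            self.pages_sent += 1  # page once, at incident creation
        incident.alerts.append(message)
        return opened

    def resolve(self, entity_id):
        """Close the incident so the next alert opens a fresh one."""
        self.open_incidents.pop(entity_id, None)


# Four repeated check failures for the same (made-up) entity.
aggregator = Aggregator()
results = [aggregator.receive("db01/disk", f"check failed #{n}") for n in range(4)]
```

In this sketch the first alert opens the incident and pages, while the next three only add to the bucket, which is the four-alerts-to-one-page behavior described above.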
To put numbers on this, we typically see about three to five alerts aggregating into a single incident. I have seen incidents with over a hundred alerts, but that’s an extreme case. So let’s say we average four alerts per incident and you only “notify/alert” the user at the incident’s initial creation. That 4:1 ratio gives us a 75% alert-noise reduction right away, from 1000 alerts to 250 notifications.
Routing is straightforward; instead of having those 250 alerts flow to a single inbox or group, let’s route them to the groups or people who are best equipped to handle them. You will want to build a structure where you can set up “routes” and then subscribe teams to these routes.
Once set, you can begin sending certain incidents along specific routes to go to their designated teams. The teams will then get alerts only on what is important to them, and they will not be bothered by alerts on items that are unimportant to their role or unrelated to their expertise. This completes our second phase of alert-noise reduction.
If we were to configure ten routes and the 250 incidents were spread evenly across them, each route would receive 25 incidents that “notify/alert” its team members.
You can see how this is done in VictorOps by using our Routing Keys.
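As a rough illustration of routes and team subscriptions, the sketch below uses made-up route names, team names, and a catch-all fallback. It only mirrors the idea described here and is not the VictorOps Routing Keys API.

```python
from collections import Counter

# Hypothetical setup: ten routes, each with one subscribed team.
ROUTE_SUBSCRIPTIONS = {f"route-{i}": [f"team-{i}"] for i in range(10)}
FALLBACK_TEAM = "catch-all"  # assumption: unknown keys go to a default team


def teams_for(routing_key):
    """Return the teams subscribed to a route, or the fallback team."""
    return ROUTE_SUBSCRIPTIONS.get(routing_key, [FALLBACK_TEAM])


def pages_per_team(incidents):
    """Count how many incident notifications each team receives."""
    counts = Counter()
    for incident in incidents:
        for team in teams_for(incident["routing_key"]):
            counts[team] += 1
    return counts


# 250 incidents spread evenly across the ten routes, as in the example.
incidents = [{"id": n, "routing_key": f"route-{n % 10}"} for n in range(250)]
counts = pages_per_team(incidents)
print(counts["team-0"])  # 25
```

With an even spread, each team sees 25 incidents instead of all 250; real traffic will of course be lumpier than this.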
The final phase of alert-noise reduction is the ability to classify your incidents, for example as Critical, Warning, or Info.
The desired behavior for Critical incidents is very straightforward. It’s also nice to have the ability to build a different workflow and a separate paging format for Warning-level incidents. There’s no reason we should be alerting on Info alerts (storage at 80%... anybody???). Now we are able to focus even more on what is most important, and let the non-critical work be handled in a less urgent manner.
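One way to picture classification is as a small policy table that maps each severity to an action. The severity names come from the text above; the action names (`page`, `notify`, `log`) and the `action_for` helper are assumptions made for this sketch.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical policy: what to do when an incident of each severity opens.
POLICY = {
    Severity.CRITICAL: "page",    # page the on-call person right away
    Severity.WARNING: "notify",   # separate, less urgent workflow
    Severity.INFO: "log",         # record it, but alert nobody
}


def action_for(severity):
    """Look up the action for an incident's severity."""
    return POLICY[severity]


print(action_for(Severity.INFO))  # log
```

Keeping the policy in one table makes it easy to give Warning incidents their own workflow without touching how Critical incidents page.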
In our example we’ll assume that 40% of your alerts are truly critical, and the rest are either warning or noise. This brings our 25 alerts down to 10 alerts that really matter, that belong to me, and that page me only when the incident is first opened.
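The whole example can be checked with a few lines of arithmetic. The function name and parameters below are made up for this sketch; the numbers are the ones used throughout the post.

```python
def noise_funnel(raw_alerts, alerts_per_incident, routes, critical_percent):
    """Apply the three phases in order and return each intermediate count."""
    incidents = raw_alerts // alerts_per_incident         # phase 1: aggregation
    per_route = incidents // routes                       # phase 2: routing
    critical = per_route * critical_percent // 100        # phase 3: classification
    return incidents, per_route, critical


print(noise_funnel(1000, 4, 10, 40))  # (250, 25, 10)
```

Integer division keeps the example tidy; with real data the ratios will not divide this cleanly.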
There you go. Let’s end this madness.