We saw this image pop up on Meme Generator last week. While we were honored to be included in our first meme, the message of the meme was a little disheartening.
Clearly, one of our customers needs help sorting through the noise and getting to what’s important. The proliferation of great monitoring tools in the past few years has made it almost trivially easy to put instrumentation on every moving part in your platform or application. VictorOps is partnering with new monitoring tools every day, adding to the amount of data in your timeline.
But how much is too much? What’s the best strategy for capturing all the data you need while avoiding alert fatigue?
Achieving a peaceful night’s sleep doesn’t necessarily mean reducing the amount of data you’re gathering, or giving up on uptime. It is an iterative process of identifying which data points represent actionable issues, which are redundant or overlapping, and which are purely informational and never actionable.
In the past, we’ve talked about the benefit of holding a “handoff meeting” when on-call rotates. This gives the person coming on-call a heads-up on issues they’re likely to see, and the whole team a chance to hear about issues they may have missed during the previous on-call shift. This meeting is also a good time to look at post-mortem reports and discuss as a group whether any alerts or incidents shouldn’t have been created. Is there a threshold in need of adjustment? Is there a service that reliably goes offline for 30 minutes a day but doesn’t affect the rest of the platform? The handoff is the moment to make those adjustments.
Other ideas for avoiding alert fatigue:
If your monitoring software supports it, consider using “time periods”. This feature, available in Nagios, Icinga, and other tools, lets you send alerts for certain services only during certain hours. For example, you might want to suppress alerts from your staging or development environments after hours.
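To make the idea concrete, here is a minimal Python sketch of a time-period check. It is not Nagios or Icinga configuration; the `ALERT_WINDOWS` table, the environment names, and the 9-to-6 weekday window are all assumptions chosen for illustration.

```python
from datetime import datetime, time

# Hypothetical policy: environment -> (start, end) of the daily window in
# which alerts are sent. Environments not listed here alert around the clock.
ALERT_WINDOWS = {
    "staging": (time(9, 0), time(18, 0)),
    "development": (time(9, 0), time(18, 0)),
}


def should_alert(environment: str, now: datetime) -> bool:
    """Return True if an alert from this environment should be sent at `now`."""
    window = ALERT_WINDOWS.get(environment)
    if window is None:
        # No time period configured (e.g. production): always alert.
        return True
    start, end = window
    # Assumed policy: restricted environments alert on weekdays only,
    # and only inside the configured window.
    is_weekday = now.weekday() < 5
    return is_weekday and start <= now.time() < end
```

With this policy, a staging alert on a weekday evening is suppressed, while a production alert at 3 a.m. on a Saturday still goes out.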
Use Teams and Incident Routing in VictorOps. If you know that someone from the database team will need to respond every time service “x” goes critical, then there may not be any reason to bother the on-call ops person. Use Routing Keys to get alerts about specific services to the right people, and away from everyone else.
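Conceptually, routing by key is a lookup with a fallback. The Python sketch below illustrates the logic only; it does not use the VictorOps API, and the `ROUTES` table, the team names, and the `routing_key` field on the alert dictionary are hypothetical.

```python
# Hypothetical routing table: routing key -> team that should be notified.
ROUTES = {
    "database": "database-team",
    "network": "network-team",
}

# Assumed fallback: anything without a matching key goes to on-call ops.
DEFAULT_TEAM = "ops-on-call"


def team_for(alert: dict) -> str:
    """Pick the team for an alert based on its routing key."""
    return ROUTES.get(alert.get("routing_key"), DEFAULT_TEAM)
```

An alert tagged with the `database` key reaches only the database team, and untagged alerts still land with the on-call ops person rather than being dropped.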
Focus on alerting for functional and integration tests. Lindsay Holmwood discusses this in an excellent blog post that compares operations monitoring to lessons learned in the health care industry. Maybe it’s okay that one of your servers is running a high load average, as long as you’re still processing user requests at full speed. Disk space, memory and CPU utilization, bandwidth: these are all metrics that are critical to follow, but it might make sense for your organization to keep that data in Graphite or another tool where you can watch the long-term trend, and save the alerts for when something is really not working.
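One way to picture this split is a small triage step that pages only on failed functional checks and sends resource metrics to a trending tool. The Python sketch below is an illustration under assumptions: the `CheckResult` type, the `kind` labels, and the check names are invented for the example and do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    kind: str  # assumed labels: "functional" or "resource"
    ok: bool


def triage(results):
    """Split check results into names to page on and names to record for trending."""
    page, record = [], []
    for result in results:
        if result.kind == "functional" and not result.ok:
            # Users are affected: this is worth waking someone up for.
            page.append(result.name)
        elif result.kind == "resource":
            # Load, disk, memory: keep for long-term graphs, do not page.
            record.append(result.name)
    return page, record
```

Under this policy a high load average never pages by itself, but a failing checkout or login test does.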
What works for one organization won’t always work for another. But there’s room in every organization to optimize monitoring and improve the signal-to-noise ratio. Fewer alerts mean fewer distractions, and the ability to stay focused on the issues that really matter.