Tara Calihman - November 11, 2015
Not to give anything away from our recent State of On-call survey (results to be released soon!), but the single most significant issue facing on-call engineers today is alert fatigue. The majority of respondents said that alert fatigue is both an issue in their organization and the one causing the most pain around being on call.
In addition to there just being too many alerts, there are other reasons that unactionable alerts may be causing unhappiness:
– Inappropriate alert thresholds
– Alerts that are only relevant during certain times of day
– Alerts that duplicate messages coming from another monitoring tool
– Alerts that your department doesn’t have the ability to change
– Noisy development environments
– Planned maintenance
– Patch cycles
– Downstream alerts
VictorOps has a feature built for taming alerts like these. We call it the Transmogrifier. This feature (named after a Calvin & Hobbes creation) has serious transformational power and allows you to append remediation information to an alert as it comes in. The Transmogrifier assists with the hard job of maintaining continuous documentation, along with routing alerts to the right person every time. For unactionable alerts in particular, it can also help quiet the noise.
Based on key-value pairs within the payload of an alert, users can choose to ‘silence’ that alert by turning it into an ‘info’ or ‘warning’ alert. No more waking up engineers in the middle of the night for something that isn’t actionable.
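To make the idea concrete, here is a minimal sketch in Python of matching on key-value pairs in an alert payload and downgrading the alert. It is not the VictorOps implementation or API: the `Rule` class, the `transmogrify` and `should_page` functions, and the field names (`host`, `check`, `message_type`, `note`) are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


# Hypothetical rule model for illustration only; not the VictorOps API.
@dataclass
class Rule:
    field: str        # payload key to inspect
    pattern: str      # shell-style wildcard matched against the value
    set_fields: dict  # key-value pairs written onto the alert on a match


def transmogrify(alert: dict, rules: list) -> dict:
    """Return a copy of the alert with every matching rule applied in order."""
    result = dict(alert)
    for rule in rules:
        if rule.field not in result:
            continue  # a rule cannot match a key the payload does not have
        if fnmatchcase(str(result[rule.field]), rule.pattern):
            result.update(rule.set_fields)
    return result


def should_page(alert: dict) -> bool:
    # Assumption: only CRITICAL alerts page someone; INFO and WARNING
    # alerts are still recorded but do not wake anyone up.
    return alert.get("message_type") == "CRITICAL"


# Example rules (made up): silence a noisy dev environment, and attach
# remediation notes to a disk check while downgrading it to a warning.
rules = [
    Rule("host", "dev-*", {"message_type": "INFO"}),
    Rule("check", "disk_usage", {"message_type": "WARNING",
                                 "note": "See the disk cleanup runbook"}),
]

incoming = {"host": "dev-web-1", "check": "cpu", "message_type": "CRITICAL"}
quieted = transmogrify(incoming, rules)
print(quieted["message_type"], should_page(quieted))
```

The sketch returns a modified copy rather than dropping the alert, which mirrors the point of the feature: the alert is kept for the record, only its severity changes.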
Here’s where the value of the VictorOps timeline comes in: we’re not getting rid of the notification altogether – you’ll still see it in your timeline as an informational item for forensics – BUT it’s not going to page anyone. Hooray!
We’ve got big plans for this feature, but for now we’ve created a few tools and encourage a few processes to make quieting alert noise a regular practice for your team.
– During on-call handoff, take a look at the Incident Frequency Report and see which alerts paged most frequently during the last on-call shift. Is there something you can do at the source to quiet the false alarm? If not, create a transmog rule!
– Have on-call users report any unactionable alerts as part of your on-call handoff. Share with the team & work to update documentation.
– Reward your team for communication and keeping a clean alert stream.
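If you want to run the same kind of review on your own exported data, the frequency check in the first step can be sketched as a simple count. This is an illustration only, not how the Incident Frequency Report is built: the `top_paging_alerts` function, the `alert` and `paged` keys, and the sample shift data are all made up.

```python
from collections import Counter


def top_paging_alerts(incidents, limit=3):
    """Count how often each alert paged someone; return the most frequent."""
    counts = Counter(i["alert"] for i in incidents if i.get("paged"))
    return counts.most_common(limit)


# Made-up incidents from one on-call shift.
shift = [
    {"alert": "disk_usage on db-1", "paged": True},
    {"alert": "disk_usage on db-1", "paged": True},
    {"alert": "disk_usage on db-1", "paged": True},
    {"alert": "cert expiry on lb-2", "paged": True},
    {"alert": "cert expiry on lb-2", "paged": True},
    {"alert": "deploy finished", "paged": False},
    {"alert": "api latency", "paged": True},
]

for alert, count in top_paging_alerts(shift):
    print(f"{count}x {alert}")
```

Anything near the top of this list that nobody acted on is a candidate for a fix at the source or, failing that, a transmog rule.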
VictorOps is dedicated to providing its users with tools and best practices for continuous improvement. Your engineers build and maintain systems; we build happy engineers.