Remember the WUPHF app from “The Office” Season 7 Ep 9? When notifications first took off, it seemed for a long time like the best path for any alert was to send it to everyone in every possible way. Woof! Even today, the typical NOC operates on the “Woof!” notification strategy, where everyone in the company receives a notification for every alert through every notification method, which most of the time just means email. This one-to-many strategy for alerting and response is broken, and we all know it.
Setting up email groups where anyone can subscribe and see what’s going on at a glance seems convenient, but it’s problematic for several reasons:
1) Data blindness. Even the most diligent of us become so familiar with data inputs that we grow immune to them. This happens because the alerts cry wolf: most of the time, what shows up is either irrelevant to you or purely informational. Catch-all methods demand over-inclusive alert triggers so that nothing is missed, which means the relevant alerts arrive buried among the rest.
2) It puts the value burden on the user. The consumer of the information has to decide what’s important or not. It’s a lot to expect that someone, in a sea of noise, will always be able to accurately determine the criticality of each alert and know when it’s something that needs to be acted on.
3) It doesn’t correlate with downstream activity. You can embed a lot of initial context in an email alert, but as soon as someone acknowledges the email, the response activity that follows is lost. Even worse, it’s hard to know what the acknowledgment was. In modern infrastructure and applications, what happens after acknowledgment can substantially change the context of the incident.
That said, email still has a place in incident management, for a couple of reasons:
History. In many organizations, email is still a primary system of record, even though it wasn’t designed for that. This means that email can serve as a source of documentation, history and audit trail for everything that’s going on. So, having incident data in an email system is useful for record-keeping, just not for real-time action.
Stakeholders. Having inboxes for those who are impacted by incidents but shouldn’t be brought into the firefight is useful for improving transparency. It brings awareness to stakeholders without the expectation that they need to engage. However, it still suffers from the primary issue of data blindness.
There’s a good chance that even the most advanced tech companies have incidents ending up in email inboxes in a one-to-many fashion. But, it’s not overly effective and certainly shouldn’t be the goal for any organization. The better approach to IT alerting is one-to-right.
One-to-right shouldn’t be confused with one-to-one. Addressing incidents in modern IT usually involves several experts. One-to-right means that your incident response strategy and tool are designed to get alerts to the right on-call person or team and give them the chance to quickly add the right responders to the firefight.
The way to do this is to first have a strategy. No tool is going to make you successful out-of-the-box without an understanding of how your teams are structured, how team member expertise relates to your technology stack and how you define alert rules and escalation policies for finding the right people at the right time.
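As a rough illustration of what such a strategy can look like once it is written down, here is a minimal Python sketch of one-to-right routing. Everything in it is hypothetical and invented for this example (the service names, responder names, delays and the `route` function); it is not the API of any particular alerting product. It assumes each service maps to an escalation policy, that informational alerts are recorded but page no one, and that each escalation step adds a responder only while the alert remains unacknowledged.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str   # which part of the stack raised the alert
    severity: str  # "critical", "warning" or "info"
    message: str


@dataclass
class EscalationPolicy:
    # Ordered (delay_minutes, responder) steps. A step applies once the
    # alert has gone unacknowledged for at least delay_minutes.
    steps: list[tuple[int, str]]

    def responders_due(self, minutes_unacked: int) -> list[str]:
        return [who for delay, who in self.steps if delay <= minutes_unacked]


# Hypothetical mapping of services to the teams that own them.
ROUTES = {
    "database": EscalationPolicy(
        [(0, "dba-on-call"), (15, "dba-secondary"), (30, "infra-manager")]
    ),
    "checkout": EscalationPolicy(
        [(0, "payments-on-call"), (10, "payments-lead")]
    ),
}

# Fallback for services nobody has claimed yet (also an assumption).
DEFAULT_POLICY = EscalationPolicy([(0, "ops-on-call")])


def route(alert: Alert, minutes_unacked: int = 0) -> list[str]:
    """Return who should be notified for this alert right now."""
    if alert.severity == "info":
        return []  # informational only: keep it for the record, page nobody
    policy = ROUTES.get(alert.service, DEFAULT_POLICY)
    return policy.responders_due(minutes_unacked)
```

For example, `route(Alert("database", "critical", "replica lag"))` notifies only the database on-call, while the same alert left unacknowledged for 20 minutes also pulls in the secondary. Nobody outside the owning team is paged, which is the point of one-to-right.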
Then, the tool itself needs to be architected to be a one-to-right system. One-to-many alerting platforms exist and are powerful in mass-notification scenarios, where you’re informing large audiences of critical events so that they are aware or can take a prescribed action. In modern infrastructure and application development, the result of an alert is that experts need to take action on the information they have in order to quickly solve problems. The payload of the alert is informational and meant to guide the responders in the best way possible for the fastest mean time to acknowledge and remediate (MTTA/MTTR).
One-to-many and one-to-one are useful in their own right. But, when it comes to efficient incident response with a one-to-right strategy, automation and collaboration tools are necessary for making sure alerts are not only acknowledged quickly, keeping mean time to acknowledge (MTTA) low, but also sent to the best person. Naturally, one-to-right IT alerting gets the right person or team involved, surfaces actionable context faster and drives down mean time to remediation (MTTR).
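To make the two metrics concrete, here is a small Python sketch of how MTTA and MTTR could be computed from incident timestamps. The record layout (`triggered`, `acknowledged`, `resolved` keys) and the sample data are assumptions made for this example, and MTTR is measured here from trigger to resolution; some teams measure it from acknowledgment instead.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Hypothetical incident log: each record holds three timestamps.
t0 = datetime(2024, 1, 1, 9, 0)
t1 = datetime(2024, 1, 2, 14, 0)
incidents = [
    {
        "triggered": t0,
        "acknowledged": t0 + timedelta(minutes=2),
        "resolved": t0 + timedelta(minutes=30),
    },
    {
        "triggered": t1,
        "acknowledged": t1 + timedelta(minutes=4),
        "resolved": t1 + timedelta(minutes=50),
    },
]

mtta = mean_minutes(incidents, "triggered", "acknowledged")  # minutes
mttr = mean_minutes(incidents, "triggered", "resolved")      # minutes
```

Tracking both numbers per team or per routing rule is one way to check whether alerts are actually reaching the right people: a low MTTA paired with a high MTTR can suggest that whoever acknowledged the alert was not the person best placed to fix it.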
Start surfacing alert context and mobilizing the right responders faster with VictorOps. Try a 14-day free trial or sign up for a free personalized demo to learn how integrated on-call schedules, alert automation and detailed incident context are making on-call suck less for DevOps and IT operations teams.