Reducing MTTA (mean time to acknowledge) doesn’t happen overnight. It takes time and it takes investment in your people, processes and technology. DevOps, SRE and IT teams need to work together to improve the observability of systems and constantly iterate on workflows. A combination of effective monitoring, alerting and collaboration tools – in association with efficient on-call incident response processes – leads to less downtime, reduced MTTA and better customer experiences.
To kick off this five-part series, we’ll first discuss ways you can reduce MTTA through improved alerting. Over the course of the series, we’ll walk you through ways to improve five steps of the incident management lifecycle: alerting, notification, response, escalation and post-incident review. By showing you how to optimize your approach to each of these steps, we’ll help you lower MTTA and MTTR over time.
Each post includes steps anyone can take to reduce MTTA, whether you use VictorOps or not, and a video highlighting specific VictorOps services to help you level up incident response and make on-call suck less. To start, we’ll explain how to reduce MTTA by creating more actionable alerts with the sophisticated VictorOps alert rules engine, the Transmogrifier.
The costs of downtime
Service and application downtime costs add up quickly. According to a study published by the Rand Group, 98% of organizations said a single hour of downtime costs over $100,000. Beyond that, 33% of those companies said that one hour of downtime would cost between $1 million and $5 million.
Building and maintaining reliable services is only half of the equation. In the world of CI/CD and agile software development, incidents are bound to occur. So, DevOps and IT teams need to be prepared for the worst. Rapid incident response and constant improvement to alerting tools will reduce MTTA/MTTR and the long-term costs of downtime.
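To know whether alerting changes are actually lowering MTTA, you first need to measure it. Here is a minimal sketch, assuming you can export a pair of timestamps per incident (when the alert fired and when someone acknowledged it). The function name and the sample data are hypothetical and not part of any VictorOps API.

```python
from datetime import datetime, timedelta

def mean_time_to_acknowledge(incidents):
    """Average the gap between when each alert fired and when it was acknowledged."""
    gaps = [acked - fired for fired, acked in incidents]
    if not gaps:
        raise ValueError("need at least one acknowledged incident")
    return sum(gaps, timedelta()) / len(gaps)

# Hypothetical incident log: (alert fired, alert acknowledged)
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 4)),      # 4 minutes
    (datetime(2024, 1, 2, 14, 30), datetime(2024, 1, 2, 14, 32)),  # 2 minutes
    (datetime(2024, 1, 3, 23, 15), datetime(2024, 1, 3, 23, 24)),  # 9 minutes
]

mtta = mean_time_to_acknowledge(incidents)
print(mtta)  # 0:05:00
```

Tracking this number per week or per team gives you a baseline to compare against as you tune your alerts.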
So, let’s dive into a few ways you can reduce MTTA by improving alerts and providing context to incident responders faster.
Optimizing alert payloads
An alert is really only as good as the information provided with it. With many homegrown alerting and on-call solutions, you simply receive alerts via SMS or email – with no additional notes. With incident management software like VictorOps, you can automatically change the alert payload as it comes into the timeline. This way, your monitoring tools are serving the exact information you need, when you need it.
The VictorOps rules engine, the Transmogrifier, has numerous capabilities: it can attach annotations for alert context (e.g. runbooks, charts, logs and conference bridge links), change the alert payload as it comes into VictorOps, and customize how alerts are routed and prioritized. Every alert is processed by the rules engine in order, from the top rule to the bottom, so the order of your rules is very important.
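To make the idea of ordered rule processing concrete, here is a rough sketch of a tiny rules engine. It is an illustration only, not the Transmogrifier’s actual implementation or syntax: the Rule class, the apply_rules function, the field names (entity_id, routing_key, runbook_url) and the example URL are all assumptions made for the example. It also assumes that a later rule sees the changes made by an earlier one.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Dict, List

Alert = Dict[str, str]

@dataclass
class Rule:
    field: str                  # alert field to inspect
    pattern: str                # shell-style wildcard matched against that field
    set_fields: Dict[str, str]  # fields to add or overwrite when the rule matches

def apply_rules(alert: Alert, rules: List[Rule]) -> Alert:
    """Run every rule against the alert in list order, top to bottom."""
    result = dict(alert)  # copy so the incoming alert is left untouched
    for rule in rules:
        if fnmatch(result.get(rule.field, ""), rule.pattern):
            result.update(rule.set_fields)
    return result

# Hypothetical rules: route database alerts, annotate them, raise disk severity.
rules = [
    Rule("entity_id", "db-*", {"routing_key": "database-team"}),
    Rule("routing_key", "database-team",
         {"runbook_url": "https://wiki.example.com/runbooks/db"}),
    Rule("message", "*disk*", {"severity": "critical"}),
]

alert = {
    "entity_id": "db-primary-01",
    "message": "disk usage at 95%",
    "severity": "warning",
}

processed = apply_rules(alert, rules)
print(processed["routing_key"], processed["severity"])

# Swapping the first two rules loses the runbook annotation, because the
# annotation rule now runs before routing_key has been set.
swapped = apply_rules(alert, [rules[1], rules[0], rules[2]])
print("runbook_url" in swapped)
```

In this sketch the second rule only fires because the first rule has already set routing_key, which is one way rule order can change the final alert.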
By surfacing context more quickly and automating much of the alerting process, your on-call engineers can quiet alert noise, respond to an alert more quickly and collaborate around important incidents with helpful context and resources. You can better prioritize alerts from different monitoring solutions and make sure alerts are served to the right person at the right time. This reduces MTTA/MTTR over time and creates an on-call experience that doesn’t suck.
Self-healing systems and alert waiting rooms
Beyond rewriting payloads, the Transmogrifier can be used to create “waiting rooms” for incidents that are likely to self-heal, and to escalate those alerts to humans only if they don’t self-correct. This helps silence the noise so your team pages on-call responders only when it’s necessary. If you notice recurring issues that commonly self-correct, or you have specific instructions for remediating repeated alerts, you can use the Transmogrifier to cut through the noise and reduce MTTA.
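The waiting-room idea can be sketched in a few lines. This is a simplified model under stated assumptions, not how VictorOps implements it: the WaitingRoom class, its method names, the alert IDs and the 300-second hold window are all hypothetical. Alerts are held when they arrive, dropped if a recovery comes in, and released for paging only once they outlive the hold window.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class WaitingRoom:
    hold_seconds: float
    # alert id -> time (in seconds) the alert was first seen
    _held: Dict[str, float] = field(default_factory=dict)

    def on_alert(self, alert_id: str, now: float) -> None:
        """Hold a new alert instead of paging right away."""
        self._held.setdefault(alert_id, now)

    def on_recovery(self, alert_id: str) -> None:
        """The alert self-healed, so drop it without paging anyone."""
        self._held.pop(alert_id, None)

    def due_for_escalation(self, now: float) -> List[str]:
        """Release and return the alerts that outlived the hold window."""
        due = [a for a, seen in self._held.items()
               if now - seen >= self.hold_seconds]
        for alert_id in due:
            del self._held[alert_id]
        return due

room = WaitingRoom(hold_seconds=300)
room.on_alert("cpu-spike-web-01", now=0)
room.on_alert("disk-full-db-02", now=10)
room.on_recovery("cpu-spike-web-01")      # recovered on its own, nobody is paged

early = room.due_for_escalation(now=120)  # still inside the hold window
late = room.due_for_escalation(now=400)   # the disk alert never recovered
print(early, late)  # [] ['disk-full-db-02']
```

Passing the clock in as an argument keeps the sketch easy to test. A real system would check for due alerts on a timer and hand them to whatever pages your on-call responders.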
Automation and context are at the core of reducing MTTA through alerts. With a piecemeal alerting system, you’re simply getting alerts from your monitoring tools without any context or prioritization. A sophisticated rules engine can automatically serve helpful resources with alerts and ensure the alerts are served to the right people at the right time. Leveraging the automation and collaboration resources already at your fingertips can make on-call less stressful and reduce MTTA – leading to more reliable services, happier customers and happier employees.
In part two of our series, we’ll cover how you can lower MTTA with improved notifications and personal paging policies.
Want to try out the Transmogrifier and lower MTTA/MTTR while making on-call suck less? Sign up for a 14-day free trial or register for a free personalized demo to learn more about VictorOps on-call incident management software.