When it comes to incident management, classifying alert severity matters. If every alert were marked as critical and notified on-call engineers in the same fashion, you’d end up with a highly fatigued on-call team. By continuously adjusting monitoring tools and thresholds, centralizing incident data, and conducting thorough post-incident reviews, you can optimize alert severity classifications and reduce alert fatigue.
Don’t treat every incident equally. By establishing intelligent thresholds and defining alert importance, you can build awareness of potential problems before they become full-fledged incidents. Improving visibility and alert classification leads to a better holistic understanding of how your system functions.
Every team is structured differently, and every service’s infrastructure is organized differently, so there is no one-size-fits-all approach to classifying alert severity. Still, some general logic applies when building incident response processes around incident severity. The classifications below may help you prioritize and respond to alerts in a timely manner.
Critical incidents cause negative effects for your end users. They take priority over everything else and need to be resolved ASAP: their alerts must be acknowledged and escalated before you start addressing any less severe incidents. Typically, critical alerts should notify on-call engineers through several channels at once in order to get their attention more quickly. Critical incidents are the only type of alert that should be waking somebody up at 4 AM.
Moderate incidents likely have a minor effect on a small number of end users. These incidents are important to resolve, but they typically don’t need a quick patch. That gives you time to find a full solution instead of a patchwork resolution. While a critical incident might be an outage of an entire feature, a moderate incident is more likely an outage within one small part of a feature.
Non-critical incidents are rarely reported by end users. They’re small glitches or optimizations to the user experience that typically go unnoticed by end users. Non-critical incidents can include certain capacity planning work or may be a smaller issue that’s a side effect of a larger incident.
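To make the three levels concrete, here is a minimal sketch of how they might map to notification behavior. The level names follow the descriptions above, but the channel names, the NotificationPolicy type, and the channels_for helper are illustrative assumptions, not the configuration of any particular tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels; a lower value means more urgent."""
    CRITICAL = 1
    MODERATE = 2
    NON_CRITICAL = 3


@dataclass(frozen=True)
class NotificationPolicy:
    channels: tuple          # how responders are contacted (assumed names)
    page_after_hours: bool   # may this level wake someone up at 4 AM?


# Illustrative policy table: only critical incidents page outside working hours.
POLICIES = {
    Severity.CRITICAL: NotificationPolicy(("phone", "sms", "push"), True),
    Severity.MODERATE: NotificationPolicy(("push", "chat"), False),
    Severity.NON_CRITICAL: NotificationPolicy(("ticket",), False),
}


def channels_for(severity: Severity, after_hours: bool) -> tuple:
    """Return the channels to notify right now for the given severity."""
    policy = POLICIES[severity]
    if after_hours and not policy.page_after_hours:
        return ()  # hold the notification until working hours
    return policy.channels
```

With this table, channels_for(Severity.CRITICAL, after_hours=True) returns all three channels, while a moderate incident at the same hour returns an empty tuple and waits for the morning.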
When receiving alerts, automation can be used to immediately provide incident severity in-line with an incident’s contextual data. By setting levels of incident severity, you can organize project management and prioritize incident workflows. A deep dive into your infrastructure and the associated on-call schedules, paging policies, and escalation policies can surface the best ways to set incident severity levels.
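As a rough sketch of that kind of automation, the function below attaches a severity to an incoming alert alongside its existing context. The payload fields (error_rate, users_affected) and every threshold value are assumptions made for illustration; real thresholds should come out of your own monitoring data and post-incident reviews.

```python
# Hypothetical thresholds; tune these against your own services.
CRITICAL_ERROR_RATE = 0.05   # 5% of requests failing
MODERATE_ERROR_RATE = 0.01   # 1% of requests failing
CRITICAL_USERS = 1000        # number of affected end users


def classify(alert: dict) -> dict:
    """Return a copy of the alert with a 'severity' field added."""
    error_rate = alert.get("error_rate", 0.0)
    users_affected = alert.get("users_affected", 0)

    if error_rate >= CRITICAL_ERROR_RATE or users_affected >= CRITICAL_USERS:
        severity = "critical"
    elif error_rate >= MODERATE_ERROR_RATE or users_affected > 0:
        severity = "moderate"
    else:
        severity = "non-critical"

    # Keep the original context and add severity in-line with it.
    return {**alert, "severity": severity}
```

Returning a new dictionary instead of mutating the input keeps the raw alert available for later review, which helps when you revisit thresholds after an incident.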
When you understand what constitutes a critical incident, you can work backward to build better incident workflows. Continuous improvement of monitoring thresholds, alerting methods, and collaboration techniques allows you to maintain higher levels of uptime and build more resilient systems. That work starts with setting incident severity and clearly stating the actions to be taken for each level of severity.
With severity levels in-line and integrated into your incident management solution, you can better prioritize workflows and remediate critical issues faster. Also, with alert routing and automation, you can deliver this important piece of context to the right person at the right time. With severity built into your alerting and response workflows, incident remediation speeds up, and on-call sucks less.
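One way to picture severity-driven prioritization is a queue that always hands responders the most severe open incident first, and the oldest one within a severity level. This is a small sketch using Python's standard heapq module; the IncidentQueue class, the severity labels, and the incident names are hypothetical.

```python
import heapq
import itertools

# Assumed ranking: a lower rank is handled first.
SEVERITY_RANK = {"critical": 0, "moderate": 1, "non-critical": 2}


class IncidentQueue:
    """Orders incidents by severity, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker within a severity

    def push(self, name: str, severity: str) -> None:
        entry = (SEVERITY_RANK[severity], next(self._arrival), name)
        heapq.heappush(self._heap, entry)

    def pop(self) -> str:
        """Remove and return the most urgent incident."""
        _, _, name = heapq.heappop(self._heap)
        return name

    def __len__(self) -> int:
        return len(self._heap)


queue = IncidentQueue()
queue.push("slow dashboard", "non-critical")
queue.push("checkout outage", "critical")
queue.push("export bug", "moderate")
queue.push("login outage", "critical")

handled = [queue.pop() for _ in range(len(queue))]
```

Here handled comes out as checkout outage, login outage, export bug, slow dashboard: both critical incidents are worked before anything else, in the order they arrived.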
Homegrown incident management solutions simply won’t cut it when it comes to building on-call workflows. Learn more about key functionality in our free Incident Management Buyers Guide to start making on-call suck less.