The discussion about incident management tends to focus on what happens in real time, when an incident is actually occurring. To a degree, that makes sense; identifying and responding to incidents quickly is a core component of effective incident management.
But it’s only one component, and if you focus too narrowly on response itself, you miss out on the broader value that effective incident management provides. The best incident management strategies also include reporting and analytics features that help you understand what happened after the response period is over. Only by implementing incident KPIs and effective reporting and analytics processes can you prevent the same types of incidents from happening again and again and fully optimize your incident management strategy.
In incident management, reporting and analytics refers to any type of data you use to investigate or understand an incident after the incident has been closed. It’s distinct from data used while your team is in the midst of identifying or responding to the incident.
Incident reports and analytics come in many forms. Some common examples include:
A post-mortem report on a specific incident. This would typically include information about who responded, the escalation paths the incident followed and how and when the incident was resolved.
On-call reports. You can generate reports showing what your engineers do when they’re on-call. These on-call reports would include information like how many incidents the average on-call engineer handles, how on-call duties vary between different days or times of day, and so on.
Incident metrics. Averaged data about how often incidents occur and how long it takes to acknowledge and resolve an incident (MTTA/MTTR) are common metrics you can track with incident reporting and analytics tools.
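As a rough illustration of the incident metrics described above, the following minimal sketch computes MTTA and MTTR from a handful of incident records. The records, the field names (`opened`, `acknowledged`, `resolved`) and the choice to measure both metrics from the moment the incident was opened are assumptions made for this example; a real incident management tool would export its own fields, and some teams measure time to resolve from acknowledgment instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names are assumptions for this sketch.
incidents = [
    {"opened": datetime(2020, 3, 2, 2, 0),
     "acknowledged": datetime(2020, 3, 2, 2, 4),
     "resolved": datetime(2020, 3, 2, 2, 50)},
    {"opened": datetime(2020, 3, 4, 14, 0),
     "acknowledged": datetime(2020, 3, 4, 14, 6),
     "resolved": datetime(2020, 3, 4, 14, 30)},
    {"opened": datetime(2020, 3, 13, 9, 0),
     "acknowledged": datetime(2020, 3, 13, 9, 2),
     "resolved": datetime(2020, 3, 13, 9, 40)},
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# Both metrics are measured from the time the incident was opened (an assumption).
mtta = mean_minutes(incidents, "opened", "acknowledged")
mttr = mean_minutes(incidents, "opened", "resolved")

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Averages like these hide outliers, so a real report might also show the median or the slowest incidents alongside the mean.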
Again, after-the-fact reporting may not be the most exciting facet of incident response. We live in a world where we’re taught that real-time action means everything. Indeed, it seems fair to say that, in IT culture, terms like reporting conjure thoughts of boring paperwork you have to file in order to keep bureaucrats happy, as opposed to work that actually helps you advance your core mission. There’s a reason why Peter Gibbons hated TPS reports so much in Office Space.
But incident response reports aren’t just meaningless paperwork. Far from it: they’re a critical resource for helping incident response teams do their jobs more effectively and for making those jobs more pleasant.
Perhaps the only thing that IT engineers hate more than pointless paperwork is tedious manual tasks they have to repeat over and over again.
In incident response, repetitive tasks are what happen when you have the same types of incidents occurring over and over again. Effective reporting and analytics are critical for avoiding this phenomenon because they help to reveal long-term problems or trends that might not otherwise be obvious to on-call engineers who are too busy in the trenches to see the bigger picture.
For example, an incident frequency report can show you which types of incidents occur most often and the trends associated with them. If the most frequent incidents impact a certain type of system or happen at a certain time of the week or the month, your team can take action to address the root cause and prevent the incident from recurring indefinitely.
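An incident frequency report of this kind can be sketched in a few lines. The example below counts hypothetical incidents by type and by day of the week; the incident types, dates and field names are invented for illustration and are not taken from any particular tool.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical incident log; types and field names are assumptions.
incidents = [
    {"type": "disk_full", "opened": datetime(2020, 3, 2, 2, 0)},
    {"type": "disk_full", "opened": datetime(2020, 3, 9, 2, 5)},
    {"type": "disk_full", "opened": datetime(2020, 3, 16, 1, 55)},
    {"type": "api_timeout", "opened": datetime(2020, 3, 4, 14, 0)},
    {"type": "cert_expired", "opened": datetime(2020, 3, 13, 9, 0)},
]

# How often does each type of incident occur?
by_type = Counter(i["type"] for i in incidents)

# On which days of the week do incidents cluster?
by_weekday = Counter(WEEKDAYS[i["opened"].weekday()] for i in incidents)

print("Most frequent incident types:", by_type.most_common(3))
print("Incidents per weekday:", dict(by_weekday))
```

In this made-up data set, the same incident type recurs at roughly the same time every week, which is the kind of pattern that points toward a scheduled job or other root cause worth fixing rather than a string of unrelated incidents.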
This leads to happier team members – not just because they have fewer incidents to worry about but because they’re also not doing the same things over and over for no good reason.
Without reports, it’s a challenge to communicate information that helps the entire team understand an incident or a series of issues. This is especially true when it comes to incidents that involve multiple responders, not all of whom are working together or at the same time.
Indeed, it can be easy for a team to underestimate the seriousness of an incident when responsibility is split among multiple engineers. If each engineer only handles a small part of the response, the team can come away with the impression an incident wasn’t actually a big deal – especially if they never sit down and analyze the incident collectively, which most teams don’t do unless their analysis is driven by a report.
In this way, a post-incident report can help teams gain a more accurate understanding of what actually happened during incident response and how the team as a whole can collaborate to prevent the incident from happening again, or address it more effectively if it does.
Everyone likes to feel appreciated. Few people do when they’re on call at 2 AM and no one else is around to see how hard they’re working. And by the time the rest of the team comes into work in the morning, the on-call engineer from the night before has probably gone home, possibly leaving little trace for colleagues to see what happened during the on-call shift.
This is why on-call reports are so useful for giving credit where it’s due, especially to team members whose work might otherwise go unrecognized. Instead of trying to measure effort in an ad hoc fashion by looking back through incident logs from time to time, on-call reports generate a systematic representation of who was doing what to respond to incidents while the rest of the team was focused on other things.
The value of the work the incident management team performs can be tough to communicate to people who aren’t involved in incident response. They take it for granted that most things keep working and they tend to pay attention to incident response only when something goes wrong.
This tendency creates something of a paradox wherein the value of the incident management team is appreciated only when it falls short and something critically breaks – which ideally rarely happens, because effective incident response means identifying and resolving problems before they become critical disruptions.
In the face of this challenge, reporting and analytics help to demonstrate the value of incident management in a more positive and comprehensive way. Instead of only being able to say what didn’t happen – things like “no servers went down this month” or “we haven’t lost any critical business data in over a year” – the team can use reports that include data about incident frequency and mean time to acknowledge and resolve to demonstrate how much work it actually did.
Continuous improvement is a term that gets tossed around a lot in IT these days – it’s one of the mantras of the DevOps movement. ITIL 4 places some emphasis on it, too.
Yet, the tricky thing about continuous improvement is that it can be hard to quantify. It’s easy to say you’re continuously improving things by adjusting your processes or your tools. It’s harder to prove that the changes you make are actually leading to improvement as opposed to mere change.
This is why the ability to report on and analyze data like mean time to acknowledge and mean time to resolve (MTTA and MTTR) is so valuable for tracking and demonstrating continuous improvement. It’s the most reliable way to know that the tweaks you make to your incident management strategy are leading to positive change over time. Being able to track these trends is valuable not just for helping the incident response team do its job better, but also for justifying investment in incident response to business stakeholders.
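To make the idea of tracking improvement over time concrete, here is a minimal sketch that groups resolution times by month and checks whether MTTR is falling. The figures and the `(month, minutes)` layout are invented for illustration; in practice the data would come from whatever your incident management tool exports.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (month, minutes-to-resolve) pairs; values are made up.
resolutions = [
    ("2020-01", 60), ("2020-01", 80),
    ("2020-02", 50), ("2020-02", 60),
    ("2020-03", 40), ("2020-03", 44),
]

# Group resolution times by month.
by_month = defaultdict(list)
for month, minutes in resolutions:
    by_month[month].append(minutes)

# MTTR per month, in chronological order.
monthly_mttr = {month: mean(times) for month, times in sorted(by_month.items())}

# Month-over-month change in MTTR; negative values mean faster resolution.
months = list(monthly_mttr)
changes = {
    cur: monthly_mttr[cur] - monthly_mttr[prev]
    for prev, cur in zip(months, months[1:])
}

improving = all(delta < 0 for delta in changes.values())

print("Monthly MTTR (min):", monthly_mttr)
print("Change vs. previous month:", changes)
print("MTTR improving every month:", improving)
```

A falling MTTR on its own does not prove that a particular process change caused the improvement, but a trend like this gives the team and its stakeholders something measurable to discuss.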
If you work in IT or DevOps, you may be naturally inclined to dislike anything called a report. But, when it comes to incident management, reporting and analytics are key to optimizing incident response strategies, helping your team do their job more effectively and enjoyably, and proving to stakeholders just how much value incident management brings to the business.
Chris Tozzi has worked as a journalist and Linux systems administrator. He has particular interests in open source, agile infrastructure and networking. He is Senior Editor of content and a DevOps Analyst at Fixate IO. His latest book, For Fun and Profit: A History of the Free and Open Source Software Revolution, was published in 2017.