In many organizations, DevOps, IT, SRE and operations teams become laser-focused on reducing MTTA through improvements to real-time collaboration and visibility. Optimizing the way teams consume monitoring data, share information and communicate during a firefight matters, but detailed post-incident analyses are equally helpful. Without analysis, your team may be extremely fast during incident response and still fall short, because response is only half the equation.
Part four of the reducing MTTA series will cover post-incident reviews. Post-incident reviews are used to surface system, process and people concerns. In an incident management solution like VictorOps, you can curate all the information you need in a post-incident review. You can combine real-time communication and alert data with charts, logs and runbooks in a single pane of glass to paint a detailed picture of how an incident occurred.
Before going deeper, I’ll talk a little bit about conducting post-incident reviews with VictorOps and some of the other reporting information you can leverage to continuously improve and keep making on-call suck less.
If you haven’t seen the earlier parts, don’t forget to check out the rest of our reducing MTTA series.
A lot of engineering and IT teams find themselves constantly tweaking monitoring thresholds and discussing the tools and technology behind their services. An understanding of your system’s make-up and interdependencies is essential to building reliable products. But, acknowledging how these functions affect the people on your team is how you take the next step toward more resilient applications and infrastructure. Post-incident reviews can’t only account for technology and process – they need to consider the human element.
Compiling chat history, conference call recordings and similar records alongside system monitoring and alert data leads to more comprehensive post-incident reviews. Over time, you can see exactly where alerts are getting lost and why. You will see not only why a system behaves the way it does but also how the team responds to issues in your service. Then, you can add automation and transparency to key steps in the incident lifecycle to improve collaboration and reduce MTTA/MTTR over time.
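To make the idea of compiling chat and alert data concrete, here is a minimal sketch in Python. It assumes you can export both kinds of records as timestamped events. The `Event` type, the `build_timeline` helper and the sample data are all hypothetical and are not part of any VictorOps API.

```python
from dataclasses import dataclass
from datetime import datetime
from heapq import merge


@dataclass(frozen=True)
class Event:
    at: datetime   # when the event happened
    source: str    # illustrative label, e.g. "alert" or "chat"
    detail: str    # free-text description


def build_timeline(*streams):
    """Merge several event streams into one chronological list."""
    # Sort each stream on its own, then interleave them by timestamp.
    ordered = [sorted(stream, key=lambda e: e.at) for stream in streams]
    return list(merge(*ordered, key=lambda e: e.at))


# Made-up sample data, for illustration only.
alerts = [
    Event(datetime(2024, 1, 1, 3, 0), "alert", "disk usage critical on db-1"),
    Event(datetime(2024, 1, 1, 3, 40), "alert", "disk usage recovered on db-1"),
]
chat = [
    Event(datetime(2024, 1, 1, 3, 12), "chat", "acknowledged, looking at db-1"),
    Event(datetime(2024, 1, 1, 3, 25), "chat", "rotating old logs per runbook"),
]

for event in build_timeline(alerts, chat):
    print(f"{event.at:%H:%M} [{event.source}] {event.detail}")
```

Reading the merged output top to bottom shows what the system reported next to what people said and did, which is the view a post-incident review needs.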
At a high level, the key metrics for incident management will always be MTTA and MTTR. While you are constantly striving to lower the time to acknowledge and resolve an incident, there are numerous metrics beneath those that you can track. If you only track MTTA and MTTR, you aren’t allowing yourself to see all of the variables.
So, here are a few suggestions for other incident management KPIs you can monitor to ensure incident response and remediation are becoming more efficient over time:
How much time was spent in each phase of the incident lifecycle? (detection, response, remediation)
Establish an incident timeline
What tasks were accomplished or what commands were executed during incident response? Who executed these steps and when? How did these responders decide to take these actions?
Track communication that occurred during the time of the incident and report on it
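As a rough illustration of the first suggestion, the sketch below breaks each incident into detection, response and remediation phases and then derives MTTA and MTTR from the same timestamps. The `Incident` fields, the phase boundaries and the sample data are assumptions made for this example. Teams define these metrics differently, so adjust the boundaries to match your own definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime       # assumed: when the problem began
    alerted: datetime       # assumed: when the alert fired
    acknowledged: datetime  # when a responder acknowledged the alert
    resolved: datetime      # when the incident was resolved


def phase_durations(incident):
    """Time spent in each phase of one incident."""
    return {
        "detection": incident.alerted - incident.started,
        "response": incident.acknowledged - incident.alerted,
        "remediation": incident.resolved - incident.acknowledged,
    }


def mean_delta(deltas):
    """Average a collection of timedeltas."""
    return timedelta(seconds=mean([d.total_seconds() for d in deltas]))


def mtta(incidents):
    # Assumption: time to acknowledge is measured from the alert firing.
    return mean_delta([i.acknowledged - i.alerted for i in incidents])


def mttr(incidents):
    # Assumption: time to resolve is also measured from the alert firing.
    return mean_delta([i.resolved - i.alerted for i in incidents])


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 5),
             datetime(2024, 1, 1, 3, 12), datetime(2024, 1, 1, 3, 40)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 2),
             datetime(2024, 1, 2, 14, 5), datetime(2024, 1, 2, 14, 20)),
]

print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
for incident in incidents:
    print(phase_durations(incident))
```

Keeping the per-phase breakdown next to the averages shows whether a slow resolution came from late detection, a slow acknowledgment or a long remediation, which MTTA and MTTR alone would hide.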
Leaving your team in darkness leaves everyone blind. And if your team works in silos and doesn’t share information cross-functionally, people aren’t only blind – they’re also split up into separate rooms. Adding transparency from beginning to end of the software development and delivery lifecycle leads to improved service reliability, delivery speed and collaboration. You can show exactly how people are working and where problems lie in your organization – whether those problems are systemic or human.
But, with added transparency to workflows, you need to encourage a blameless culture. If somebody makes a mistake and it causes a bug, you can’t simply point fingers and blame one person for causing the issue. Maybe the developer wasn’t given the proper information when writing the code. Did QA accidentally skip a test that could have caught the issue? You’ll find that most incidents in software development and IT aren’t caused by one single point of failure. And if an incident does trace back to a single point of failure, you likely need to adjust something in your workflows.
People need to be encouraged to communicate openly and address issues, not hide them. In a culture of blame, teams are more likely to hide issues instead of surfacing them and collaborating to fix the root cause of the problem. Taking the time to conduct blameless post-incident reviews will ultimately improve collaboration and transparency across the entire incident lifecycle – improving service reliability and leading to better customer experiences and greater business value.
In case you missed it, check out the rest of our reducing MTTA series to identify actionable ways you can make on-call suck less and drive efficiency in incident management.
Conduct thorough post-incident reviews by centralizing incident details in a collaborative on-call incident management tool. Sign up for a 14-day free trial or check out a personalized demo to learn exactly how VictorOps makes on-call suck less.