Application quality and incident response are two areas a software organization should always strive to improve. Data analysis is one way such an organization can improve both the efficiency of its incident management and the overall quality of its application. The question remains, however: which metrics should be collected, and how can analyzing them facilitate these improvements?
Read on to learn about five key incident response metrics. Discover how these metrics and incident management KPIs can provide insights that add value for your customers, both in the quality of your application and in the efficiency of your incident response strategy.
Generally speaking, DevOps and IT organizations have two options for providing a better customer experience with their product. The first is to change the application itself, either by improving the quality of the application or service or by implementing new features that provide value to the customer. The second is to improve the incident management process so that issues encountered by the customer are resolved quickly and seamlessly. Here are some metrics that can help an organization take these steps more thoughtfully:
The first metric to analyze, the most commonly reported errors and performance issues, may also be the most impactful. These issues should be tracked and reported back to the development team for root cause analysis via thorough post-incident reviews. Repeated failure of the same functionality will likely trace back to the same root cause which, once resolved, could fix the problem for good. Likewise, recurring application slowness may be the result of poorly constructed queries, and simply optimizing those queries could lead to better performance and happier customers.
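As a minimal sketch of this kind of tracking, the snippet below counts how often each type of issue is reported so the most frequent ones can be handed to the development team first. The incident records and the `signature` field are hypothetical; a real system would derive that label from whatever error code, endpoint, or category its alerting tool records.

```python
from collections import Counter

# Hypothetical incident records. The "signature" field is an assumed label
# (for example, an error code or failing endpoint) used to group repeats.
incidents = [
    {"id": 1, "signature": "checkout-timeout"},
    {"id": 2, "signature": "login-500"},
    {"id": 3, "signature": "checkout-timeout"},
    {"id": 4, "signature": "checkout-timeout"},
    {"id": 5, "signature": "search-slow-query"},
]


def most_common_issues(records, top_n=3):
    """Return the top_n (signature, count) pairs, most frequent first."""
    counts = Counter(record["signature"] for record in records)
    return counts.most_common(top_n)


top_issues = most_common_issues(incidents)
for signature, count in top_issues:
    print(f"{signature}: {count} reports")
```

The issues at the top of this list are natural candidates for a post-incident review.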
The time it takes for an incident response team to acknowledge a reported incident can reveal a lot about the effectiveness of an organization’s incident management process. While the acknowledgment time for any single incident may not be indicative of a trend, calculating the mean time to acknowledge (MTTA) can help an organization determine whether its incident management strategy needs to change. A better incident management strategy can facilitate faster response times and let customers know they’re not forgotten. Possible changes include setting up additional or repeating time-based alerts to inform the necessary incident response personnel of newly created issues, ensuring faster acknowledgment and fewer gaps in on-call coverage. Another possibility is scheduling additional on-call staff or restructuring current schedules to ensure adequate staffing for the volume of issues.
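The MTTA calculation described above can be sketched as follows. This is an illustration under assumed field names (`created_at`, `acknowledged_at`), not the reporting logic of any particular tool; incidents that have not been acknowledged yet are simply skipped.

```python
from datetime import datetime, timedelta

# Hypothetical incidents; the field names are assumptions for this sketch.
incidents = [
    {"created_at": datetime(2024, 1, 1, 9, 0),
     "acknowledged_at": datetime(2024, 1, 1, 9, 4)},
    {"created_at": datetime(2024, 1, 1, 10, 0),
     "acknowledged_at": datetime(2024, 1, 1, 10, 10)},
    {"created_at": datetime(2024, 1, 1, 11, 0),
     "acknowledged_at": None},  # not yet acknowledged
]


def mean_time_to_acknowledge(records):
    """Average of (acknowledged_at - created_at) over acknowledged incidents."""
    deltas = [
        record["acknowledged_at"] - record["created_at"]
        for record in records
        if record.get("acknowledged_at") is not None
    ]
    if not deltas:
        return None  # nothing acknowledged, so there is no mean to report
    return sum(deltas, timedelta()) / len(deltas)


mtta = mean_time_to_acknowledge(incidents)
print(f"MTTA: {mtta}")
```

Comparing this value across weeks or across on-call rotations is what surfaces a trend, rather than any single acknowledgment time.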
Another important incident response metric to track, in an effort to increase customer satisfaction, is the time to resolution for reported incidents. The goal, of course, is to resolve incidents as quickly and efficiently as possible. Calculating the overall mean time to resolve (MTTR), along with the average time to resolve particular types of issues, can show where the organization should focus when improving its incident response strategy. It’s possible that incident response personnel require more training on certain topics or systems, or that procedures for documentation and communication should change. Appropriate efforts should be made to facilitate collaboration among staff so that issues are resolved more efficiently.
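To illustrate, here is a rough sketch that computes MTTR across all incidents and then per issue type. The `category`, `created_at`, and `resolved_at` fields are assumed names, and the sample data is made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical resolved incidents; field names are assumptions.
incidents = [
    {"category": "database", "created_at": datetime(2024, 1, 1, 9, 0),
     "resolved_at": datetime(2024, 1, 1, 9, 30)},
    {"category": "database", "created_at": datetime(2024, 1, 2, 9, 0),
     "resolved_at": datetime(2024, 1, 2, 10, 30)},
    {"category": "network", "created_at": datetime(2024, 1, 3, 9, 0),
     "resolved_at": datetime(2024, 1, 3, 9, 20)},
]


def mean_time_to_resolve(records):
    """Average of (resolved_at - created_at); None if there are no records."""
    if not records:
        return None
    durations = [r["resolved_at"] - r["created_at"] for r in records]
    return sum(durations, timedelta()) / len(durations)


def mttr_by_category(records):
    """Group incidents by category and compute MTTR for each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["category"]].append(record)
    return {name: mean_time_to_resolve(group) for name, group in groups.items()}


overall_mttr = mean_time_to_resolve(incidents)
per_category = mttr_by_category(incidents)
```

A category whose MTTR sits well above the overall figure is a reasonable place to look first for training or documentation gaps.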
Tracking the time at which each incident is reported can reveal some important trends as well. Is application slowness commonly detected and reported on Monday mornings? Maybe traffic to your application is significantly higher at that time, and scaling is necessary to permanently prevent the problem. Did an issue start presenting itself after a particular deployment? This type of information gives the development team insight that helps it track down problems more quickly and easily.
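One simple way to look for such patterns is to bucket report times by weekday and hour. The sketch below does this for a handful of made-up timestamps; the data and function name are illustrative only.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical report times for incidents tagged as "application slowness".
reported_at = [
    datetime(2024, 1, 1, 9, 5),    # Monday
    datetime(2024, 1, 1, 9, 40),   # Monday
    datetime(2024, 1, 3, 14, 0),   # Wednesday
    datetime(2024, 1, 8, 9, 15),   # Monday
]


def incidents_by_weekday_hour(timestamps):
    """Count incidents per (weekday name, hour of day) bucket."""
    return Counter((DAY_NAMES[t.weekday()], t.hour) for t in timestamps)


buckets = incidents_by_weekday_hour(reported_at)
busiest_bucket, busiest_count = buckets.most_common(1)[0]
print(f"Busiest bucket: {busiest_bucket} with {busiest_count} reports")
```

The same bucketing can be keyed on deployment identifiers instead of time of day to check whether an issue first appeared after a particular release.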
Are incidents frequently being escalated or rerouted to different units within the organization? If so, the incident response strategy likely needs some changes. These can range from slight adjustments to the alerting process, so that the correct personnel are informed of relevant issues in a more timely manner, to a more significant overhaul of the issue classification process that gives the team more granular detail and increases the likelihood of the right people being the first to tackle the problem.
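A rough way to quantify this is the share of incidents that were rerouted at least once, overall and per team that first received the alert. The `first_team` and `reroutes` fields below are hypothetical names chosen for this sketch.

```python
from collections import defaultdict

# Hypothetical incidents; "reroutes" is the assumed number of times an
# incident was escalated or reassigned after the initial alert.
incidents = [
    {"first_team": "web", "reroutes": 0},
    {"first_team": "web", "reroutes": 2},
    {"first_team": "web", "reroutes": 1},
    {"first_team": "database", "reroutes": 0},
]


def escalation_rate(records):
    """Fraction of incidents that were rerouted at least once."""
    if not records:
        return 0.0
    rerouted = sum(1 for record in records if record["reroutes"] > 0)
    return rerouted / len(records)


def escalation_rate_by_team(records):
    """Escalation rate grouped by the team that was alerted first."""
    groups = defaultdict(list)
    for record in records:
        groups[record["first_team"]].append(record)
    return {team: escalation_rate(group) for team, group in groups.items()}


overall_rate = escalation_rate(incidents)
rate_by_team = escalation_rate_by_team(incidents)
```

A team with a high rate is often receiving alerts that belong elsewhere, which points at routing or classification rules rather than at the team itself.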
With the information provided above, it’s fairly easy to see how collecting and analyzing incident response metrics can improve the incident management process and enhance application quality. But why is this so important?
To see the importance of this data analysis for any business, you need look no further than online retailers, financial institutions and social media companies. A positive customer experience can mean the difference between being known as the go-to platform in a particular industry and being completely irrelevant. Slow incident response times and frequent application issues can quickly sully a company’s reputation, leaving it to fight an uphill battle against the rest of the competition in its industry. In contrast, reliability and prompt issue resolution can help cultivate trust between an organization and its customer base, leading to recurring customers and a positive reputation that draws in new ones.
See how VictorOps is helping DevOps and IT teams facilitate better data-driven incident response with detailed reporting, alert automation, collaborative integrations and machine learning. Try a free 14-day trial now.