Development practices in DevOps rely on continuous feedback and constant analysis. The goal is to ensure both the timely release of quality software and continuous improvement of the processes driving development. In many ways, the same principles apply to improving an organization’s incident management procedures.
Incident management is crucial to delivering and maintaining quality software. But an effective incident management process rarely takes shape overnight. To improve it, the natural place to turn is the data documenting an organization’s incident monitoring and response operations.
Below, I’ll discuss strategies for managing and contextualizing this data. I’ll show how effective analysis of this data can assist in deriving insights that help improve DevOps processes within an organization.
Proper analysis and use of incident data can be achieved by following many of the same guidelines for analyzing any DevOps process. Consider the following practices when looking to gain useful insights from incident monitoring and response data:
Most organizations already have a sense of where they can improve. Spend some time evaluating where your particular issues reside in the context of your incident management strategy. Doing so will help you separate the incident monitoring data that is crucial to successful analysis from the noisy data that should be ignored.
Conclusions drawn from data analysis aren’t useful if you can’t act upon them. Analyze incident data with this in mind. Separate out the noise wherever possible and focus on data that provides real insights. Then, use that incident data to better the process.
When looking to improve processes with data analysis, determine which insights require you to take action. A simple guideline is to act upon the incident data that has the greatest impact on the process as a whole. Are 20% of all reported incidents related to one core problem? Fix that problem and the workload is reduced. Is one major flaw in the incident management process responsible for constant delays that hamper the organization’s ability to respond in a timely fashion? If so, fix that flaw. Simply put, evaluate the knowledge gained from data analysis and prioritize the conclusions, putting those that most influence the process at the top of the list.
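To make the prioritization idea concrete, here is a minimal Python sketch that ranks incidents by root cause and flags any cause responsible for at least 20% of the total. The record layout, the `root_cause` label, the sample values and the function names are all assumptions made for illustration; in practice the labels would likely come from post-incident reviews or manual tagging.

```python
from collections import Counter

# Hypothetical data: ten incidents, each tagged with an assumed "root_cause" label.
ROOT_CAUSES = ["db-connection-pool"] * 5 + [
    "expired-certificate",
    "disk-full",
    "dns-timeout",
    "bad-deploy",
    "out-of-memory",
]
INCIDENTS = [{"id": i, "root_cause": c} for i, c in enumerate(ROOT_CAUSES, start=1)]


def rank_root_causes(incidents):
    """Return (root_cause, count, share) tuples, most frequent first."""
    counts = Counter(incident["root_cause"] for incident in incidents)
    total = sum(counts.values())
    return [(cause, n, n / total) for cause, n in counts.most_common()]


def causes_worth_fixing(incidents, min_share=0.2):
    """Return the causes that account for at least min_share of all incidents."""
    return [
        cause
        for cause, _, share in rank_root_causes(incidents)
        if share >= min_share
    ]


if __name__ == "__main__":
    for cause, count, share in rank_root_causes(INCIDENTS):
        print(f"{cause:22} {count:3d} {share:6.0%}")
    print("Fix first:", causes_worth_fixing(INCIDENTS))
```

The 20% cutoff simply mirrors the question asked above. It is a starting point to tune, not a rule.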
As is true with any other DevOps process, improving incident management practices will be a continuous challenge. Perform data analysis routinely to refine and update your strategies over time. Rome wasn’t built in a day, and neither was a successful incident response plan.
One major step in using incident monitoring data to your advantage is to understand the metrics you have at your disposal and, more importantly, how you can combine these incident management metrics to draw insights that better your processes. Three key measurements come to mind when considering the efficiency of an organization’s incident management strategy and surrounding development practices:
The first is the number of incidents by system, or in other words, the volume of reported incidents for a given system. Breaking down the total number of reported problems by the affected system provides visibility into which systems are most stable and which are least stable. By understanding the stability of the various supported systems and applications, an organization can make better use of its time.
For example, identifying unstable applications using incident data may lead to the implementation of additional testing to root out these problems before they make their way into a production environment. This reduces the negative effects of said incidents on end-users and decreases the amount of time the DevOps team needs to spend supporting existing functionality. The organization now has more time to innovate.
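As a rough illustration of that per-system breakdown, the sketch below counts incidents opened within a recent window and groups them by affected system. The field names `system` and `opened_at`, the 30-day window and the sample records are assumptions, not the export format of any particular tool.

```python
from collections import Counter
from datetime import datetime, timedelta


def incidents_by_system(incidents, since):
    """Count incidents opened at or after `since`, grouped by affected system.

    Returns (system, count) pairs ordered from most to fewest incidents,
    so the least stable systems come first.
    """
    recent = (i["system"] for i in incidents if i["opened_at"] >= since)
    return Counter(recent).most_common()


# Hypothetical records; "system" and "opened_at" are assumed field names.
now = datetime(2024, 1, 31)
incidents = [
    {"system": "checkout-api", "opened_at": now - timedelta(days=2)},
    {"system": "checkout-api", "opened_at": now - timedelta(days=9)},
    {"system": "search", "opened_at": now - timedelta(days=5)},
    {"system": "checkout-api", "opened_at": now - timedelta(days=45)},  # outside the window
]

report = incidents_by_system(incidents, since=now - timedelta(days=30))

if __name__ == "__main__":
    for system, count in report:
        print(f"{system}: {count} incidents in the last 30 days")
```

Limiting the count to a time window keeps old, already-fixed problems from making a system look less stable than it currently is.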
The second is the time it takes to acknowledge and resolve issues, which can also show where an organization can improve. Are issues being acknowledged fast enough by those on-call and assigned to support roles? If not, the problem may lie in the organization’s process for alerting support personnel. Configuring incident management software to alert more frequently or more effectively (e.g. routing directly to the right person, using text alerts instead of email alerts, or integrating with other collaboration applications) will help notify the necessary team members sooner. In most cases, this leads to faster resolution of critical issues.
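These two durations are often summarized as mean time to acknowledge (MTTA) and mean time to resolve (MTTR). Here is a small sketch of how they might be computed from incident timestamps. The field names and sample values are hypothetical, and measuring resolution time from the moment the alert triggered (rather than from acknowledgement) is a choice made for this example.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_field, end_field):
    """Average minutes between two timestamps.

    Incidents missing either timestamp (for example, still unacknowledged)
    are skipped. Returns None when no incident has both.
    """
    durations = [
        (i[end_field] - i[start_field]).total_seconds() / 60
        for i in incidents
        if i.get(start_field) and i.get(end_field)
    ]
    return mean(durations) if durations else None


# Hypothetical incident timeline records.
incidents = [
    {
        "triggered_at": datetime(2024, 1, 10, 2, 0),
        "acknowledged_at": datetime(2024, 1, 10, 2, 4),
        "resolved_at": datetime(2024, 1, 10, 2, 50),
    },
    {
        "triggered_at": datetime(2024, 1, 12, 14, 0),
        "acknowledged_at": datetime(2024, 1, 12, 14, 16),
        "resolved_at": datetime(2024, 1, 12, 15, 10),
    },
]

mtta = mean_minutes(incidents, "triggered_at", "acknowledged_at")
mttr = mean_minutes(incidents, "triggered_at", "resolved_at")

if __name__ == "__main__":
    print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

A high MTTA with a reasonable gap between acknowledgement and resolution points at alerting and routing. The reverse points at the fix itself.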
The third is incident load by assignee. Which person or team is seeing the greatest number of incidents sent their way? Answering this question can lead to improvement as well. If one team in particular is treading water just to keep things up and running, they may need additional support. Consider the scenario where one team is constantly playing whack-a-mole with reported issues to keep a system functioning. Do they have time to improve the system and reduce the number of future problems? The answer is likely no.
Reorganizing incident assignments, adjusting on-call escalations or bringing in temporary assistance from additional staff may help keep these issues at bay. This allows the people most familiar with the system to spend more time performing post-incident reviews and shoring up weaknesses to cut down on the flow of incidents. In the end, this will result in a more stable system with less frequent outages.
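One simple way to spot a team that is treading water is to compare each team’s incident count against the average across teams. The sketch below does that under stated assumptions: the team names, the sample counts and the 1.5x threshold are all hypothetical and would need tuning for a real organization.

```python
from collections import Counter
from statistics import mean


def overloaded_teams(assignments, factor=1.5):
    """Return teams whose incident count exceeds `factor` times the per-team average.

    `assignments` is an iterable with one team name per incident.
    """
    counts = Counter(assignments)
    if not counts:
        return []
    threshold = factor * mean(list(counts.values()))
    return sorted(team for team, n in counts.items() if n > threshold)


# Hypothetical list of the team each incident was routed to.
assignments = ["payments"] * 12 + ["platform"] * 4 + ["search"] * 2

if __name__ == "__main__":
    print("Teams that may need support:", overloaded_teams(assignments))
```

A raw count ignores severity and team size, so treat the result as a prompt for a conversation rather than a verdict.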
As is true with all software-related practices, data analysis is critical to the improvement of incident management procedures. Effective analysis of incident monitoring and response data can increase visibility into the challenges that hamper an organization. These insights can lead to great leaps in incident management efficiency, as well as improvements to other DevOps processes.
Centralize incident data with on-call schedules, automated alerting and collaboration tools with an incident management solution like VictorOps. Sign up for a 14-day free trial or request a free personalized demo to learn exactly how we make on-call suck less for DevOps and IT teams.