
Best Practices for Contextualizing Incident Monitoring Data

Scott Fitzpatrick June 19, 2019


DevOps development practices rely on continuous feedback and constant analysis, both to ensure the timely release of quality software and to keep improving the processes that drive development. In many ways, the same emphasis on analysis applies to improving an organization’s incident management procedures.

Incident management is crucial to delivering and maintaining quality software. But an effective incident management process rarely comes together overnight. So where should a team turn to improve it? To the data documenting the organization’s incident monitoring and response operations.

Below, I’ll discuss strategies for managing and contextualizing this data. I’ll show how effective analysis of this data can assist in deriving insights that help improve DevOps processes within an organization.

Best practices for contextualizing incident data to drive continuous improvement

Proper analysis and use of incident data can be achieved by following many of the same guidelines for analyzing any DevOps process. Consider the following practices when looking to gain useful insights from incident monitoring and response data:

Set initial goals for the incident analysis

Most organizations have some sense of where they can improve. Spend some time evaluating where your particular issues reside in the context of your incident management strategy. Doing so will help you separate the incident monitoring data that is crucial to successful analysis from the noisy data that should be ignored.

Focus on drawing actionable insights

Conclusions drawn from data analysis aren’t useful if you can’t act upon them. Analyze incident data with this in mind. Separate out the noise wherever possible and focus on data that provides real insights. Then, use that incident data to better the process.

Act upon insights with the greatest impact

When looking to improve processes with data analysis, determine which insights require you to take action. A simple guideline is to act upon the incident data that has the greatest impact on the process as a whole. Are 20% of all reported incidents related to one core problem? Fix that problem and the workload is reduced. Is one major flaw in the incident management process responsible for constant delays that hamper the organization’s ability to respond in a timely fashion? If so, fix that flaw. Simply put, evaluate the knowledge gained from data analysis and prioritize the conclusions – putting those that most influence the process at the top of the list.

Understand that improving incident management is a continuous process

As is true with any other DevOps process, improving incident management practices will be a continuous challenge. Routinely perform data analysis to constantly refine and update your strategies over time. Rome wasn’t built in a day – and neither was a successful incident response plan.


Gaining insight through the use of incident management metrics

One major step in using incident monitoring data to your advantage is understanding the metrics at your disposal and, more importantly, how you can combine these incident management metrics to draw insights that better your processes. Three key measurements come to mind when considering the efficiency of an organization’s incident management strategy and surrounding development practices:

1) The number of issues being reported for each system over a fixed period of time

Or, in other words, the volume of reported incidents for a given system. This is a metric that can prove valuable to an incident management team. Breaking down the total number of reported problems by the affected system will provide visibility into which systems are most stable and which are most unstable. By understanding the stability of the various supported systems and applications, an organization can make better use of its time.

For example, identifying unstable applications using incident data may lead to the implementation of additional testing to root out these problems before they make their way into a production environment. This reduces the negative effects of said incidents on end-users and decreases the amount of time the DevOps team needs to spend supporting existing functionality. The organization now has more time to innovate.
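As a rough illustration of this first metric, the sketch below counts incidents per system within a fixed window. The record layout (`system`, `opened_at`), the system names and the function `incidents_per_system` are all assumptions made up for this example; they do not reflect any particular tool’s export format.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical incident records; field names and values are illustrative only.
incidents = [
    {"system": "checkout-api", "opened_at": datetime(2019, 6, 3, 9, 15)},
    {"system": "checkout-api", "opened_at": datetime(2019, 6, 4, 22, 40)},
    {"system": "search", "opened_at": datetime(2019, 6, 5, 11, 5)},
    {"system": "checkout-api", "opened_at": datetime(2019, 6, 6, 3, 30)},
    {"system": "checkout-api", "opened_at": datetime(2019, 6, 12, 8, 0)},
]

def incidents_per_system(records, window_start, window_length):
    """Count incidents per system opened within [window_start, window_start + window_length)."""
    window_end = window_start + window_length
    return Counter(
        r["system"]
        for r in records
        if window_start <= r["opened_at"] < window_end
    )

# One-week window starting June 3, 2019.
weekly = incidents_per_system(incidents, datetime(2019, 6, 3), timedelta(days=7))

# List systems from least stable (most incidents) to most stable.
for system, count in weekly.most_common():
    print(f"{system}: {count}")
```

Running the same count over consecutive windows would show whether a system is becoming more or less stable over time.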

2) Time to incident acknowledgment/resolution

The amount of time it takes to acknowledge and resolve issues can also provide valuable insights into where an organization can improve. Are issues being acknowledged fast enough by those on-call and assigned to support roles? If not, then maybe the problem is with the organization’s process for alerting support personnel. Configuring incident management software to alert more frequently or more effectively (e.g. directly to the right person, via text alerts rather than email alerts, or through integrations with other collaboration applications) will assist in notifying the necessary team members in a more time-efficient manner. This, in turn, should lead to faster resolution of critical issues.
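To make this second metric concrete, here is a minimal sketch that computes mean time to acknowledge and mean time to resolve from timestamped records. The field names (`opened`, `acknowledged`, `resolved`) and the helper `mean_minutes` are hypothetical, chosen only for this example.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident timeline records; field names are illustrative only.
incidents = [
    {
        "opened": datetime(2019, 6, 3, 9, 0),
        "acknowledged": datetime(2019, 6, 3, 9, 4),
        "resolved": datetime(2019, 6, 3, 9, 50),
    },
    {
        "opened": datetime(2019, 6, 4, 14, 0),
        "acknowledged": datetime(2019, 6, 4, 14, 16),
        "resolved": datetime(2019, 6, 4, 15, 10),
    },
]

def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )

mtta = mean_minutes(incidents, "opened", "acknowledged")  # mean time to acknowledge
mttr = mean_minutes(incidents, "opened", "resolved")      # mean time to resolve

print(f"Mean time to acknowledge: {mtta:.1f} min")
print(f"Mean time to resolve: {mttr:.1f} min")
```

A mean can hide a few very slow responses, so a median or a high percentile may be worth tracking alongside it.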

3) Volume by on-call user or incident assignee

Which person or team is seeing the greatest number of incidents sent their way? Answering this question can lead to improvement as well. If one team in particular is treading water just to keep things up and running, they may need additional support. Consider the scenario where one team is constantly playing whack-a-mole with reported issues to keep a system functioning. Do they have time to improve the system and reduce the number of future problems? The answer is likely no.

A reorganization of incident assignments, on-call escalations and/or temporary assistance from additional staff may help to keep these issues at bay. This allows the people most familiar with the system to spend more time performing post-incident reviews and bolstering weaknesses to cut down on the flow of incidents. In the end, this will result in a more stable system with less frequent outages.
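One possible way to spot an overloaded team is to compare each assignee’s incident count against an even split of the total. The sketch below does this under made-up assumptions: the team names, the function `overloaded_assignees` and the 1.5x threshold are all hypothetical and would need tuning for a real organization.

```python
from collections import Counter

# Hypothetical list of the assignee for each incident in some period.
assignments = [
    "team-payments",
    "team-payments",
    "team-search",
    "team-payments",
    "team-platform",
    "team-payments",
]

def overloaded_assignees(assignees, factor=1.5):
    """Return assignees handling more than `factor` times an even share of incidents."""
    counts = Counter(assignees)
    even_share = len(assignees) / len(counts)  # incidents per assignee if split evenly
    return sorted(a for a, n in counts.items() if n > factor * even_share)

flagged = overloaded_assignees(assignments)
print(flagged)
```

A flagged team is a prompt for a conversation about escalation policies or staffing, not a verdict; raw counts say nothing about how severe each incident was.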

Context makes incident management suck less

As is true with all software-related practices, data analysis is critical to the improvement of incident management procedures. Effective analysis of incident monitoring and response data can increase visibility into the challenges that hamper an organization. These insights can lead to great leaps in incident management efficiency, as well as improvements to other DevOps processes.


About the author

Scott Fitzpatrick is a Fixate IO Contributor and has 7 years of experience in software development. He has worked with many languages and frameworks, including Java, ColdFusion, HTML/CSS, JavaScript and SQL.
