Post-Incident Reviews Part Two: The Demise of Root Cause Analysis

Jason Hand August 25, 2017

When incidents occur, the natural response is to investigate and pinpoint the cause before looking for a solution. However, this traditional approach assumes that causality is determinable. The modern IT professional needs to understand that problems stem not from one primary cause but from the complex interplay of our systems and the teams tasked with managing them.

As we dive further into my recent eBook, Post-Incident Reviews: Learning from Failure for Improved Incident Response, I’ll explore how choosing appropriate models can aid in incident management. If you missed my initial recap of the first three chapters, you can catch up here. I invite you to read on as I reveal additional methods for implementing a modern post-incident strategy. For a free copy of my eBook to share with your team, click here.

Here’s a preview of key insights from chapters four through six.

An Understanding of Ordered and Unordered Systems

You might be familiar with the term Cynefin, a Welsh word meaning habitat. In DevOps communities, it has come to name a framework for making sense of system behavior and deciding how to act. The framework sorts systems into two broad categories: Ordered Systems, which can be understood given time and investigation, and Unordered Systems, which only become knowable when examined in retrospect.

Taking it a step further, these systems can be broken down into five domains: Simple and Complicated (ordered), Complex and Chaotic (unordered), and Disorder, which applies when it is unclear which of the other four fits.

Using this Cynefin framework, further outlined in this diagram, DevOps teams can quickly identify which category an incident falls into and then apply the practices associated with that category to find an appropriate solution.
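As a rough illustration of what "identify the category, then apply the associated practice" could look like in tooling, here is a minimal Python sketch. The domain names come from the text above. The ordered/unordered grouping and the decision sequences are as commonly described for Cynefin, not taken from this post, and the names `Domain`, `RESPONSE_PATTERN`, and `describe` are hypothetical.

```python
from enum import Enum


class Domain(Enum):
    """The five Cynefin domains named in the text."""
    SIMPLE = "Simple"
    COMPLICATED = "Complicated"
    COMPLEX = "Complex"
    CHAOTIC = "Chaotic"
    DISORDER = "Disorder"


# Assumption: decision sequences as commonly described for Cynefin,
# not something this post spells out.
RESPONSE_PATTERN = {
    Domain.SIMPLE: "sense, categorize, respond",
    Domain.COMPLICATED: "sense, analyze, respond",
    Domain.COMPLEX: "probe, sense, respond",
    Domain.CHAOTIC: "act, sense, respond",
}

# Assumption: Simple and Complicated are the ordered domains;
# Complex and Chaotic are the unordered ones.
ORDERED = {Domain.SIMPLE, Domain.COMPLICATED}


def describe(domain: Domain) -> str:
    """Return a one-line summary of how to approach an incident in this domain."""
    if domain is Domain.DISORDER:
        return "Disorder: domain not yet identified; gather more context first"
    kind = "ordered" if domain in ORDERED else "unordered"
    return f"{domain.value} ({kind}): {RESPONSE_PATTERN[domain]}"


if __name__ == "__main__":
    for d in Domain:
        print(describe(d))
```

A lookup table like this is only a starting point; deciding which domain an incident belongs to remains a human judgment call.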

Evaluation Models

Choosing a model that best represents an incident is a valuable first step in determining causation and, later, a strategy for resolving it. The following are the three most popular models for categorizing incidents:

  1. Sequence of events model: Illustrates issues as a domino effect, with one event causing another.
  2. Epidemiological model: Presents problems in a system as present but undiscovered. Hardware and software flaws, as well as managerial and procedural issues, lie dormant until an incident uncovers them.
  3. Systemic model: Presents issues as the result of a system’s state being negatively affected by outside sources, such as people or organizations that create problems through a lack of knowledge or a strain on resources and time.

The Systemic model best represents how incidents are viewed in a modern post-incident world. Issues are the result of “normal” system behavior being disrupted by any number of unforeseeable, and potentially unavoidable, factors. Nothing truly went wrong; failures are inevitable and often come from multiple sources that culminate in a perfect storm. When IT professionals and organizations learn to view system failures this way, the opportunity for growth and learning is greatly enhanced.

Creating Excellent Post-Incident Reviews

In addition to accurately describing problems with these representative models, improving post-incident techniques provides the best environment for continued operational success. As system and software development cycles grow ever shorter, it’s increasingly critical to adopt efficient post-incident reviews, which include retrospective analysis. This begins with creating a clear flow of information between departments and systems, eliminating communication delays within the timeline of an incident, and streamlining feedback loops.

One of the main takeaways for teams after a post-incident review should be the ability to answer, “How do we know sooner?” and “How (specifically) will we improve?” By focusing on these two questions, several areas of improvement should emerge within the detection, response, and remediation phases of the incident lifecycle.
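One way to ground the question "How do we know sooner?" is to measure how long each phase of an incident actually took. The sketch below is a minimal Python illustration, not a prescribed method. It assumes a timeline with four hypothetical timestamps (`started`, `detected`, `acknowledged`, `resolved`), and the example times are invented for illustration.

```python
from datetime import datetime, timedelta


def phase_durations(timeline: dict[str, datetime]) -> dict[str, timedelta]:
    """Split an incident timeline into detection, response, and remediation time."""
    return {
        "detection": timeline["detected"] - timeline["started"],
        "response": timeline["acknowledged"] - timeline["detected"],
        "remediation": timeline["resolved"] - timeline["acknowledged"],
    }


def slowest_phase(durations: dict[str, timedelta]) -> str:
    """Name the phase that consumed the most time."""
    return max(durations, key=durations.get)


if __name__ == "__main__":
    # Hypothetical incident: timestamps are made up for this example.
    timeline = {
        "started": datetime(2017, 8, 25, 10, 0),
        "detected": datetime(2017, 8, 25, 10, 40),
        "acknowledged": datetime(2017, 8, 25, 10, 45),
        "resolved": datetime(2017, 8, 25, 11, 5),
    }
    durations = phase_durations(timeline)
    for phase, spent in durations.items():
        print(f"{phase}: {spent}")
    print("slowest phase:", slowest_phase(durations))
```

In this invented timeline, detection takes longer than response and remediation combined, which points the review toward monitoring and alerting rather than the fix itself.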

To better understand how modern post-incident strategies work together, chapter six explores the phases of a hypothetical outage. From detection through response, remediation, and analysis, this fictional case study demonstrates the lifecycle of an incident and how a systematic approach to learning from failure can positively influence the future. To read the full case study, download the complete eBook now.
