Why We Should Stop Asking Why: Evaluating the Merits of Root Cause Analysis

Todd Vernon - February 07, 2018


Systems fail, servers go down, shit breaks.

With the pressure of continuous uptime and world-class customer experience, we’re being asked to build, deploy, and operate our systems with increasing speed. Accordingly, our approach to incident management leaves little time for trial and error—we need to detect, respond, and remediate problems with accuracy that not only solves the problem, but prevents it from happening in the future.

In our effort to constantly improve our own internal incident management, we came to a startling realization about a limiting factor in our approach to problem solving: root-cause analysis.

When something breaks, we habitually ask a simple question: why? This “why” gets to the heart of root-cause analysis (RCA). However, we aren’t building and operating simple systems anymore. Instead, we’re building complex systems that change constantly.

How can we continue to ask simple questions of complex systems and expect to arrive at a meaningful resolution?

Our Bias Towards Root-Cause Analysis

When it comes to understanding causality of IT problems, it’s natural for us to take a Newtonian approach to reasoning. Accordingly, we think about service disruptions and outages as a sequential series of linear, one-by-one events, moving from a healthy state to an unhealthy state.

This method of reasoning can lead us to an assumption: simply by asking the “right” questions, we can, eventually, trace the effects back to the original source, i.e. root cause.

At VictorOps, we’ve noticed how a bias towards identifying a singular root cause diminishes our opportunity for exploring and learning more about how a system actually works. Root cause then becomes a limiting lens through which we see the world, establishing a false model of system reliability and incorrectly attributing failure to a single, static component or construct.

In our experience, most organizations aren’t blind to the variety of factors that contribute to a system failure. However, during an outage, we tend to go into autopilot and default to habits. In this situation, here’s how things tend to break down:

  • Teams identify multiple “root” causes of the incident
  • They (correctly) point out that a variety of factors conspired to “cause” the problem
  • But the write-up unintentionally sends a misleading signal, framing a single element as the sole reason for the failure
  • The conclusion becomes that the absence of that specific root element would make the system safer/better

Making the Case Against Root-Cause

Let’s pause here for a moment, because I’m sure the headline above was enough to make certain engineers’ blood boil.

We’re not calling for the elimination of root-cause analysis. In certain instances, especially in highly simple systems, root cause can provide a direct, straightforward answer; sometimes there really is only one root cause. Instead, our hypothesis is this:

With the compounding complexity of modern IT systems, the accumulation, expansion, removal, and modification of interconnected components—each with their own ways of interacting with other parts of the system—distilling failure (and success) down to a single entity is, often, a logical fallacy.

Thus, with root-cause analysis, causality becomes a limiting construct rather than a definitive method for discovering “truth.”

Post-accident attribution [of the] accident to a ‘root cause’ is fundamentally wrong. Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is [necessarily] insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident.

Richard I. Cook, MD, How Complex Systems Fail

The output of identifying the root causal factor is typically a set of “corrective actions” to put the system back where it was prior to failure and, presumably, prevent the same problem from recurring.

NOTE: You can never put a system back where it was. Even a roll-back isn’t going to return the system to its original state, only the codebase.

Exploring correlation and causal factors during post-incident reviews is important and immensely beneficial. Such a process allows teams to build a more accurate mental model of how their systems actually work. However, when focus is placed solely on identifying a cause, we miss the opportunity to learn and explore more about what lies behind the many layers of abstraction separating engineers from the systems they build and operate, not to mention the most important part of responding to today’s IT problems:

             How well did the team do at detecting, swarming to, and resolving the issue?

Reducing the time of the first three phases of the incident lifecycle (Detection, Response, Remediation) means reducing the time to recover (TTR) from inevitable and unpredictable problems—not to mention the cost of an outage.

Cost of downtime = Deployment Frequency × Change Failure Rate × MTTR × Hourly Cost of Outage (State of DevOps Report 2016, p. 7 - DORA and Puppet)
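
To make the formula concrete, here is a minimal worked example in Python; every number in it is a hypothetical placeholder for illustration, not DORA or VictorOps data:

```python
# Rough annual cost-of-downtime estimate using the State of DevOps (2016) formula.
# Every number below is a hypothetical placeholder, for illustration only.

deployments_per_year = 500       # deployment frequency
change_failure_rate = 0.15       # fraction of changes that cause a failure
mttr_hours = 1.5                 # mean time to recover, in hours
hourly_cost_of_outage = 10_000   # dollars lost per hour of downtime

cost_of_downtime = (
    deployments_per_year
    * change_failure_rate
    * mttr_hours
    * hourly_cost_of_outage
)

print(f"Estimated annual cost of downtime: ${cost_of_downtime:,.0f}")
# 500 x 0.15 x 1.5 x 10,000 = $1,125,000 per year
```

Notice that MTTR is the factor the response team most directly controls: halving it halves the estimate, which is why the rest of this post focuses on detection and response rather than on hunting for a single root cause.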


The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health—that is MTTR (Mean Time To Recover).

Benjamin Treynor Sloss (Google) - “Site Reliability Engineering” - How Google Runs Production Systems

By relieving ourselves from obsessing about root cause, we can begin to create a future state—ready and prepared for the reality of responding to “unknown unknowns.” This, in turn, sets us up to focus more on methods of “knowing about” and “recovering from” the nearly infinite possibilities of today’s complex and unpredictable IT systems.

Learning from Our Own RCA Mistakes

VictorOps Incident on December 22, 2017

Recently, VictorOps experienced a minor service disruption. There was very little impact to the customer. Much of that had to do with our efforts to be prepared for failure. It is inevitable, after all.

Here’s how it went down:

On December 22, the VictorOps production Cassandra cluster experienced failures affecting alert delivery and functionality for customers. Alerts were still being processed and notifications were being sent to customers, but there were delays in processing. This affected a few customer incidents, and they noticed it first—whoopsie! Read more on the incident here.

NOTE: We didn’t see this problem during the months of testing in our staging environment.

Upon initial analysis, one might claim a bug in the specific version of the recently deployed database was the root cause of the two-hour incident.

And that’s true, sort of. A bug was discovered, so presuming that the absence of that specific defect would have allowed our systems to operate as expected is somewhat reasonable.

However, this version of the database behaved just fine in our production environment for eight days before things went sideways. It wasn’t until several contributing circumstances emerged that a disruption occurred (during the holiday break). As Sidney Dekker might say, we “drifted into failure.”

The bug had been present in our production environment for over a week under “normal” conditions—a latent problem waiting for just the right conditions to emerge. The bug was part of the problem but not the cause.

The database and components of the VictorOps service had been operating at the edge of failure; our engineering team had no idea of this until all of the conditions were just right.

Engineering time is valuable and long meetings can be expensive in the absence of actionable takeaways from an incident retrospective. We choose to spend little time theorizing and memorializing the root source of a unique problem—and more time building a stronger understanding of system functionality and improved incident detection and response.

We want to be proactive. Not reactive. We want to learn as much as possible and move the needle towards the goal of deeper understanding of the system rather than any notion of cause for individual incidents. There is much to learn about the system and what is behind the multiple layers of abstraction.

Had we spent our 90-minute Post-Incident Review focusing solely on identifying the root cause, we would never have explored questions like:

  • Did we know about this first, or did the customer discover it?
  • How did the customer contact us and how did the response go? Any suggestions for improving that experience?
  • How soon were engineers triaging the problem?
  • What helpful context would have been useful to first responders?
  • When should we update our status page? And who is responsible for that?
  • What tools were helpful in diagnosing the problem and is everyone aware of and trained on how to use them?
  • What assumptions about the behavior of the system AND the response team’s formation were true or false?
  • How can we detect something similar to this faster (i.e. before a customer sees it)?
  • How much time elapsed during each phase and what ideas do we have to shorten each? (See the timing sketch after this list.)
  • Are there countermeasures we can put in place to minimize the impact if this happens again? Graceful failure?
  • What were everyone’s assumptions about what was happening and remediation theories?
  • How could this have gone better?
  • What aspects of the system are now more clear as a result of this discussion?
  • What new questions does this raise about existing assumptions about the system?
  • What ideas can we generate around how to test those theories?
  • Are there new metrics we can put in place to collect data related to those theories?
  • Anyone from non-technical teams in attendance have anything else to share or ask?
  • What was your biggest takeaway from this exercise?
  • Any suggestions on how we can improve our next PIR?
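
To make the phase-timing question above concrete, here is a minimal sketch that splits an incident into detection, response, and remediation phases. The timestamps are illustrative; only the 2:15 a.m. acknowledgment corresponds to the incident described later in this post.

```python
from datetime import datetime

# Hypothetical incident timeline; only the 2:15 a.m. acknowledgment is real.
timeline = {
    "failure_started": datetime(2017, 12, 22, 2, 5),
    "detected":        datetime(2017, 12, 22, 2, 13),  # customer contacted support
    "acknowledged":    datetime(2017, 12, 22, 2, 15),  # first responder acked in VictorOps
    "resolved":        datetime(2017, 12, 22, 4, 5),
}

# Map the timeline onto the incident lifecycle phases discussed above.
phases = {
    "Detection":   timeline["detected"] - timeline["failure_started"],
    "Response":    timeline["acknowledged"] - timeline["detected"],
    "Remediation": timeline["resolved"] - timeline["acknowledged"],
}

for phase, duration in phases.items():
    print(f"{phase:<12} {duration}")

print(f"{'Total TTR':<12} {timeline['resolved'] - timeline['failure_started']}")
```

Each phase points to a different investment: better monitoring shortens detection, better on-call routing shortens response, and better runbooks and shared context shorten remediation.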

The result of our PIR was a list of observations and action items to increase our reliability and uptime, rather than an artifact identifying causality.

In less than a day, our engineering team had already established new metrics around the health of our Cassandra database to improve our ability to detect a range of issues related to our critical paths of alert delivery and incident management. We believe this is a better approach to incident retrospectives in the realm of IT failure.
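
As an illustration of what such detection metrics can look like, here is a minimal sketch; the metric names, the thresholds, and the get_metric() helper are hypothetical placeholders, not the actual VictorOps monitoring stack:

```python
# Sketch of a Cassandra health check guarding the alert-delivery path.
# Metric names, thresholds, and the metrics backend are hypothetical.

THRESHOLDS = {
    "cassandra.pending_compactions": 100,       # compaction backlog
    "cassandra.write_latency_p99_ms": 50,       # 99th-percentile write latency
    "cassandra.dropped_mutations_per_min": 1,   # writes the cluster dropped
}

# Canned sample readings standing in for a real metrics query.
FAKE_SAMPLES = {
    "cassandra.pending_compactions": 240.0,
    "cassandra.write_latency_p99_ms": 12.0,
    "cassandra.dropped_mutations_per_min": 0.0,
}

def get_metric(name: str) -> float:
    """Placeholder for a real metrics backend; returns canned samples."""
    return FAKE_SAMPLES[name]

def check_cassandra_health() -> list:
    """Return human-readable warnings for any breached threshold."""
    warnings = []
    for metric, limit in THRESHOLDS.items():
        value = get_metric(metric)
        if value > limit:
            warnings.append(f"{metric} = {value} exceeds limit {limit}")
    return warnings

if __name__ == "__main__":
    for warning in check_cassandra_health():
        # In practice this would page the on-call engineer, not print.
        print(f"WARNING: {warning}")
```

The specific metrics matter less than the outcome: the PIR produced concrete detection work instead of a single named root cause.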

As engineers, each of us must continue to ask, “Who owns our availability?”

We do.

At VictorOps we value learning, discussion, and knowledge transfer about how our service works. This curiosity is of far greater value than identifying or constructing a cause for each and every incident that occurs, forever.

At the end of the PIR, our team had a stronger understanding of how the system actually behaves under real-world circumstances, including how we respond to an unknown problem.

In this case, approximately two minutes elapsed from the time a customer contacted support to when our first responder acknowledged the incident (within VictorOps) at 2:15 a.m. on Friday, December 22, just as our team was beginning to enjoy the holiday break.

One-to-one cause and effect does NOT apply to complex systems! Definitively drawing a conclusion of root causality in IT systems is a fallacy. There’s always more to the story.

Thanks for nothing, Isaac Newton!