Todd Vernon February 07, 2018
Systems fail, servers go down, shit breaks.
With the pressure of continuous uptime and world-class customer experience, we’re being asked to build, deploy, and operate our systems with increasing speed. Accordingly, our approach to incident management leaves little time for trial and error—we need to detect, respond, and remediate problems with accuracy that not only solves the problem, but prevents it from happening in the future.
In our effort to constantly improve our own internal incident management, we came to a startling realization about limiting factors in our approach to problem solving: root-cause analysis.
When something breaks, we habitually ask the simple question: why. This “why” gets to the heart of RCA. However, we aren’t building and operating simple systems anymore. Instead, we’re building complex systems that change constantly.
How can we continue to ask simple questions of complex systems and expect to arrive at a meaningful resolution?
When it comes to understanding causality of IT problems, it’s natural for us to take a Newtonian approach to reasoning. Accordingly, we think about service disruptions and outages as a sequential series of linear, one-by-one events, moving from a healthy state to an unhealthy state.
This method of reasoning can lead us to an assumption: simply by asking the “right” questions, we can eventually trace the effects back to the original source, i.e., the root cause.
At VictorOps, we’ve noticed how a bias towards identifying a singular root cause diminishes our opportunity for exploring and learning more about how a system actually works. Root cause then becomes a limiting lens through which we see the world, establishing a false model of system reliability and incorrectly attributing failure to a single, static component or construct.
In our experience, most organizations aren’t blind to the variety of factors attributed to a system failure. However, during an outage, we tend to go into autopilot and default to habits. In this situation, here’s how things tend to break down:
Let’s pause here for a moment, because I’m sure the headline above was enough to make certain engineers’ blood boil.
We’re not calling for the elimination of root-cause. In certain instances, especially when approaching a highly simplistic model, root-cause can provide a direct, straightforward answer—sometimes there is actually only one root cause. Instead, our hypothesis is this:
With the compounding complexity of modern IT systems, the accumulation, expansion, removal, and modification of interconnected components—each with their own ways of interacting with other parts of the system—distilling failure (and success) down to a single entity is, often, a logical fallacy.
Thus, with root-cause analysis, causality becomes a limiting construct rather than a definitive method for discovering “truth.”
Post-accident attribution [of the] accident to a ‘root cause’ is fundamentally wrong. Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is [necessarily] insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident.
— Richard I. Cook, MD, How Complex Systems Fail
The output of identifying the root causal factor is typically a set of “corrective actions” to put the system back where it was prior to failure and, presumably, prevent the same problem from recurring.
NOTE: You can never put a system back where it was. Even a roll-back isn’t going to return the system to its original state, only the codebase.
Exploring correlation and causal factors during post-incident reviews is immensely beneficial. Such a process allows teams to build a more accurate mental model of systems and how they actually work. However, when the focus is placed solely on identifying a cause, we miss the opportunity to learn and explore more about what is behind the many layers of abstraction separating engineers from the systems they build and operate, not to mention the most important part of responding to today’s IT problems:
How well did the team do at detecting, swarming to, and resolving the issue?
Reducing the time of the first three phases of the incident lifecycle (Detection, Response, Remediation) means reducing the time to recover (TTR) from inevitable and unpredictable problems—not to mention the cost of an outage.
The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health—that is MTTR (Mean Time To Recover).
— Benjamin Treynor Sloss (Google) - “Site Reliability Engineering” - How Google Runs Production Systems
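To make the metric concrete, here is a minimal sketch of how the three phases and MTTR could be computed from incident timestamps. The record fields (`started`, `detected`, `acknowledged`, `resolved`) and the timestamps themselves are hypothetical, chosen only for illustration; they are not data from any real incident.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records. Field names and timestamps are
# illustrative assumptions, not real incident data.
incidents = [
    {
        "started": datetime(2018, 1, 1, 2, 0),
        "detected": datetime(2018, 1, 1, 2, 10),
        "acknowledged": datetime(2018, 1, 1, 2, 12),
        "resolved": datetime(2018, 1, 1, 3, 0),
    },
    {
        "started": datetime(2018, 1, 2, 10, 0),
        "detected": datetime(2018, 1, 2, 10, 5),
        "acknowledged": datetime(2018, 1, 2, 10, 8),
        "resolved": datetime(2018, 1, 2, 10, 30),
    },
]


def phase_durations(incident):
    """Split one incident into detection, response, and remediation time."""
    return {
        "detection": incident["detected"] - incident["started"],
        "response": incident["acknowledged"] - incident["detected"],
        "remediation": incident["resolved"] - incident["acknowledged"],
    }


def time_to_recover(incident):
    """Total time from the start of the problem to its resolution."""
    return incident["resolved"] - incident["started"]


def mean_time_to_recover(incidents):
    """Average time to recover across a list of incidents."""
    seconds = [time_to_recover(i).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


if __name__ == "__main__":
    for incident in incidents:
        print(phase_durations(incident), time_to_recover(incident))
    print("MTTR:", mean_time_to_recover(incidents))
```

Breaking TTR into phases shows where the time actually goes: shaving minutes off detection or response lowers MTTR just as much as a faster fix does.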
By relieving ourselves from obsessing about root cause, we can begin to create a future state—ready and prepared for the reality of responding to “unknown unknowns.” This, in turn, sets us up to focus more on methods of “knowing about” and “recovering from” the nearly infinite possibilities of today’s complex and unpredictable IT systems.
Recently, VictorOps experienced a minor service disruption. There was very little impact to the customer. Much of that had to do with our efforts to be prepared for failure. It is inevitable, after all.
Here’s how it went down:
On December 22, the VictorOps production Cassandra cluster experienced failures affecting alert delivery and functionality for customers. Alerts were still being processed and notifications were being sent to customers, but there were delays in processing. This affected a few customer incidents, and they noticed it first—whoopsie! Read more on the incident here.
NOTE: We didn’t see this problem during the months of testing in our staging environment.
Upon initial analysis, one might claim a bug in the specific version of the recently deployed database was the root cause of the two-hour incident.
And it’s true, sort of. A bug was discovered, so it is somewhat reasonable to presume that our systems would have operated as expected in the absence of that specific defect.
However, this version of the database behaved just fine in our production environment for eight days before things went sideways. It wasn’t until several contributing circumstances emerged that a disruption occurred (during the holiday break). As Sidney Dekker might say, we “drifted into failure.”
The bug had been present in our production environment for over a week under “normal” conditions—a latent problem waiting for just the right conditions to emerge. The bug was part of the problem but not the cause.
The database and components of the VictorOps service had been operating at the edge of failure; our engineering team had no idea of this until all of the conditions were just right.
Engineering time is valuable and long meetings can be expensive in the absence of actionable takeaways from an incident retrospective. We choose to spend little time theorizing and memorializing the root source of a unique problem—and more time building a stronger understanding of system functionality and improved incident detection and response.
We want to be proactive. Not reactive. We want to learn as much as possible and move the needle towards the goal of deeper understanding of the system rather than any notion of cause for individual incidents. There is much to learn about the system and what is behind the multiple layers of abstraction.
If we spent our 90-minute Post-Incident Review focusing solely on identifying the root cause, we wouldn’t have explored questions, including:
The result of our PIR was a list of observations and action items to improve our reliability and uptime, rather than an artifact identifying causality.
In less than a day, our engineering team had already established new metrics around the health of our Cassandra database to improve our ability to detect a range of issues related to our critical paths of alert delivery and incident management. We believe this is a better approach to incident retrospectives in the realm of IT failure.
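As a rough illustration of what a detection-oriented health metric can look like, the sketch below flags a sustained rise in a latency measurement over a rolling window. It is not our actual Cassandra instrumentation; the `LatencyMonitor` class, the threshold, and the window size are all hypothetical.

```python
from collections import deque
from statistics import mean


class LatencyMonitor:
    """Rolling-window check on a hypothetical latency metric (milliseconds)."""

    def __init__(self, threshold_ms, window=5):
        self.threshold_ms = threshold_ms
        # Keep only the most recent `window` samples.
        self.samples = deque(maxlen=window)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def is_unhealthy(self):
        # Wait for a full window so a single spike does not trigger an alert.
        if len(self.samples) < self.samples.maxlen:
            return False
        return mean(self.samples) > self.threshold_ms


if __name__ == "__main__":
    monitor = LatencyMonitor(threshold_ms=100, window=3)
    for latency in [50, 60, 70, 200, 300]:
        monitor.record(latency)
        print(latency, monitor.is_unhealthy())
```

A check like this says nothing about why latency rose. Its only job is to shorten the time between a problem emerging and a human knowing about it.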
As engineers, each of us must continue to ask, “Who owns our availability?”
At VictorOps, we value learning, discussion, and knowledge transfer about how our service works. This curiosity is of far greater value than identifying or constructing a cause for each and every incident that occurs, forever.
At the end of the PIR, our team had a stronger understanding of how the system actually behaves under real circumstances, including how we respond to an unknown problem.
In this case, it took approximately two minutes from the time a customer contacted support to the time our first responder acknowledged the incident (within VictorOps) at 2:15 a.m. on Friday, December 22nd, just as our team was beginning to enjoy their holiday break.
One-to-one cause and effect does NOT apply to complex systems! Definitively drawing a conclusion of root causality in IT systems is a fallacy. There’s always more to the story.
Thanks for nothing, Isaac Newton!