The Analysis phase of the Incident Lifecycle occurs after an incident is resolved, and focuses on learning. This phase is often called a postmortem, but increasingly teams prefer the terms "Post Incident Review" (PIR) or "Post Incident Analysis" to describe the event and the activities surrounding the unpacking of the causal factors that led to an undesirable state.
For most of my career, and indeed for most of the Information Age, this phase has been viewed through the lens of Root Cause Analysis (RCA). RCA is a helpful framework that codifies an approach to understanding the underlying cause of a failure in a system. While inclusive of contributing factors, RCA encourages practitioners to "get to the bottom" of a problem through a relentless, dogged hunt for a single causal entity.
Simple Systems, Simple Answers
RCA has its roots in the manufacturing plants at Toyota. Over the years it has grown and changed to encompass increasingly complex systems and interactions. In our own industry, RCA has been the predominant method of fault analysis. Engineers rarely adopt things that don't work, and that is a valid data point when considering the efficacy of RCA in helping a team understand a failure.
While that is true, it's important to evaluate RCA in its proper context. For most of the Internet Age, software teams built relatively simple monoliths; 3-tier architecture has dominated the design paradigms of the past 20 years. While these design choices can lead to some complex realities, much of that complexity was managed either down or out by the software development practices of the age, namely Waterfall.
RCA was born of an age where changes were highly controlled, infrequent, and massively coordinated events. Whether in manufacturing, process, or software, minimum change as a concept took hold as a way to manage unknowns and risk.
Certainly in software there were good reasons: it used to be hard to write code. Transferring data took a long time. Disks were slow to write. Stated differently, Agile as a practice could not have succeeded 30 years ago; the systems were simply too slow to absorb rapid change.
State is a Vector
As we all know, the days of monolithic systems are behind us. Today, teams are building microservices, distributed data compute platforms, and adaptable machine learning algorithms. These architectural choices are born of a desire to enable rapid changes to infrastructure and de-coupled lines of dependence. Mostly, they deliver on that promise, while creating fairly massive operational complexity. For an incident response team, success and career longevity are based on their ability to successfully manage that complexity into a reliable, expected system state.
Thinking about system state is important when your team is in the Analysis phase. Frequently, an incident responder starts off diagnosing a problem by asking the question, "What changed?". This is a natural first step, but it hearkens back to RCA, 3-tier architecture, and the old days of annual, not hourly, change. The question itself implies a static view of system state: previously, before the alert, everything was fine. Now there is an alert, something must have caused it, and that something was probably a change. This simplistic view is the mental state most native to systems with static, easy-to-describe states.
Suppose an incident is the result of a feature flag activating an under-tested user story, leveraging previously dark code and exhibiting a bug under a particular volume, which, in turn, floods a queue with poor autoscaling characteristics… What was the Root Cause? The feature flag would be fine if the volume was lower, if the path had been better tested, if the queue was better at scaling… So are they all at fault? Should we have four RCAs, each associated with one of the component failures? Or is RCA improperly focusing us on these static views, when the CI/CD reality is that system state is best described as a vector?
Cynefin: A Better Approach
As teams wrestle with these new patterns and realities, they look for better tools to aid them. Much as Agile was the tonic for slow feature delivery and all-or-nothing releases, I believe Cynefin is the tool to help on-call teams manage this complexity. Created in the early 2000s at IBM by Dave Snowden, Cynefin invites teams to accept a nuanced, complex understanding of these systems.
Analysis with Cynefin starts by mapping the behavior you’re trying to understand to one of four quadrants: Simple, Complicated, Complex, and Chaotic. Each quadrant suggests a course of action, appropriate for the inherent similarities in patterns at increasing levels of complexity.
Simple patterns require little more than an awareness (Sense), a categorization of the type of pattern (Categorize), and the Response. A hung server, instance, or container needs to be restarted. Simple!
Complicated patterns represent known unknowns: patterns we suspected might occur but that have never previously presented. You likely have metrics around these problems but no alerts or dashboards… yet. Here the approach is to Sense, Analyze, and then Respond.
Complex patterns are the unknown unknowns: inherently complex, or appearing so because we don't yet have data around them. Here you've got to start by probing (searching, digging in), Sense as new data is gathered, and then Respond.
Chaotic patterns are rare, thankfully. Something is going wrong in a big way, and no prior experience with the system seems helpful. Buckle up, it's going to be a long ride. Here a team is encouraged to Act: disrupt, change, or stop something, see how that affects the overall pattern, and iterate.
Interwoven throughout is the possibility of an event slipping into Disorder. Disorder is a human artifact: it arises when teams fail to reach agreement, fail to collaborate or communicate, or otherwise operate in unconstructive ways. There lies Disorder.
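As a mental model, the quadrant-to-action mapping above can be sketched as a tiny lookup. This is purely an illustration; the names `Domain` and `recommended_loop` are mine, not part of Cynefin itself:

```python
from enum import Enum


class Domain(Enum):
    """Cynefin domains and the action loop each one suggests."""
    SIMPLE = "Sense -> Categorize -> Respond"       # e.g., restart the hung container
    COMPLICATED = "Sense -> Analyze -> Respond"     # known unknowns: lean on data and experts
    COMPLEX = "Probe -> Sense -> Respond"           # unknown unknowns: experiment, then observe
    CHAOTIC = "Act -> Sense -> Respond"             # act first to stabilize, then observe
    DISORDER = "Regain shared understanding, then reclassify"


def recommended_loop(domain: Domain) -> str:
    """Return the action loop suggested for a pattern in the given domain."""
    return domain.value
```

The point of the sketch is simply that classifying the pattern comes first; the response strategy falls out of the classification.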
Dynamic at the Core
Cynefin is useful in the moment, or in the Analysis phase, for a team trying to understand a pattern and what to do with it. The real value of this approach, however, lies in the guidance it offers for improving incident response through practice and effort. When focus is given to a Complex pattern like the example above, by building better auto-scaling or improving test efforts, we can move it from Complex to Complicated. Effort changes the game as patterns move clockwise, transforming into increasingly manageable states.
Conversely, poor management of these systems ensures that patterns regress (counter-clockwise). An oft-alerting disk (a Simple pattern) left unresolved eventually fails catastrophically, creating a Complicated, or perhaps even Complex, pattern. The Post Incident Review is the ideal time for teams to understand these patterns, including how their efforts can positively affect the outcome.
Start with PIR
Root Cause Analysis is a reactive framework. It exists only after the event, and does little to aid a team in thinking about how to improve and prepare for the future. RCA encourages a binary "works/failing" mentality through which we, correspondingly, view our systems. As our systems become more complex and our rates of change increase, we must adopt the right mental stance to be successful. Ultimately, RCA has been a useful framework, but its time has run out.
Adopting a new framework in your Post Incident Review is no small feat. Changing the way our systems work is hard; changing how our minds work doubly so.
A simple but effective answer to this problem is to ask yourself, "What quadrant is this?" at the beginning of each PIR. Ask again, "Was it actually in quadrant X?" as the team reviews and analyzes the event. Finish with the question, "How can we move this pattern from X to Y next time?" With practice and iteration you will develop muscle memory around these questions, and over time you'll find your mental stance changing along with them. The original question, "What changed?", becomes "What quadrant?".
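To make the ritual concrete, the three prompts above could be generated as a small checklist. This is a hypothetical sketch; the function name and signature are mine, purely for illustration:

```python
def pir_prompts(initial_quadrant: str, target_quadrant: str) -> list[str]:
    """Build the three anchor questions for a Post Incident Review.

    initial_quadrant: the quadrant the team called at the start of the PIR.
    target_quadrant:  the easier-to-manage quadrant the team wants to reach.
    """
    return [
        f"What quadrant is this? (initial call: {initial_quadrant})",
        f"Was it actually in the {initial_quadrant} quadrant?",
        f"How can we move this pattern from {initial_quadrant} "
        f"to {target_quadrant} next time?",
    ]


# Example: an incident first judged Complex, which effort should make Complicated.
checklist = pir_prompts("Complex", "Complicated")
```

Dropping these three questions into your PIR template is a low-cost way to build the habit before the mindset shift takes hold.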
Therein lies a key to successful analysis of patterns in Incident Management.
VictorOps offers a holistic solution for real-time collaborative incident response and provides comprehensive reporting to make detailed post-incident reviews a staple in your organization. Try out a 14-day free trial of VictorOps to see how it can make on-call suck less for your own team.