After his podcast interview about blameless postmortems with Richard Campbell of RunAsRadio, DevOps evangelist Jason Hand sat down with me to talk more about blame, the benefits of transparency, and the problems with root cause analysis.
**JK: In the podcast, you discuss the concept of a blameless culture. What takeaways would you share?**
JH: The concepts of a “blameless culture” or a “just culture” are part of the same conversation. We talk about those ideas a lot within our industry, but they are even more prevalent in the airline industry, which is why Matthew Syed’s book on Black Box Thinking is an important text for people in the IT space. The idea of learning from failure hasn’t really been widely embraced.
When it comes to having a broader, well-rounded life or job or industry, you have to try to understand that failure is natural and you can’t avoid it. Rather than being mopey or pointing fingers, you can just be logical about it. This is something that happens.
**The concept that you can’t engineer failure out of a system, and can only prepare yourself to detect and repair problems quickly, really stood out for me.**
The reason why we have seatbelts and airbags and all of this stuff in cars is because you can’t really prevent an accident. You can only reduce the impact of an accident. We sometimes think that the world is more simplistic than it is and that there is always a cause and effect. And if we could understand the cause we could control the effect. Unfortunately, things aren’t that simple.
Many people dig in their heels and say, “We can’t just give people a free pass to screw up.” A lot of people have a hard time grasping this and want to hold someone “accountable.” What they really mean is that they’re going to hold someone responsible for making the mistake. But that person is part of a complex system, and holding them responsible for mistakes runs counter to making improvements.
**Which brings us to root cause analysis, and how it can have a negative impact, which is a controversial perspective.**
The phrase “root cause” itself is misleading: people expect the root cause to be a single thing instead of a series of things. We have tried to make our world a lot simpler than it is, and sometimes we don’t want to acknowledge that it’s very complex. There are things that are part of the system that we have no control over and just don’t see.
If we’re doing a post-incident analysis, and all we’re trying to do is make our system better so that our business performs better and our customers are happy, then pointing out individual people or individual root causes makes us spin our wheels.
A metaphor I heard recently is the treadmill of innovation: you’re spending a lot of energy but going nowhere. You’re just assigning blame, making that person afraid to speak up the next time something doesn’t seem right, so they become really good at hiding details about what took place.
Etsy gives out the three-armed sweater award. If someone does something that causes a bad result for the system, they are awarded a three-armed sweater. Even if you melted down the system in a spectacular way, you just uncovered a flaw in the system. It could have been any one of us, and we now have a deeper understanding of our system. Why–or I should say how–did that happen?
Asking “why” insinuates a personality defect: why is it that Jason tipped over this water glass, or slept through his alarm? By asking “how,” we ask what else we can do or try, so that if we’re in this situation again, we reduce the likelihood that the problem will recur. How are we trying to respond to a problem? How do we make it so that when a database fails, it has less of an impact on users? Something along those lines.
And if something happened and we both sat down and conducted our own 5 Whys, we’d each come up with our own root cause. At the end of the day, the same thing would happen again because we didn’t look at it from a bigger picture. We can’t make it so we won’t wreck our cars; we can just make it so that we can walk away. People don’t want that. People want to know that they can control the outcome of things and prevent bad things from happening.
**It seems like we have to redefine what control means. It seems more realistic to focus on fixing things faster and lessening the impact versus making fewer mistakes.**
I’ve never liked the idea of incident prevention, because incidents and outages can’t be prevented completely. Instead of prevention, I like the idea of incident preparedness. Are you prepared and able to respond to a problem quickly? Are you able to gather your team and do everything necessary to deal with the problem? Because you certainly aren’t going to prevent everything; you’re spending a lot of resources trying to accomplish the impossible, and our own egos get in the way. We want to improve. As a result, we will likely prevent the common problems we’ve seen before, but more importantly, we are prepared and rehearsed to deal with anything new that comes our way.
The reason a lot of companies are taking this approach is to uncover areas of improvement and become a learning organization. That’s why we have this new emphasis on understanding contributing factors rather than focusing on one single thing to fix.
This interview has been edited and condensed.
To hear more of his thoughts, listen to Jason Hand’s RunAsRadio podcast interview.
**Further reading:**

- The Infinite Hows (or, the Dangers of the Five Whys) by John Allspaw
- Thinking, Fast and Slow by Daniel Kahneman
- Black Box Thinking: Why Most People Never Learn From Their Mistakes — but Some Do by Matthew Syed
- Beyond Blame: Learning from Failure and Success by Dave Zwieback