Jason Hand is DevOps Evangelist at VictorOps and author of Post-Incident Reviews, a new ebook about learning from failure to improve incident response. Jason spoke with Marlo Vernon about the writing experience. This interview has been edited and condensed.
MV: What prompted you to write this book?
JH: Since I’ve been working at VictorOps, the topics covered in the book, especially blameless postmortems, have been not only the most relevant to the company, but also the most interesting to me. It’s a very different perspective from what you typically hear in business.
The concept that somebody always has to be held accountable is a twist on what I had learned about IT and about managing problems. There are many guides, templates, and blog posts about doing postmortems, but there isn’t a lot of material that goes deeper into why we do them, their purpose, their value, how to understand complex problems within systems, root cause, things like that. I thought this topic was in line with what VictorOps stands for, and is helpful for a lot of people.
MV: How would you define a post-incident review?
JH: Post-incident reviews are part of a family of retrospectives called learning reviews. The point of them is simply to learn as much as possible about what the “system” is doing at any given time, including the people part.
When you do a learning review, you try to expose as many observations or learnings as you can. Whenever you identify information that isn’t common knowledge, it should be surfaced and shared among the group. That way, everybody has the same awareness and the same expectations about what should happen.
There are many different things you can learn: how people interact with each other, where information on how to resolve incidents is stored, and who has access to which systems when they need them. A lot of this isn’t discussed in most settings.
MV: Tell me about the lifecycle of an incident and how a post-incident review can affect it.
JH: The lifecycle of an incident, as we define it at VictorOps, goes through five phases and starts with detection.
The detection phase kicks in when some expectation of the software or infrastructure is not being met. Monitoring services will detect the situation and then let somebody know.
The next phase is the response phase, which is focused on getting the right people together to deal with what’s happening. So if you’re using a service like VictorOps, there will be some sort of paging and escalation policy in place, designed to make sure that the right people are alerted to the problem as soon as possible.
The third phase is the remediation phase, when the people have looked into what’s going on and are starting to take steps to fix the problem and restore service. Once service is restored, the incident is “resolved,” paging stops, and the incident can be closed.
The fourth phase brings in some sort of analysis of that whole process. You go back to the beginning and ask, “What can we do better about that? What happened in the response phase? Can we shorten the time it took to respond? What happened in the remediation phase?”
Then the fifth and final phase is the readiness phase. This is where you take what you learned and make proactive changes, enhancements, improvements, and countermeasures to make the system better. That way, the next time something happens, the whole lifecycle might be shortened because you’ve found a better way to detect the issue or solve the problem. But you’ve got to go through all five phases to get there.
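As a rough illustration of the lifecycle described above, the five phases and the durations a review might examine can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the `Phase` and `Incident` names, the three timestamps, and the two duration properties are hypothetical and are not part of any VictorOps API or of the book.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Phase(Enum):
    """The five lifecycle phases named in the interview, in order."""
    DETECTION = 1
    RESPONSE = 2
    REMEDIATION = 3
    ANALYSIS = 4
    READINESS = 5


@dataclass
class Incident:
    # Hypothetical timestamps; a real tool may record different events.
    detected_at: datetime   # monitoring noticed an unmet expectation
    responded_at: datetime  # the right people were alerted and engaged
    resolved_at: datetime   # service restored, paging stopped

    @property
    def time_to_respond(self) -> timedelta:
        """Time between detection and responders engaging."""
        return self.responded_at - self.detected_at

    @property
    def time_to_remediate(self) -> timedelta:
        """Time between responders engaging and service being restored."""
        return self.resolved_at - self.responded_at


incident = Incident(
    detected_at=datetime(2024, 1, 1, 10, 0),
    responded_at=datetime(2024, 1, 1, 10, 5),
    resolved_at=datetime(2024, 1, 1, 10, 45),
)
print("Time to respond:", incident.time_to_respond)
print("Time to remediate:", incident.time_to_remediate)
```

Durations like these are the kind of thing the analysis phase can question ("Can we shorten the time it took to respond?"), and the readiness phase then tries to reduce them for the next incident.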
MV: You stress that you shouldn’t blame someone when conducting a post-incident review. Why is that so important?
JH: The main reason why blaming can be so detrimental is that it actually creates incentives to not share information. If people don’t feel that it’s safe to share information so that you can learn as much as you can, then you’re missing out on the opportunity to make things better. I might be afraid to speak up, even though I know that there’s a problem in the system. If I had spoken up, I could have helped the team make improvements. It has to be a safe space.
MV: What are the main takeaways from your book?
JH: The first takeaway is that your best bet for increasing system uptime is to improve the ways that you learn about and respond to problems. There are a lot of companies that spend too much time focusing on “root cause.”
This leads me to the second takeaway: please consider moving away from searching for a root cause. Most systems are complex, so there is usually not a single root cause of a problem. I hope people recognize that when you think you’ve found a root cause, you stop asking questions. All those additional questions could have been helpful for learning more about what’s going on.
The third takeaway is that there is no real prescriptive approach to conducting post-incident reviews. I offer some basic guidelines and some suggestions of things you want to try to capture, but what works for us at VictorOps, or what works for any of our customers, isn’t going to work for every single team. Hopefully you’ll absorb the ideas in the book, take them back to your teams, and design your own post-incident review in a way that’s going to benefit you.
Read Jason Hand’s O’Reilly eBook, Post-Incident Reviews.