Post-Incident Reviews: Learning from Failure for Improved Incident Response

Jason Hand August 18, 2017

As technology advances at a breakneck pace, expectations and challenges simultaneously increase. Clients expect flawless service, 24/7 support, and quick, easy-to-implement solutions.

As an IT professional and DevOps evangelist, I’ve come to understand that to manage these expectations, new and updated methods for detecting, resolving, and improving systems need to evolve.

In my recent eBook, Post-Incident Reviews: Learning from Failure for Improved Incident Response, I describe modern methods for implementing a successful post-incident strategy. A free copy is available to share with your team.

Here’s a preview of top takeaways from the first three chapters.

Explore the Complete Event Timeline

This modern take on post-incident analysis depends on a reimagined approach to the review itself. Instead of focusing solely on one particular issue (i.e., the root cause) and its eventual fix, DevOps teams should be encouraged to explore the complete timeline of events in detail. Taking the time to examine what transpired, and what was and was not handled correctly, is essential to creating better strategies for the future.
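As a rough illustration of what "exploring the complete timeline" can look like in practice, here is a minimal Python sketch that merges events from several sources into one chronological list. The `Event` type, the `build_timeline` function, the source names, and the sample entries are all hypothetical and chosen only for illustration; they are not taken from the eBook or from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One entry in an incident timeline (hypothetical structure)."""
    timestamp: datetime
    source: str   # assumed labels, e.g. "alerting", "chat", "deploy"
    summary: str


def build_timeline(*streams):
    """Merge events from any number of sources into chronological order."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda event: event.timestamp)


# Invented sample data: two sources describing the same incident.
alerts = [
    Event(datetime(2017, 8, 1, 9, 5), "alerting", "Latency alert fired"),
    Event(datetime(2017, 8, 1, 9, 40), "alerting", "Alert cleared"),
]
chat = [
    Event(datetime(2017, 8, 1, 9, 0), "chat", "Deploy announced"),
    Event(datetime(2017, 8, 1, 9, 12), "chat", "On-call engineer acknowledges"),
]

timeline = build_timeline(alerts, chat)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.summary)
```

Reading the merged list from top to bottom puts what people said and did next to what the systems reported, which is the kind of context a narrow search for a single cause tends to miss.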

Add These Concepts to Your Toolbox

Here are a few leading practices that will help you prepare for more effective post-incident analysis:

  • Consider how questions are framed and posed to your DevOps teams. Avoid phrases that could be interpreted as judgmental, as they do little to improve morale or success. There’s a subtle difference between “Why did you do that?” and “What was your thought process?”, but that minor distinction can play a significant role in how your team responds to future incidents.
  • Do not rely on root cause analysis (RCA). Using RCA, DevOps teams are encouraged to pinpoint the primary cause of an incident and develop a plan or checklist to prevent similar incidents from happening in the future. This approach often results in a disconnect between departments, as perception can vary. For example, a DevOps team may see the “why” as a lack of budget or a need to upgrade technology, while upper management may flag the “why” as human error.
  • Blame the systems as much as the people. For IT professionals, finding and fixing problems is an endless, often thankless, uphill battle. The sooner organizations recognize and accept that problems and issues will continually arise, the sooner a proactive system can be put in place—one that strives to improve operations without discouraging people.

Make Changes and Test Them

So, how would I advise companies to begin improving their systems? In a word: change. It’s a simple concept, but most can agree that change is much easier to desire than to carry out. The only way to see if these revolutionary ideas work is to test them. That takes initiative, and initiative stems from confidence and entitlement.

Companies must strive to empower team members to make hard judgement calls under pressure and to be proactive movers rather than passive reactors to problems. A company that validates and embraces the human element when incidents and accidents occur learns more from a post-incident review than one that punishes the actions, omissions, or decisions of its people.

I am by no means advocating the tolerance of negligence, but a culture of fear limits growth, stifles learning opportunities, and encourages teams to hide potentially valuable information.

Make It Safe to Surface Knowledge and Mistakes

With all of that said, it’s crucial to maintain a level of accountability amongst teams. There’s an important distinction between discovering a problem and being held responsible for the eventual outcome. It’s a balancing act among sharing, productivity, and liability. When engineers feel safe to openly admit mistakes and discuss incident and remediation details, they surface both knowledge and experience, which allows the entire organization to learn something to help avoid a similar issue in the future.
