As technology advances at a breakneck pace, expectations and challenges simultaneously increase. Clients expect flawless service, 24/7 support, and quick, easy-to-implement solutions.
As an IT professional and DevOps evangelist, I’ve come to understand that managing these expectations requires new and updated methods for detecting and resolving problems and for improving systems.
In my recent eBook, Post-Incident Reviews: Learning from Failure for Improved Incident Response, I reveal modern methods for implementing a successful post-incident strategy. For a free copy to share with your team, click here.
Here’s a preview of top takeaways from the first three chapters.
This modern take on post-incident analysis relies heavily on a reimagined approach. Instead of focusing solely on one particular issue (i.e., the root cause) and its eventual fix, DevOps teams should be encouraged to explore the complete timeline of events in detail. Taking the time to truly examine what transpired, and what was or was not handled correctly, is essential to creating better strategies for the future.
Here are a few leading practices that will help you prepare for more effective post-incident analysis:
So, how would I advise companies to begin improving their systems? In a word: change. It’s a simple concept, but most would agree that change is much easier to desire than to actually do. The only way to see if these revolutionary ideas work is to test them. That takes initiative, and initiative stems from confidence and entitlement.
Companies must strive to empower team members to make hard judgement calls under pressure and to be proactive movers rather than passive reactors to problems. A company that validates and embraces the human element when incidents and accidents occur learns more from a post-incident review than one that punishes people for their actions, omissions, or decisions.
I am by no means advocating for the tolerance of negligence, but a culture of fear minimizes growth, stifles learning opportunities, and encourages teams to hide potentially valuable information.
With all of that said, it’s crucial to maintain a level of accountability amongst teams. There’s an important distinction between discovering a problem and being held responsible for the eventual outcome. It’s a balancing act among sharing, productivity, and liability. When engineers feel safe to openly admit mistakes and discuss incident and remediation details, they surface both knowledge and experience, which allows the entire organization to learn something to help avoid a similar issue in the future.
Want more? Download the full eBook now.