Post-incident reviews, commonly called post mortem reports, are a critical and often undervalued part of the incident lifecycle. DevOps-centric teams simply can’t improve without retrospective, blameless analysis of incident response and remediation. Teams can improve operational efficiency by compiling helpful data such as incident details, playbooks and chat history in a single place, then using it to conduct detailed post-incident reviews that improve the on-call experience for the people involved.
Many great teams work to find the root cause of an incident or problem. While this is a great start, finding the technical root cause isn’t enough. Root cause analysis is only one part of the post-incident review process. Understanding the shortcomings of your technical system, alongside your team’s human response, leads to a more holistic process for continuous improvement.
Although this quote from George Santayana isn’t specific to software development and IT operations, it applies to any topic where value can be found in historical records:
“Those who cannot remember the past are condemned to repeat it.” - George Santayana
Monitoring, alerting, remediating and repeating isn’t a sustainable process. As you continue to deploy new features and build more integrated services, you need to silence unactionable notifications and add reliability to your current architecture in production. By thoroughly analyzing the incidents occurring in your applications and infrastructure, engineers will better understand issues when they pop up and can reduce mean time to acknowledge and mean time to resolve (MTTA/MTTR). Then, once incidents are resolved, you can proactively assess and improve the reliability of your system.
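As a rough illustration of how MTTA and MTTR could be tracked from incident records, here is a minimal Python sketch. The record fields (`triggered`, `acknowledged`, `resolved`) and the sample timestamps are assumptions made up for this example, and it measures both metrics from the moment the incident was triggered, which is one common convention rather than the only one.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names and timestamps are
# illustrative assumptions, not the schema of any particular tool.
incidents = [
    {
        "triggered": datetime(2024, 1, 1, 10, 0),
        "acknowledged": datetime(2024, 1, 1, 10, 4),
        "resolved": datetime(2024, 1, 1, 10, 34),
    },
    {
        "triggered": datetime(2024, 1, 2, 22, 0),
        "acknowledged": datetime(2024, 1, 2, 22, 6),
        "resolved": datetime(2024, 1, 2, 22, 56),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# MTTA: trigger -> acknowledgement. MTTR: trigger -> resolution.
mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Tracking these two numbers over time, rather than for a single incident, is what shows whether your post-incident reviews are actually changing how quickly the team responds.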
In addition to keeping centralized records for machine data and chat, encouraging a positive, blameless culture is just as important. Outages are going to happen, no matter how robust your system. Without blame, on-call responders are more inclined to act quickly during incident response and less likely to hide information after the fact. A blameless culture rewards people who are willing to take action and collaborate, and bolsters the overall reliability of the services you create over time.
By putting developers on call and pushing a DevOps culture focused on developer exposure to systems in production, accountability for code and heightened transparency across all workflows, teams improve the reliability of the services they build and maintain. With more open conversations across disciplines and less focus on assigning blame, you’ll tighten the relationship between developers and IT operations, leading to more robust applications and services. New ideas also come up more often in an open, blameless culture, helping your product and business drive faster innovation and better customer experiences.
So, how do you build a process for conducting post-incident reviews? Let’s dive into a general process flow that helps DevOps teams get the most from their post-incident reviews.
As we look into building a post-incident review process, keep in mind that every team operates differently and a one-size-fits-all template doesn’t work for everyone. But there are always a few key goals you should try to achieve from your post-incident reviews, and some best practices and actionable metrics to track to help you get there:
(Side note: Check out our post-incident review template to download an interactive PDF that you can use as a starting point for your own team.)
While this is a rough layout for a post-incident review process, it’s a great place to start asking the questions that lead to post-incident review best practices. An outage won’t wait for you, so it’s always best to be prepared for the worst. Post-incident reviews and post mortems lead to an at-the-ready, prepared on-call incident response team, and a more prepared team allows you to build new services faster while reducing MTTA/MTTR for incidents that do come up.
In our State of On-Call Report, we found that, on average, 73% of the incident lifecycle is spent on incident response. So, post-incident reviews that are simply focused on processes and tooling – and not the people involved – won’t holistically improve the incident lifecycle over time. Painting the full picture of what happens during an incident leads to deeper insights and helps teams optimize the human part of being on-call.
After an outage, a holistic post-incident review or post mortem includes collaboration details, not only root cause analysis of what went wrong technically. Continuous improvement of the people behind the product clearly bleeds into the reliability and efficiency of the applications and services you create. So, a post-incident review process focused on incident response is essential to DevOps success.
But without transparent incident data and communication, thorough post-incident reviews are hard to create. Being more transparent, alongside a blameless culture, leads to a better understanding of how people are building new services and how they’re resolving outages when they happen. Through greater workflow transparency and collaboration, DevOps teams can be more proactive about adding reliability to the services they build and maintain.
A single source of truth for post-incident details makes it easier for an incident commander or team lead to compile data after the fact and build a useful post mortem report. You can build a full history of the tools used, tasks performed and communication that took place during the incident to truly understand what worked well and what didn’t.
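To make the idea of one combined incident history concrete, here is a small Python sketch that merges events from several sources into a single chronological timeline. The `Event` type, the source names (`monitoring`, `chat`, `runbook`) and the sample events are all hypothetical, invented for this example; a real team would pull this data from whatever monitoring, chat and task tools it already uses.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import chain


@dataclass(frozen=True)
class Event:
    """One entry in the incident history (hypothetical structure)."""
    at: datetime
    source: str
    detail: str


def build_timeline(*streams):
    """Combine events from every source into one list ordered by time."""
    return sorted(chain.from_iterable(streams), key=lambda e: e.at)


# Illustrative sample data only.
alerts = [
    Event(datetime(2024, 1, 1, 10, 0), "monitoring", "CPU alert fired"),
    Event(datetime(2024, 1, 1, 10, 30), "monitoring", "CPU alert cleared"),
]
chat = [
    Event(datetime(2024, 1, 1, 10, 5), "chat", "Responder acknowledged the page"),
]
tasks = [
    Event(datetime(2024, 1, 1, 10, 20), "runbook", "Restarted the service"),
]

timeline = build_timeline(alerts, chat, tasks)

for event in timeline:
    print(f"{event.at:%H:%M} [{event.source}] {event.detail}")
```

Keeping machine data and human communication in the same ordered view is what lets a reviewer see not only what broke, but who did what in response and when.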
No matter what you want to call them – post-incident reviews or post mortems – you can’t deny their usefulness in building robust applications and services. By first building a blameless, DevOps-centric culture focused on continuous improvement and then implementing a process for conducting thorough post-incident reviews, you’ll be able to build the future faster.