In the traditional IT Infrastructure Library (ITIL) approach to IT service management (ITSM) and IT operations, root cause analysis is required for effective incident management. Over time, though, DevOps and IT teams have learned that there’s rarely a single root cause. Sure, one action (e.g. a new deployment) can trigger one short-lived incident. But what about all the other actions leading up to that one? To understand the true root cause of an incident and improve service resilience, the team needs to look at all factors – people, processes and technology – both before incident detection and after incident resolution.
Better root cause analysis also requires a transparent culture focused on continuous improvement, not blame. Constantly looking to assign blame only drives employees to hide valuable information, resulting in a lack of visibility in incident management. Instead, modern DevOps-centric organizations conduct comprehensive post-incident reviews aimed at blameless corrective action that improves the entire system.
So, we wanted to take a human-centric approach to root cause analysis and lay out the template for better post-incident reviews.
Conducting real root cause analysis (RCA)
Thinking only about the technical root cause of a production incident means you’re looking at just half of the equation. In a world of continuous delivery and complex, highly-integrated applications and infrastructure, it’s likely an issue has more than one root cause. What actions did the team take leading up to the incident? How did the team identify the issue, and what did they do to respond? Was there an outage with a third-party dependency that’s out of your control? Or was it some combination of all of the above?
Modern root cause analysis (RCA) doesn’t look only at the technology involved in an incident; it also addresses the people and processes behind the technology. If a developer added code to a recent deployment that brought down the service, it’s likely not just that developer’s fault. Why was the issue missed during QA or undetected in staging? Can the team do a better job with automated testing to avoid this incident in the future? And once the incident was noticed, what can the team do better during the incident response phase?
Because human error is unavoidable, the best teams aren’t trying to eliminate it – instead, they’re finding ways to mitigate its impact on the greater system.
Humane root cause analysis
Root cause analysis needs to take human error into account and approach the situation holistically. In a world of continuous integration and delivery (CI/CD) and more frequent deployments, incidents are unavoidable. The only surefire way to improve service resilience is to create more redundancies and failover options, speed up rollbacks, limit single points of failure and better prepare for incident response.
So, let’s dive into our template for humane root cause analysis and see how DevOps and IT teams are building more reliable services while making incident management suck less.
The modern root cause analysis template
Root cause analysis in today’s software development and IT operations landscape requires a template that looks at how people, processes and technology interact with each other. IT professionals are no longer responsible for deploying code written in a silo by software developers. Developers and IT operations work together closely from the beginning of the software delivery lifecycle to the end – shortening feedback loops, ensuring adequate test coverage and catching incidents before they affect customers.
Below, the root cause analysis template shows the questions DevOps and IT teams should ask in order to adequately understand their systems and improve service resilience over time.
Key goals for root cause analysis
What can you do to improve incident detection speed?
How can you improve incident response and recover faster next time?
What did you learn about the system as a whole? (people, processes and technology)
- Where was the core issue? (bad code, testing, deployment, new hire onboarding?)
- What steps can you take to improve?
Key metrics to report on during a post-incident review
Time to acknowledge (mean time to acknowledge, or MTTA, measured across multiple incidents over time)
Time to recover (mean time to recover, or MTTR, measured across multiple incidents over time)
Time spent in each phase of the incident lifecycle
- When was the incident detected? (date and time)
- When was the service restored? (date and time)
- Who was the first person alerted to the incident?
- Who was the first on-call responder to acknowledge the incident and when was the incident acknowledged?
- Was the incident escalated? Who else helped out and when did they enter the firefight?
- What tasks or commands were executed, when were they executed and who executed them?
- Which tasks made a positive impact on incident remediation?
- Which tasks made a negative impact on incident remediation?
- Which tasks had no impact?
Maintain a record of the chat that took place during the time of the incident
- What kind of information was shared?
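To make the acknowledge and recover metrics above concrete, here’s a minimal sketch of how MTTA and MTTR could be computed from incident timestamps. The record fields and sample data are illustrative assumptions, not the schema of any particular incident management tool:

```python
from datetime import datetime

# Hypothetical incident records pulled from a post-incident review timeline.
# Field names ("detected", "acknowledged", "resolved") are assumptions.
incidents = [
    {"detected": "2023-04-01 10:00", "acknowledged": "2023-04-01 10:06", "resolved": "2023-04-01 10:45"},
    {"detected": "2023-04-08 22:15", "acknowledged": "2023-04-08 22:19", "resolved": "2023-04-08 23:40"},
    {"detected": "2023-04-20 03:30", "acknowledged": "2023-04-20 03:42", "resolved": "2023-04-20 04:10"},
]

FMT = "%Y-%m-%d %H:%M"

def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two timestamp strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60

# Per-incident time to acknowledge and time to recover
ttas = [minutes_between(i["detected"], i["acknowledged"]) for i in incidents]
ttrs = [minutes_between(i["detected"], i["resolved"]) for i in incidents]

mtta = sum(ttas) / len(ttas)  # mean time to acknowledge
mttr = sum(ttrs) / len(ttrs)  # mean time to recover

print(f"MTTA: {mtta:.1f} minutes")
print(f"MTTR: {mttr:.1f} minutes")
```

Tracking these means across reviews, rather than for a single incident, is what reveals whether detection and response are actually improving over time.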
Learning from the real root cause
The real root cause of an incident is never one thing – it’s a combination of actions and decisions made over time. Focusing on the past can help teams learn from mistakes and improve service operations. But, the only real way forward in the rapidly-changing landscape of DevOps and IT is to prepare for the future. Holistic post-incident reviews involving inputs from people, processes and technology are replacing the traditional approach to root cause analysis.
And, as teams conduct more thorough post-incident reviews, they’re learning more about the benefits of DevOps-oriented collaboration and transparency. By shifting testing and QA further left in the development process, teams are exposing vulnerabilities and reliability concerns before they reach customers. IT professionals are getting more exposure to the development lifecycle, and developers are learning more about how their code works in production. The real root cause of an issue can’t come down to one line of bad code. And if it does, you have a much larger problem with the entire process.
(Side note: Check out our post-incident review template to download a free, interactive pdf that you can use as a starting point for your own team.)
DevOps and conducting holistic post-incident reviews
Constant learning and continuous improvement are the only way to ensure more resilient applications and infrastructure without hindering deployment velocity. Preparing for real-time incident response and finding issues earlier in development will always lead to happier customers and more reliable services. Successful teams are juggling rapid scalability with increasingly complex systems in microservice, containerized and hybrid cloud architectures. To meet these challenges, developers and IT professionals can’t work in silos.
Today’s version of effective root cause analysis is more in line with the DevOps approach to post-incident reviews. Root cause analysis shouldn’t only find problems in applications and infrastructure; it should also surface workflow problems between people. Use this root cause analysis template to help you identify weaknesses and make services more reliable over time. Today’s best engineering and IT teams aren’t hiding from failure – they’re learning from it.
Looking to learn more about conducting better post-incident reviews? Get your free copy of our O’Reilly eBook, Post-Incident Reviews, and see exactly how modern teams are learning from failure and creating more resilient systems.