Ishikawa’s fishbone diagram is a method for visualizing and analyzing nearly any problem to find the root cause of an issue. According to TechTarget, the diagram was invented by Dr. Kaoru Ishikawa, a Japanese quality control expert. The methodology can be used both proactively and retroactively to help determine the cause and effect of a current problem or the potential of future problems. In IT and DevOps, the fishbone diagram can be used to look at a major incident and assess everything that led up to the incident – helping you visualize the full scope of an incident, not just a singular root cause.
More than anything, Ishikawa’s fishbone diagram is a way to visually represent the timeline of an incident. You can look at a problem and retroactively identify potential “causes” of the incident due to the people, processes and tooling involved. While the fishbone diagram isn’t specific to DevOps and IT, the practice works quite well for post-incident reviews. So, let’s take a look at how Ishikawa’s fishbone diagram applies to incident management and can help make on-call suck less.
The fishbone diagram gets its name from its shape, which looks like the skeleton of a fish: the problem sits at the head and the contributing causes branch off the spine. By starting at the head of the fish and working back through the rest of the skeleton, you can determine the root cause(s) of an incident and take action to improve incident management and response workflows. Adding the fishbone diagram to your post-incident review process can help you find exactly what’s working and what’s not. Over time, it improves workflow transparency and collaboration across software development and IT operations teams.
The incident lifecycle consists of five stages: detection, response, remediation, analysis and preparation. The first three stages – detection, response and remediation – matter most while an incident is unfolding in real time. Analysis and preparation happen after the incident has been remediated. DevOps and IT teams track incident management KPIs and metrics over time to ensure they’re learning from their post-incident reviews and becoming more efficient.
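One simple way to track the real-time stages is to measure how long each one took. The sketch below is only an illustration: the timestamps, the dictionary keys and the `stage_minutes` helper are hypothetical, and real values would come from whatever alerting or incident management tool your team uses.

```python
from datetime import datetime

# Hypothetical timestamps for a single incident (assumed for illustration).
timeline = {
    "started": datetime(2020, 1, 15, 3, 0),
    "detected": datetime(2020, 1, 15, 3, 4),
    "responded": datetime(2020, 1, 15, 3, 10),
    "remediated": datetime(2020, 1, 15, 3, 45),
}


def stage_minutes(timeline):
    """Return the minutes spent in each real-time stage of the lifecycle."""
    def minutes(start, end):
        return (timeline[end] - timeline[start]).total_seconds() / 60

    return {
        "detection": minutes("started", "detected"),
        "response": minutes("detected", "responded"),
        "remediation": minutes("responded", "remediated"),
    }


durations = stage_minutes(timeline)
for stage, mins in durations.items():
    print(f"{stage}: {mins:.0f} min")
```

Comparing these numbers across incidents is one way to see whether changes made after a post-incident review are shortening the lifecycle.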
Ishikawa’s fishbone diagram for incident management shouldn’t replace post-incident reviews; it should augment them. The team can use the diagram to visualize workflows and talk holistically about their operations – helping them home in on problems and identify the best areas for improvement. Only by acknowledging every element of the incident response and incident management process will you continuously improve and build more reliable services.
The fishbone diagram can be used to break down nearly any problem in your life, but it’s particularly useful for DevOps and IT. Not only can it help you find bottlenecks in software delivery, it can also improve the way you deploy and manage services in production. Because of this, you can use Ishikawa’s method to improve transparency between developers and IT professionals – helping you find ways for cross-functional teams to collaborate around solutions.
When building a fishbone diagram for DevOps or IT, it’s best to look at four main things: people, process, technology and environment.
Who was involved in the incident? How did each of those people get involved in the incident? Determining how people interacted during an incident and what worked well can help you focus on future on-call solutions that work. Service resilience depends on questioning the way humans interact with each other and with the systems they maintain. Without looking at the way people affect each other and the underlying applications and infrastructure, you’re only analyzing half of the equation.
What is your incident management process? How are you setting on-call schedules and what is the monitoring and alerting strategy in place? Can you add escalation policies or alert routing rules to improve the way notifications are being received? The process includes elements from people, technology and environment but is focused on the interactions between the other three parts. The best way to find areas for process improvement is by asking a lot of questions. How exactly did the incident get detected and what steps were taken to remediate the problem? Limiting the number of steps it takes to identify an incident and fix it will shorten the incident lifecycle – giving developers and sysadmins more time to focus on the delivery of future customer value.
What tools were involved in incident response? What monitoring and alerting software was used to detect the issue and get people involved? Which solutions did the team use to collaborate around the problem? Ishikawa’s fishbone diagram allows you to see how technology affects incident management and on-call operations. Is the team missing anything that would have helped them solve the issue faster? Or is the team feeling tool fatigue from needing to use too many tools? Don’t work hard for your technology; make technology work hard for you.
What was the environment when the incident struck? Could the critical incident have been resolved in seconds if a knowledgeable infrastructure engineer hadn’t been on PTO? Was it the middle of a workday when everyone’s in the office or did the issue pop up at 3 AM? The way the team needs to work together to resolve the incident will be very different at 3 AM than in the middle of the day. Adding as much context to the situation as you possibly can will help you understand why people made certain decisions. Understanding the environment around an incident can help you put processes in place to ensure rapid incident response for any scenario.
After looking at the people, process, technology and environment leading up to an incident, you should rank each of them by severity. Ranking each piece of the workflow from 1 to 4 can help you determine which part of the system contributed the most to the incident. This way, you can prioritize future work in the right places of the on-call incident management process – making the largest changes where they count most.
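As a rough illustration of the ranking step, the sketch below models the four categories as a small data structure and sorts them by an assigned rank. The `Bone` class, the `prioritize` helper, the example findings and the convention that 1 marks the largest contributor are all assumptions made for illustration; they are not part of Ishikawa’s method or of any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bone:
    """One branch of the fishbone: a category plus the causes found under it."""
    category: str
    causes: List[str] = field(default_factory=list)
    rank: int = 0  # assumed convention: 1 = contributed most, 4 = contributed least


def prioritize(bones: List[Bone]) -> List[Bone]:
    """Return the bones ordered from largest to smallest contributor."""
    ranks = sorted(b.rank for b in bones)
    if ranks != list(range(1, len(bones) + 1)):
        raise ValueError("each category needs a unique rank from 1 to N")
    return sorted(bones, key=lambda b: b.rank)


# Hypothetical findings from a post-incident review.
review = [
    Bone("people", ["knowledgeable engineer was on PTO"], rank=2),
    Bone("process", ["no escalation policy for the alert"], rank=1),
    Bone("technology", ["alert routed to the wrong channel"], rank=3),
    Bone("environment", ["incident started at 3 AM"], rank=4),
]

for bone in prioritize(review):
    print(f"{bone.rank}. {bone.category}: {'; '.join(bone.causes)}")
```

Requiring a unique rank per category forces the team to agree on a single ordering, which keeps the follow-up work focused on the top one or two contributors.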
Ishikawa’s fishbone diagram is an excellent visual depiction of what you’re trying to learn from post-incident reviews. Whether you choose to use the fishbone diagram or not, you need to analyze major incidents after the fact. Without post-incident reviews, you have no way to learn what works and what doesn’t, or to make improvements. In DevOps and IT, continuous intelligence and post-incident analysis are the only surefire ways to build more robust services faster.
Centralize alerting, incident response and collaboration from detection all the way to the post-incident review with VictorOps. Sign up for a 14-day free trial or request a personalized demo with our sales team to start making on-call suck less and conduct better post-incident reviews.