In DevOps, ITSM, and the ITIL framework, it is important to outline the differences between incident management and problem management. By understanding the current industry-standard definitions of each, and the differences between the two, you can prioritize workflows and better understand how you respond to issues.
While IT service management (ITSM) and the IT Infrastructure Library (ITIL) depend heavily on a distinction between problems and incidents, the distinction means something a little different for DevOps teams. Traditionally, incident management refers to fixing an issue as quickly as possible, while problem management refers to fixing the underlying issue, or the root cause.
With DevOps, this idea of fixing the root cause has completely changed. There really isn’t a single root cause for a problem, and with the DevOps incident management and problem-solving methodology you can create a more comprehensive system for continuous improvement, helping you constantly iterate and build more robust systems faster. Because DevOps, ITSM, and ITIL overlap a fair amount in the industry, it’s important that we clarify the current standard definitions of incident management and problem management.
IT operations teams have handled incident management and problem management a certain way for a number of years: first fix the issue as quickly as possible (incident management), then find and fix the underlying issue (problem management).
Both of these steps clearly need to happen in any organization that encounters an issue in its system. The problem with this approach to incident and problem management is that it only identifies what’s wrong with the infrastructure or application. The entire system for incident and problem management includes not only technical systems but people operations and processes as well.
Enter DevOps. DevOps helps integrate the software delivery lifecycle, people operations, and processes with problem and incident management workflows. Through a collaborative, transparent approach to writing new code and taking ownership for systems in production, you create a holistic process for deploying new services and fixing issues.
While ITSM and ITIL do a decent job of addressing issues in your services, DevOps helps you take it to the next level. In ITIL, there’s a philosophy of Plan-Do-Check-Act (PDCA). PDCA is a good starting framework for monitoring system performance and conducting service reviews, but it doesn’t proactively help you build reliable systems.
With DevOps, you create a more holistic process by combining sprint planning and new deployment information with your current understanding of systems already running in production. Improved visibility and deeper collaboration between developers and IT operations reduce the mean time to acknowledge and mean time to resolve incidents (MTTA/MTTR), and speed up future feature development.
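To make MTTA and MTTR concrete, here is a minimal Python sketch of how a team might compute them from its own incident history. It assumes one common definition of each metric (average time from trigger to acknowledgment, and from trigger to resolution); some teams measure resolution time from a different starting point. The `Incident` record and its field names are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; real incident tools expose their own fields.
    triggered: datetime     # when the alert fired
    acknowledged: datetime  # when a responder acknowledged it
    resolved: datetime      # when service was restored


def mtta(incidents):
    """Mean time to acknowledge: average of (acknowledged - triggered)."""
    seconds = mean((i.acknowledged - i.triggered).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents):
    """Mean time to resolve: average of (resolved - triggered).

    Assumption: resolution time is measured from the trigger, not from
    the acknowledgment.
    """
    seconds = mean((i.resolved - i.triggered).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    # Illustrative data only, not real measurements.
    history = [
        Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 30)),
        Incident(datetime(2024, 1, 2, 12, 0), datetime(2024, 1, 2, 12, 6), datetime(2024, 1, 2, 13, 0)),
    ]
    print("MTTA:", mtta(history))
    print("MTTR:", mttr(history))
```

Tracking these two numbers over time, rather than looking at any single incident, is what shows whether better visibility and collaboration are actually paying off.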
With further exposure to both systems in production and new code releases, the entire team gains a more holistic understanding of your infrastructure or service. Then, when something goes wrong, teams can act faster and conduct thorough post-incident reviews to continuously identify areas for improvement.
Simple monitoring and alerting is the way of the past. Both monitoring and alerting serve as excellent tools in the greater toolbox for software delivery, incident management, and problem management, but they’re not enough to create a fully observable system. DevOps brings developers and IT operations closer together, adds visibility to the software delivery and incident lifecycles, and allows you to build reliable software faster.
Problem management and incident management are really just the first stepping stones toward building reliable systems. Leveraging DevOps and a culture of continuous improvement drives reliability in systems and resiliency in people and operations.