ITIL (Information Technology Infrastructure Library) is a set of best practices focused on delivering IT services aligned with business requirements. It offers a systematic approach to ITSM (IT Service Management), helping organizations manage risk, enhance customer experience and build a robust IT infrastructure.
The ITIL service lifecycle consists of five stages: Service Strategy, Service Design, Service Transition, Service Operation, and Continual Service Improvement. The fourth stage, Service Operation, is responsible for managing technology, applications and IT infrastructure to ensure on-time delivery of services to customers.
In this article, we’ll be focusing on ITIL Problem Management, an important aspect of Service Operation, as well as how DevOps is affecting both incident management and problem management frameworks.
According to ITIL, a problem is the cause of one or more incidents. A single server outage or a malfunctioning mouse is merely an incident and does not by itself indicate a problem. However, when the same incidents keep recurring (e.g. repeated network outages, hardware failures or a poorly written database query that keeps degrading a service), they point to an underlying problem. The main difference between incident management and problem management is that the former restores a service whereas the latter eradicates the root cause of a failed service.
Successful problem management requires continuous planning and a disciplined approach to conducting post-incident reviews and root cause analysis. Problem Management based on ITIL best practices ensures efficient handling and monitoring of problems within the organization. Pershing, a brokerage firm, restructured their service desk based on ITIL guidelines and it helped them reduce incident response time by 50% within one year.
In addition to incident management, each step in ITIL’s problem management lifecycle is essential to successfully resolving a problem and delivering a quality service. Let’s take a deeper look at the ten steps of problem management in ITIL:
The first step is to identify the problem. Incidents are considered problems when they:
In the ITIL framework, all problems are logged via a ticketing system in a problem record, which is a compilation of every problem the organization comes across. Providing information like date and time of occurrence, symptoms, reference of related problems and the steps taken to troubleshoot them can help you find the root cause(s) of a problem.
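As a rough illustration of what such a problem record might hold, here is a minimal sketch in Python. The class name, field names and sample values are all hypothetical and simply mirror the information listed above; a real ticketing system will have its own schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ProblemRecord:
    """One entry in the problem log (illustrative fields only)."""
    problem_id: str
    occurred_at: datetime              # date and time of occurrence
    symptoms: str                      # what users or monitoring observed
    related_problem_ids: List[str] = field(default_factory=list)
    troubleshooting_steps: List[str] = field(default_factory=list)


# Made-up example entry, not real data.
record = ProblemRecord(
    problem_id="PRB-001",
    occurred_at=datetime(2024, 1, 15, 9, 30),
    symptoms="Checkout service intermittently returns HTTP 503",
    related_problem_ids=["PRB-000"],
    troubleshooting_steps=[
        "Restarted the application servers",
        "Checked the database connection pool",
    ],
)
```

Keeping the troubleshooting steps in the record itself means whoever investigates the root cause later can see what has already been tried.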
Categorization includes assigning a primary and secondary category to the issue. It allows your service desk to sort and model incidents that occur frequently, enabling them to gather and report service desk data. This service desk data helps you identify problem trends and assess the effect on service demand.
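To show how primary and secondary categories can feed trend reporting, the sketch below counts problems per category pair. The category names and the sample list are invented for illustration; in practice this data would come from your ticketing system.

```python
from collections import Counter

# Hypothetical (primary, secondary) categories attached to logged problems.
categorized = [
    ("Network", "VPN"),
    ("Database", "Slow query"),
    ("Network", "VPN"),
    ("Hardware", "Disk"),
    ("Network", "DNS"),
]

# Count by full category pair and by primary category alone.
by_pair = Counter(categorized)
by_primary = Counter(primary for primary, _ in categorized)

# The most frequent pair is a candidate trend worth investigating.
top_pair, top_count = by_pair.most_common(1)[0]
print(top_pair, top_count)
print(by_primary["Network"])
```

Even a simple count like this can surface which category is generating the most service desk demand.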
A problem is prioritized based on its urgency and its impact on users and business. Prioritizing a problem allows you to use your resources most efficiently. Prioritization also helps you mitigate the damage to SLAs (service level agreements) by reallocating resources as soon as an issue is detected.
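One common way to combine urgency and impact is a simple priority matrix. The sketch below assumes three levels for each and a 1 to 5 priority scale; both the scale and the formula are illustrative choices, not something ITIL prescribes, and the problem IDs are made up.

```python
# Lower number means more severe. The three-level scale is an assumption.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str) -> int:
    """Return a priority from 1 (handle first) to 5 (handle last)."""
    return LEVELS[impact] + LEVELS[urgency] - 1


# Hypothetical open problems as (id, impact, urgency).
problems = [
    ("PRB-3", "low", "medium"),
    ("PRB-1", "high", "high"),
    ("PRB-2", "medium", "high"),
]

# Work the queue from the highest priority (lowest number) down.
queue = sorted(problems, key=lambda p: priority(p[1], p[2]))
for problem_id, impact, urgency in queue:
    print(problem_id, priority(impact, urgency))
```

Making the rule explicit like this keeps prioritization consistent across the service desk instead of leaving it to individual judgment.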
How quickly a problem is investigated and diagnosed depends on its assigned priority. High-priority problems are addressed first because of their impact on service delivery. The process involves analyzing and testing the incidents logged in the problem record that could not be tested at the service desk level.
Problems are not resolved at the incident level, and resolving one can take anywhere from hours to months. A short-term workaround is essential because it enables your service desk to restore services and communicate failures and outages to customers while the problem is being resolved in parallel. However, a workaround should always be considered a temporary solution, since a problem remains open until its root cause is resolved.
The next step is to record a known error. As soon as a workaround is identified, it should be communicated to employees as a “known error.” You should always record a known error in an incident knowledge base or a known error database (KEDB). With the workaround documented, your service desk can resolve matching incidents quickly and avoid raising duplicate problems for the same issue.
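To make the idea of a KEDB lookup concrete, here is a minimal sketch in which an incoming incident summary is matched against known errors by keyword. The `KnownError` structure, the keyword-matching rule and the sample entries are all assumptions made for illustration; real KEDB tools typically offer richer search.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class KnownError:
    error_id: str
    symptom_keywords: FrozenSet[str]   # all must appear in the incident summary
    workaround: str


# Hypothetical KEDB contents.
KEDB = [
    KnownError("KE-1", frozenset({"vpn", "timeout"}),
               "Reconnect through the backup gateway"),
    KnownError("KE-2", frozenset({"printer", "offline"}),
               "Restart the print spooler"),
]


def find_known_error(summary: str, kedb: List[KnownError]) -> Optional[KnownError]:
    """Return the first known error whose keywords all occur in the summary."""
    words = set(summary.lower().split())
    for entry in kedb:
        if entry.symptom_keywords <= words:
            return entry
    return None


match = find_known_error("VPN connection timeout for remote staff", KEDB)
if match is not None:
    print(match.error_id, "->", match.workaround)
```

When a match is found, the service desk can apply the documented workaround right away and link the incident to the existing problem rather than opening a new one.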
You should resolve as many problems as possible. However, some resolutions might need to go through change management and be reviewed by a change advisory board (CAB), since they can affect your service levels. For example, switching your database can slow down your operations. Thus, all associated risks must be evaluated and accounted for before you implement a resolution.
Once a problem has been identified, logged, categorized, prioritized, diagnosed, and resolved, it is considered closed.
Most organizations usually stop at this stage, but the ITIL problem management lifecycle recommends an additional step (10th step) – reviewing the problem in order to help prevent the occurrence of future problems. During this phase, your team conducts post-incident reviews and analyzes the problem document to identify areas for improvement.
Teams can reference the problem log to identify mistakes and take appropriate measures to prevent similar problems in the future. Post-incident and problem reviews will help with process improvement, operational efficiency, and increased employee productivity.
Problem management reduces the number of incidents over time, thereby increasing the service quality and decreasing the overall cost associated with service downtime. Implementing DevOps in the ITIL framework can further enhance overall service delivery:
Because DevOps engineers continuously look for the underlying causes of problems and learn from them, they can anticipate problems that might occur in the future (as early as the initial design phase of the lifecycle), preventing known defects from moving downstream.
By taking the DevOps approach, you’ll ensure better communication and collaboration between multiple teams involved in the ITIL problem management lifecycle. The problem managers can share an analysis of the problems they face and their impact on the service, allowing DevOps-centric teams to focus on high priority issues.
Managing problems efficiently while maintaining a robust IT infrastructure can help you save cost while increasing employee efficiency and productivity.
Combining DevOps and the ITIL problem management framework allows teams to focus not only on development and on-time delivery but also on the continuous improvement of processes and the continuous performance of functionality (when in production).
Leveraging DevOps with elements of the ITIL framework ensures high-quality continuous deployments and speeds up the incident and problem management lifecycles, helping you rapidly deliver value to your customers.
VictorOps centralizes monitoring tools and incident response to improve problem detection and create real-time collaborative workflows. Try a 14-day free trial of VictorOps to reduce problems, make on-call suck less and build more reliable services.