

The Incident Lifecycle Guide

In traditional ITIL (IT Infrastructure Library), it’s called the Incident Lifecycle. Others know it as the process of solving IT issues in real time, as they happen. You might simply call it “a long night without sleep.”

Whatever you call it, following an alert from the moment it comes in until it is resolved can be a long process. Fortunately, the right tools can make this challenge much less daunting.

When you break it down into its parts, fighting the good fight becomes simply a matter of choosing a solution that works with each part of the incident lifecycle.

Stages of the incident lifecycle

[Figure: The six stages of the incident lifecycle]

Alerting


“I received notification that a critical alert has come in.”

If you break down typical incident resolution into phases, you’ll see that, generally, the smallest portion of time is spent being alerted to the problem. On average, our data shows that only 5% of the total incident lifecycle has anything to do with alerting or problem escalation. There are incidents where a team member doesn’t respond to an alert. When this happens, though, the cause is more often the individual team member than the platform’s ability to get the alert to the right person.

Historically, the alerting phase took up a larger portion of time to resolve (TTR). Back when teams actually carried pagers, those systems were quite slow, which delayed alert acknowledgement. Now that team members have smartphones, people are in more of an “always-on” state, constantly engaged with their technology. Mobile alerting, and alerting in general, has improved along with WAN, LAN and SMS data delivery.

Nonetheless, the truth of the matter is that a perfect “zero-time” alerting solution, one which finds people instantly, can only affect average TTR by a very small percentage.
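To make the arithmetic concrete, here is a minimal sketch of how much total TTR changes when one phase gets faster. The 5% alerting share comes from the figure above; the 60-minute incident and the function name are assumptions chosen only for illustration.

```python
def ttr_after_speedup(total_minutes, phase_share, phase_speedup):
    """Return total TTR after one phase is shortened.

    phase_share   -- fraction of TTR spent in the phase (0.05 for alerting)
    phase_speedup -- fraction of that phase's time removed (1.0 = "zero-time")
    """
    phase_minutes = total_minutes * phase_share
    return total_minutes - phase_minutes * phase_speedup


# Hypothetical 60-minute incident with a perfect, instantaneous alerting phase.
baseline = 60.0
improved = ttr_after_speedup(baseline, phase_share=0.05, phase_speedup=1.0)
saved = baseline - improved
print(f"TTR drops from {baseline:.0f} to {improved:.0f} minutes ({saved:.0f} saved)")
```

Even with alerting time removed entirely, the hypothetical incident only gets about three minutes shorter, because alerting was only 5% of the total to begin with.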

What part does VictorOps play?

  • Push, SMS and phone notifications (with customizable ring tones) will make sure that no one misses an alert

  • Additionally, rich alerts provide context around the incident and baked-in solutions make the on-call person’s job easier

Triage


“I know there’s a problem but I have no idea who or what is affected.”

This is the phase of the incident lifecycle that can cause the most stress. It’s anxiety-inducing for someone new to the on-call process to find out exactly what’s wrong by picking up the phone to call someone, who may or may not be awake, and who may or may not have the right answer. But, it’s even worse if the incident happens at 3 AM.

Based on our internal customer data, 18% of TTR is spent simply getting an initial person (or subsequent team member) up to speed on what’s happening. This information is rarely contained completely in the alert metadata; it usually requires looking at other markers in the system as well.

We call this situational awareness, and having situational awareness in the platform can have a big impact on TTR. The faster you can get the right eyes on the problem, the faster you can solve the problem.

What part does VictorOps play?

  • The incident timeline provides a single view of all activities surrounding the incident, including alerts, paging, chat messages and the ability to reroute the alert to a different person or team

  • Also, with intelligent alert routing that can send an alert to the person who knows how to fix the problem, triage becomes less stressful

Investigation


“I need help digging into the issue.”

The largest share of time to resolve (TTR), a full 40%, falls into what we call the investigation phase of incident management. Investigation requires the on-call person to play the part of detective by following up on possible leads while ruling out the usual suspects.

This phase includes logging into the system, tailing logs, consulting performance monitoring tools, etc. It also involves consulting internal documentation resources such as wikis or ticketing systems. Anyone can triage, but it takes a higher level of thinking to figure out the one thing that’s broken and may be causing everything else to break.

If triage has been successful, the on-call engineer has already figured out who else needs to be involved in this phase. Getting other people involved early means that they can help look at the series of events and possibly recognize a pattern you may have missed.

What part does VictorOps play?

  • Our timeline allows for easy IT and DevOps collaboration. You can chat, send private messages or mention an entire team in a question posted to the timeline.

  • Annotations attached to alerts can provide much-needed direction as to how the problem was solved last time or who to contact to find an answer.

Identification


“Everything will be better if I fix this one thing.”

Once you know what the problem is, you just need to find the answer, which is typically much easier said than done. Depending on what your company’s documentation is like, getting your hands on an up-to-date response protocol might just be the hardest part of solving the problem. Remediation docs may be stored in an internal wiki, a spreadsheet, or in some cases, someone’s head.

Today’s systems are so complex and ever-changing that they require new ways of maintaining them. On top of that, many teams now have a varied group of individuals taking part in the on-call rotation. If a database team member is responding to the alert, they may need to know how to solve problems outside their domain of expertise.

Imagine if every alert automatically surfaced contextual information and suggested solutions to resolve the problem. Knowing exactly how to solve the problem is one surefire way of reducing TTR. The goal is real-time remediation data right where you need it, when you need it.
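As a rough illustration of what surfacing context with an alert could look like, here is a minimal sketch that attaches notes to an alert based on simple pattern rules. The field names, patterns, URL and notes are all hypothetical; this is not a VictorOps API or a description of how any product implements annotations.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (alert field, glob pattern, note shown to the responder).
ANNOTATION_RULES = [
    ("service", "db-*", "Runbook: https://wiki.example.com/runbooks/database"),
    ("check", "disk_*", "Last time: cleared old logs to free disk space"),
    ("service", "*", "Escalation contact: the on-call platform team"),
]


def annotate(alert, rules=ANNOTATION_RULES):
    """Return a copy of the alert with every matching note attached."""
    notes = [
        note
        for field, pattern, note in rules
        if fnmatchcase(str(alert.get(field, "")), pattern)
    ]
    return {**alert, "annotations": notes}


# Example alert payload (assumed shape, for illustration only).
alert = {"service": "db-primary", "check": "disk_usage", "state": "CRITICAL"}
for note in annotate(alert)["annotations"]:
    print(note)
```

The point of the sketch is that the responder, who may be outside their own domain, receives the runbook link and the previous fix together with the alert instead of having to search for them.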

What part does VictorOps play?

  • The ability to annotate alerts with links to internal runbooks, graphs and notes about how the problem was solved means the on-call person can actually solve new incidents much faster.

  • Knowing the solution will be easy to find is a major game changer when it comes to making on-call suck less.

Resolution


“I’m fixing it.”

10% of TTR falls into the incident resolution stage of the incident lifecycle. This is represented by team members performing system actions to fix the problems that started the incident. It unfortunately also means waiting for systems to recover and verifying that the incident’s root cause was found and fixed, often extending team involvement longer than desired.

The resolution phase is perhaps the largest potential lever in a true collaborative system. To reduce TTR in the resolution phase, you need a feature set that self-documents what teams do to solve the problem. This is, in a sense, the heart of collaboration: the ability to not only reduce TTR during the current resolution cycle, but also capture that knowledge to pay it forward next time.

Fix the one thing, watch the other things get better. Update documentation so you can easily fix that one thing again in the future.

What part does VictorOps play?

  • With bidirectional integrations with chat applications like Slack, it’s easy to collaborate and manage many aspects of your infrastructure from the comfort of your firefighting chat room.

Documentation


“I don’t want to have to deal with that again.”

After an incident is resolved, on-call best practices mandate that a post-incident review, or retrospective, take place. An accurate, comprehensive post-incident review is an essential tool for communicating with internal and external stakeholders. But more importantly, it helps you prepare for similar incidents and, ideally, prevent them from occurring again.

The ideal report would pull together everything that happened during the entire incident lifecycle, with a single authoritative clock that gives context to the event and includes all relevant communications. The post-incident report would also be customizable, allowing you to edit the documentation, remove unimportant details and add notes where applicable. The post-incident review should provide a high-level snapshot of exactly what the incident entailed.
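One way to picture the "single authoritative clock" idea is to merge events from several sources into one list ordered by timestamp. The sketch below assumes each source can export timezone-aware timestamps; the Event type, source names and sample messages are hypothetical and only illustrate the merge.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    at: datetime   # timezone-aware timestamp
    source: str    # e.g. "monitoring", "paging", "chat"
    text: str


def build_timeline(*streams):
    """Merge any number of event lists into one list ordered by time."""
    events = [event for stream in streams for event in stream]
    return sorted(events, key=lambda event: event.at.astimezone(timezone.utc))


def render(timeline):
    """Format each event with its offset in whole minutes from the first event."""
    if not timeline:
        return []
    start = timeline[0].at
    lines = []
    for event in timeline:
        minutes = int((event.at - start).total_seconds() // 60)
        lines.append(f"+{minutes:>3}m [{event.source}] {event.text}")
    return lines


# Hypothetical incident data from three separate tools.
t0 = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
alerts = [Event(t0, "monitoring", "CRITICAL: checkout latency")]
chat = [Event(t0 + timedelta(minutes=12), "chat", "Rolling back last deploy")]
paging = [Event(t0 + timedelta(minutes=1), "paging", "Paged primary on-call")]

for line in render(build_timeline(alerts, chat, paging)):
    print(line)
```

Normalizing every timestamp to UTC before sorting is a design choice that keeps the ordering stable even when the tools involved report times in different zones.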

Want to know even more about your on-call process? Add reporting around incident metrics such as incident frequency and on-call metrics so you have a much larger picture of what’s working with your alerting and what’s not.

What part does VictorOps play?

  • Our reporting gives you the ability to improve your process around incident resolution and helps to facilitate documentation cleanup by letting you make notes about the accuracy/helpfulness of annotated alerts while in the moment.

Make on-call suck less

Shorten the incident lifecycle with an incident management tool that encourages a culture of collaboration and transparency. Surface software delivery and incident details across your entire organization with a centralized platform for on-call schedules, intelligent alert routing, and chat to make on-call suck less.

Ready to start using a holistic end-to-end incident management solution? Sign up for a personalized demo with one of our product experts or go at it yourself in a 14-day free trial.
