Why the Entire Incident Lifecycle Matters

Tara Calihman April 03, 2015

DevOps Monitoring & Alerting

Being on-call is more than just scheduling and alerting.

It’s a matter of following an incident through from the moment you receive the alert to solving the problem and conducting a post-mortem. To help you visualize it, we’ve created this graphic that breaks the incident lifecycle into six distinct parts and shows how much of the whole each part makes up…

[Image: Incident lifecycle graphic]

As you can see, getting alerted to the problem is really just the tip of the iceberg. There are lots of tools that alert you to a problem but not many that stick with you through the firefight, providing a way to collaborate in order to solve the problem faster.

Many on-call engineers report that the triage and investigation phases take up the most time and are also the most stressful parts of the incident lifecycle. If you don’t know how to resolve an incident or where to find the information you need to troubleshoot effectively, it’s easy to see how hard it becomes to work through the incident without outside help. Add the pressure of having to escalate or pull in other team members, and you’ve increased the anxiety tenfold.

The State of On-Call Report contains data on time to resolution (TTR) and how each phase of the incident lifecycle can affect that number.
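To make the link between phases and TTR concrete, here is a minimal Python sketch. It assumes TTR is measured from the first alert through resolution and is simply the sum of the time spent in each phase. The phase names and durations are illustrative placeholders, not figures from the State of On-Call Report.

```python
from datetime import timedelta

# Hypothetical phase durations for a single incident.
# Illustrative values only; these are NOT taken from the report.
phase_durations = {
    "alerting": timedelta(minutes=5),
    "triage": timedelta(minutes=20),
    "investigation": timedelta(minutes=25),
    "resolution": timedelta(minutes=10),
}


def time_to_resolution(phases):
    """Sum the phase durations to get the incident's total TTR."""
    return sum(phases.values(), timedelta())


def phase_shares(phases):
    """Return each phase's fraction of the total TTR."""
    total = time_to_resolution(phases)
    return {name: duration / total for name, duration in phases.items()}


ttr = time_to_resolution(phase_durations)
shares = phase_shares(phase_durations)
for name, share in shares.items():
    print(f"{name}: {share:.0%} of a {ttr} TTR")
```

Tracking the shares per incident makes it easier to see which phase to shorten first.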

The real question becomes one of saving time and money. How can you improve the way your team responds during each phase of the incident lifecycle?

What if you had problem-solving tips baked into your alerts? Having the right information at your fingertips means that the triage and investigation phases can be significantly shortened, getting an incident to the resolution phase sooner.
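As a rough illustration of tips baked into an alert, the Python sketch below attaches troubleshooting notes to an alert before it reaches the on-call engineer. Everything in it is hypothetical: the `RUNBOOK_TIPS` table, the `annotate_alert` helper, and the alert fields are made up for this example and are not part of any particular product's API.

```python
# Hypothetical lookup table mapping an alert's check name to troubleshooting tips.
RUNBOOK_TIPS = {
    "disk_usage": [
        "Find what is growing: du -sh /var/*",
        "Rotate or compress old logs before expanding the volume.",
    ],
    "http_5xx": [
        "Compare the error spike against the time of the latest deploy.",
    ],
}


def annotate_alert(alert, tips=RUNBOOK_TIPS):
    """Return a copy of the alert with any matching tips attached."""
    annotated = dict(alert)  # leave the original alert untouched
    annotated["tips"] = list(tips.get(alert.get("check"), []))
    return annotated


# Example alert payload (field names are assumptions for this sketch).
alert = {"check": "disk_usage", "host": "db-01", "state": "CRITICAL"}
annotated = annotate_alert(alert)
for tip in annotated["tips"]:
    print(f"- {tip}")
```

An alert with no matching entry simply gets an empty tips list, so the same path works for checks that have no runbook yet.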

If you want to take a deeper dive into the different phases of the incident lifecycle, we created a helpful guide that breaks down each part. Let us know what you think - is there anything we missed?

Let us help you make on-call suck less.
