Continuing on from the first part of my post, I wanted to dig in a bit and talk about the differences between alerting and collaboration. The VictorOps platform is built to provide functionality in all phases of the Incident Lifecycle.
Alerting and Acknowledgment: As stated, about 10% of incident resolution time is tied back to incident identification and routing, enacting company and individual escalation policies, and finally contacting the correct individual via their desired contact method. Easy-to-use, integrated mobile apps help reduce the effort needed to silence or acknowledge the incident.
Triage and Situational Awareness: Based on our observations, 20% of the TTR is simply getting an initial person (or subsequent team member) up to speed with what is happening. This information is rarely contained completely in the alert metadata, but rather requires seeing other markers in the system as well. We call this situational awareness, and having situational awareness in the platform can have a big impact on TTR. In fact, in serious situations when the team works through successive resolution cycles driven by escalations, this 20% remains fairly constant. This is the main concept of the timeline view in the VO platform. As we integrate other great tools and services into that display, a clearer picture of what is happening will be available faster, reducing the 20% hit on TTR.
Investigation and Communication: A full 50% of TTR falls into what we call the Investigation Phase of Incident Resolution. This involves logging into the system, tailing logs, consulting performance monitoring tools, etc. It also involves consulting internal documentation resources such as wikis or ticketing systems. As we evolve the VO platform, our goal is to surface these resources in the context of incidents and their timelines to help reduce TTR for teams investigating the problem. This often falls under the concept of Alert Enrichment.
Those who remember the Linux Leap Second Bug may recall that it was several hours before Google results showed that the problem was a systemic, kernel-level problem in Linux. This is a perfect example of how a platform like VO could be helpful to the greater good by noticing a very specific symptom across a large set of customers and informing teams via the timeline or by system-wide broadcasts. We believe there are a lot of gains to be made in total TTR by addressing some of these issues.
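To make the Leap Second idea a little more concrete, here is a minimal sketch of what "noticing a specific symptom across a large set of customers" could look like. It is an illustration only, not how the VO platform works: the function name `widespread_symptoms`, the `(customer_id, symptom_signature)` alert shape, the `min_customers` threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict


def widespread_symptoms(alerts, min_customers=3):
    """Return symptom signatures reported by at least `min_customers` distinct customers.

    `alerts` is an iterable of (customer_id, symptom_signature) pairs.
    Both fields are hypothetical stand-ins for real alert metadata.
    """
    customers_by_symptom = defaultdict(set)
    for customer_id, signature in alerts:
        # A set counts each customer once, however many alerts they send.
        customers_by_symptom[signature].add(customer_id)
    return sorted(
        signature
        for signature, customers in customers_by_symptom.items()
        if len(customers) >= min_customers
    )


if __name__ == "__main__":
    # Made-up sample data for illustration.
    sample_alerts = [
        ("acme", "cpu spike after leap second"),
        ("globex", "cpu spike after leap second"),
        ("initech", "cpu spike after leap second"),
        ("acme", "disk full"),
    ]
    for signature in widespread_symptoms(sample_alerts):
        print(f"Broadcast candidate: {signature}")
```

A symptom that shows up for one customer is that customer's problem; the same symptom showing up for many unrelated customers at once is a hint that the cause is systemic, and that is the moment a broadcast could save everyone hours of investigation.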
Problem Resolution and Documentation: Finally, 20% of TTR falls into a category we call Problem Resolution and Documentation. This is represented by team members performing system actions to fix the problem that started the incident. It unfortunately also means waiting for systems to recover and verifying that the root cause was found, often extending team involvement longer than desired. The Problem Resolution phase is perhaps the largest potential lever in a true collaborative system, and one that our Alpha and Beta customers are the most excited to see us build. To reduce TTR in the Resolution phase, you need a feature set that self-documents what teams do to solve the problem. This is, in a sense, the heart of collaboration: the ability to not only reduce TTR during the current resolution cycle, but also capture that knowledge to pay it forward next time. This is some of our secret sauce for next year, so enjoy the sneak peek.
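To put the four percentages side by side, here is a small back-of-the-envelope sketch. The phase shares are the ones quoted above (10%, 20%, 50%, 20%); the helper names `phase_minutes` and `ttr_after_reduction`, the 60-minute incident, and the 50% reduction are assumptions chosen purely for illustration.

```python
# Phase shares of total TTR, as quoted in this post.
PHASE_SHARES = {
    "Alerting and Acknowledgment": 0.10,
    "Triage and Situational Awareness": 0.20,
    "Investigation and Communication": 0.50,
    "Problem Resolution and Documentation": 0.20,
}


def phase_minutes(total_ttr_minutes):
    """Split a total TTR across the four phases using the shares above."""
    return {
        phase: total_ttr_minutes * share
        for phase, share in PHASE_SHARES.items()
    }


def ttr_after_reduction(total_ttr_minutes, phase, reduction):
    """Return the new total TTR if one phase shrinks by `reduction` (0.0 to 1.0)."""
    saved = total_ttr_minutes * PHASE_SHARES[phase] * reduction
    return total_ttr_minutes - saved


if __name__ == "__main__":
    # Hypothetical 60-minute incident.
    for phase, minutes in phase_minutes(60).items():
        print(f"{phase}: {minutes:.0f} min")
    # Hypothetical: halve the time spent in the Investigation phase.
    print(ttr_after_reduction(60, "Investigation and Communication", 0.5))
```

With these numbers, a hypothetical 60-minute incident spends about 6 minutes on alerting and 30 on investigation. Even eliminating the alerting phase entirely would save only those 6 minutes, while halving investigation time would save 15, which is why the phases after the alert are where a collaborative platform has the most room to help.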
The Power of a DevOps Collaborative Platform
The real power of a collaborative solution for real-time problem solving is in its ability to leverage the entire team’s knowledge in a scalable way through all phases of the incident lifecycle. The ability to share situational data easily, inject thoughts and comments, and leverage team members who are not physically present is key to reducing TTR for the company and to a better lifestyle for the team as a whole.
Collaboration for DevOps teams goes beyond communication, however. A powerful platform can also learn how the team solves problems in order to help solve the same or similar problems the next time they occur.
VictorOps intends to be the glue that brings teams together to solve problems faster and reduce TTR. This, my friends, is the difference between an alerting service and a collaborative platform.