“I’m grateful every day that we use VictorOps. And there is no hyperbole in that statement.”
One weekend I flew to Kansas City for my wife’s high school reunion. Just before I got on the flight, I opened the VictorOps mobile app and tagged out of our on-call rotation. I messaged in the app, “Hey, we’re boarding, can you grab on-call?” After that, I didn’t have to think about it anymore. I tagged back in once we landed, and then tagged back out again when I had to go to the reunion.
If you’ve been in this world for only 5 years, these problems may not resonate yet. After 17 years on-call, I recognize that I don’t want to ruin the rest of my teams’ lives by not having a system like VictorOps in place. The last year using VictorOps has been materially and massively better for our teams – faster time-to-resolution, less on-call fatigue, less anxiety in the management team, and happier people.
A magically simple button called ‘take on-call’
The ‘take on-call’ button in the VictorOps app is huge for us. It is ridiculously simple, but it was the primary motivation for choosing VictorOps because it immediately solved the problem of one-off events.
The on-call management tool we used before was OK at laying out who is on-call for the next 12 months, but absolutely terrible with one-offs. For example, let’s say that Bob is on-call but he has concert tickets. Because the old tool didn’t have a seamless solution for on-call hand-offs, he would just go to the show and turn his phone off, and everyone else would log on to their computers and wait. We were getting by with it, but it was always painful.
It doesn’t matter what the event is – with VictorOps I can ensure coverage, see who is on-call, and not have to worry.
I remember I had a sandwich in my hand when the app went down…
As a startup, we don’t always do a great job at writing everything down. Things are just moving too quickly. That’s why for us VictorOps’ timeline is key. The timeline gives us a precise view of the event that we can be analytical about.
We used to sit down as a team to do the post-mortem analysis. In trying to remember the exact time frame, one of our front-end devs might say, “I remember I had a sandwich in my hand when the app went down, so it must have been about 12:30.” Then we’d start searching through email, SMS, Nagios, Splunk, CloudWatch and all of our other systems for the alerts and communication that took place during that time. If you are doing big data analysis, you just can’t start with a broad time frame like that. You have to drill into the exact moment the incident occurred.
VictorOps pulls all of our alerts, context and team communication into one place. If the world is on fire and you’re spending 5 minutes of every 10 tabbing through different systems to respond to an event, you’re already losing. Nagios, CloudWatch and Splunk all point to VictorOps, and we have built our own integrations with the API.
I can get into VictorOps and pinpoint an exact moment in time and see exactly what took place – that’s pretty incredible.
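To make the idea of a single timeline concrete, here is a minimal Python sketch of merging events from several sources and zooming in on one moment. It is only an illustration: it does not use the VictorOps API, and the `Event` type, the `build_timeline` and `window` helpers, and all of the sample alerts are made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # hypothetical label, e.g. "nagios", "cloudwatch", "chat"
    message: str


def build_timeline(*feeds):
    """Merge any number of event lists into one list sorted by time."""
    merged = [event for feed in feeds for event in feed]
    return sorted(merged, key=lambda event: event.timestamp)


def window(timeline, center, minutes=5):
    """Return the events within +/- `minutes` of `center`."""
    delta = timedelta(minutes=minutes)
    return [e for e in timeline if center - delta <= e.timestamp <= center + delta]


# Made-up sample data standing in for alerts and team chat.
nagios = [Event(datetime(2024, 1, 15, 12, 31), "nagios", "CRITICAL: HTTP check failed")]
cloudwatch = [
    Event(datetime(2024, 1, 15, 12, 29), "cloudwatch", "ALARM: high 5xx rate"),
    Event(datetime(2024, 1, 15, 9, 0), "cloudwatch", "OK: CPU normal"),
]
chat = [Event(datetime(2024, 1, 15, 12, 33), "chat", "Looking at it now")]

timeline = build_timeline(nagios, cloudwatch, chat)
around_lunch = window(timeline, datetime(2024, 1, 15, 12, 30), minutes=5)

for event in around_lunch:
    print(event.timestamp.strftime("%H:%M"), event.source, event.message)
```

With everything in one sorted list, “about 12:30” becomes a narrow window that can be inspected directly instead of a search across each system in turn.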
Knowing when not to alert
The systemic problem that VictorOps solves is managing on-call and life. With VictorOps, we are way better about not alerting – not alerting when an event isn’t actionable, or not alerting the wrong person.
The solution we used before VictorOps wasn’t flexible in terms of escalation, so our escalation policies were really aggressive. Since we didn’t always trust that the system would alert the right person on the right team in the right way, we would escalate after only 2 minutes. There were a lot of phones ringing when they didn’t need to be and a lot of alert fatigue.
We materially reduced alert fatigue by creating nested escalation policies in VictorOps and routing incidents to each team by type of event.
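As a rough illustration of what nesting and routing mean here, the Python sketch below models an escalation policy whose steps can point either at a person or at another policy, plus a lookup that routes by event type. This is not VictorOps’ configuration format or API; the `Step` and `Policy` types, the team names, the people, and the timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    after_minutes: int  # minutes without acknowledgement before this step fires
    notify: list        # person names (str) or nested Policy objects


@dataclass
class Policy:
    name: str
    steps: list


def paged_so_far(policy, minutes_unacked):
    """List everyone who should have been paged after `minutes_unacked` minutes."""
    people = []
    for step in policy.steps:
        if minutes_unacked < step.after_minutes:
            continue  # this step has not fired yet
        for target in step.notify:
            if isinstance(target, Policy):
                # A nested policy starts its own clock when its parent step fires.
                people.extend(paged_so_far(target, minutes_unacked - step.after_minutes))
            else:
                people.append(target)
    return people


# Hypothetical policies: each team is paged first, managers only as a last step.
managers = Policy("managers", [Step(0, ["dana"])])
database = Policy("database", [Step(0, ["alice"]), Step(15, ["bob"]), Step(30, [managers])])
frontend = Policy("frontend", [Step(0, ["carol"]), Step(30, [managers])])

# Route incidents by type of event; unknown types fall back to the managers policy.
ROUTES = {"database": database, "frontend": frontend}


def route(event_type):
    return ROUTES.get(event_type, managers)


print(paged_so_far(route("database"), 0))
print(paged_so_far(route("database"), 30))
```

Because each team has its own policy and the shared managers policy is reused as a late step, the early minutes of an incident ring only the phone of the person who can act on it.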
I have a really high confidence level in VictorOps. The system is reliable. When we have questions, their customer support team is fantastic. I only say that when it’s true, and I don’t say that very often.