On-Call Horror Story Number Three: This Wins the Most Grueling Award

Marlo Vernon October 26, 2017

DevOps Monitoring & Alerting

They’re cringeworthy. They’re nerve-racking. They make you sweat just thinking about them. That’s right, we’re talking about on-call horror stories.

We asked five VictorOps employees to step into our confessional booth and tell us their worst on-call horror story. Today, Todd Vernon, the CEO and cofounder of VictorOps, shares his infamous on-call event. After hearing about it, we decided to give him the Most Grueling award.

Watch the video below—or read on—to learn about his marathon firefight.

The Situation: Supporting the First Reservationless Conferencing System

At the time, Todd was the CTO of Raindance, a company that provided reservationless voice conferencing. Raindance was able to make reservationless voice calls using specialty telephony boxes, which worked together with a Sun server. These telephony boxes would terminate four thousand phone lines and then mix those into conferences. When a customer dialed up and entered their access code, it would hit the Sun server, which would validate the code and place the customer in the conference.

Raindance was a new company with only a handful of customers, and they had just landed their biggest customer: Wells Fargo.

The System Failed—Right before Super Bowl Sunday

It was the Friday before the Super Bowl of 1998. The Denver Broncos were playing the Green Bay Packers, so everyone in the Colorado-based company had Sunday plans. Accordingly, Raindance planned a software deployment for 2 a.m. on Friday night, providing plenty of time to fix any bugs before Super Bowl Sunday. However, after the team deployed, the system wouldn’t restart. Two hours later, Todd got a call from the team that the system still wasn’t working. So at 4 a.m., everyone who was technical, about 25 people including Todd, mobilized and drove into the office.

Todd and his team started debugging and looking at the code with no luck: zero calls were able to go through the system. At one point, Wells Fargo contacted them asking why they were unable to place calls. With thousands of calls being made every day, Wells Fargo relied on the phone system to work. The pressure was on.

All day Saturday turned into all night Saturday. At this point, the engineers were completely spent; they had been working on the problem for 24 hours. Todd described how exhausted the team was: “There were people just sleeping on the floor because no one could leave.”

By noon on Sunday, it looked like the issue was about to ruin the entire weekend. In a last-ditch effort, the telephony box vendors brought in an entirely new set of gear. However, after installing the new gear, the system still wouldn’t work. The team was at a loss for words. How could the brand-new gear not work?

With a Stroke of Luck, the Day Was Saved

Finally, they were able to trace it back to the Sun server. The entire problem came down to one file, a file someone happened to be looking at by chance. Inspecting this file revealed that the binary the system loaded was corrupt. Todd explained how shocking this was: “That never happens. In my entire career, I’ve never had a corrupt binary… Ever.” Once they got a new build, the system was back up and running. The total downtime was 48 hours.

After the madness, the engineering team stayed in the office and streamed the Super Bowl, missing celebrations with friends and family. With everything that had happened, they were too scared to leave the system alone. They eventually went home around midnight after the Super Bowl had ended—by the way, the Broncos won 31-24.

Todd’s Lesson

Even though the event was stressful and traumatic, Todd and his team learned some valuable lessons. Todd explained, “When something goes horribly wrong, don’t bring everybody in. More ideas are good to a point, but if you don’t solve it in the window of a normal human’s ability to stay awake, the value they are giving you goes down exponentially as they get tired.” He mentioned that the incident was also good for the team because they realized they needed a lot more instrumentation in place. The entire team learned the value of checking every system, even those running software they had not personally built.
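One example of the kind of check this lesson points to is verifying a build against a checksum recorded when it was produced, before the system is asked to load it. The sketch below is an illustration only, not what Raindance did: the function names `sha256_of` and `binary_is_intact` are hypothetical, and it assumes a known-good SHA-256 digest is available from the build step.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binary_is_intact(path: Path, expected_digest: str) -> bool:
    """Compare the file's digest with the digest recorded at build time (assumed to exist)."""
    return sha256_of(path) == expected_digest.strip().lower()
```

Run as a pre-restart step, a check like this turns a silently corrupt file into an explicit failure that points at the artifact rather than at the surrounding hardware.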

Todd ended his interview with three simple words: “That was horrific,” a common ending statement from several of my interviewees. It goes to show how memorable these on-call horror stories are—and how much they affect the engineers involved.

Read the next award in our series: The Most Debilitating Award

Catch up with the other on-call horror stories.
