Marlo Vernon - September 27, 2017
We’re back with the second tale in our series of on-call horror stories. Our first story featured Bryce’s most embarrassing on-call experience. This time, Dan Hopkins, VictorOps VP of Engineering, stepped into the confessional booth to tell his worst outage story. Watch the video, or read on to learn why his experience was the most paralyzing.
Dan’s worst outage story took place at LivingSocial, a business that sells location-specific coupons for products and services. Dan built a promo code service in which someone could buy a coupon and receive 10 percent off its face value. The goal was to limit those 10 percent off promotions to specific, targeted customers, and keep them off universal coupon sites. Dan was the only engineer on the project.
Once Dan built the system, he integrated it into the checkout flow. If a particular promo code wasn’t legitimate for use, then the service prevented that customer from checking out. If the service failed, people couldn’t buy anything, and LivingSocial wouldn’t make money.
Before deploying the new service, Dan’s team ran a test. They sent a promotional email to 100 people, who successfully used the new service. Assuming that everything worked, LivingSocial sent the promotion to 20 million people.
Dan saw messages starting to pop up on Campfire asking why LivingSocial had stopped processing transactions. He knew immediately it was the promo code service he had just built.
As the service was crashing, Dan was in Washington, D.C., visiting a remote office. He was out of his element, with no coworkers around for support. Nobody else even knew the promo code service existed. At this point, the problem was starting to seriously impact revenue.
Dan started investigating. Payment requests were hanging: every time someone tried to check out with a promo code, the system choked as it ran a series of rules against a CSV file of 20 million user IDs to validate the code.
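To see why that kind of per-request lookup falls over at scale, here is a minimal Python sketch. It is not LivingSocial's actual code: the function names, the sample data, and the one-ID-per-row file layout are all assumptions made for illustration. Scanning the list on every checkout costs time proportional to the size of the list, while loading the IDs into a set once makes each check a constant-time lookup on average.

```python
import csv
import io

# Hypothetical sample data: one user ID per row (the real file held ~20 million).
SAMPLE_CSV = "1001\n1002\n1003\n"


def is_eligible_by_scan(csv_text, user_id):
    """Walk the whole list on every request: O(n) work per checkout."""
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0] == user_id:
            return True
    return False


def load_eligible_ids(csv_text):
    """Parse the list once (for example at startup) into a set."""
    return {row[0] for row in csv.reader(io.StringIO(csv_text)) if row}


# Built once, then reused by every request.
ELIGIBLE_IDS = load_eligible_ids(SAMPLE_CSV)


def is_eligible_by_set(user_id):
    """Average-case O(1) membership test against the preloaded set."""
    return user_id in ELIGIBLE_IDS
```

Both functions give the same answers. The difference is only in how much work each checkout triggers, which is what matters when 20 million people receive the same email.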
First, they tried restarting the service. Transactions would work for a little while, but the system would repeatedly get overwhelmed and stop working. To make things worse, the promo code leaked onto Slickdeals, which just magnified the problem.
“There was just a moment of paralysis,” Dan said. “I had built the whole thing, it was falling over, and you couldn’t push code to fix it in that moment.”
Luckily, the team had put in a circuit breaker that let them disable all promo codes in case of a problem. So they triggered the circuit breaker, which stopped everyone, including people with legitimate promo codes, from using them.
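A switch like that can be very small. The Python sketch below is a hypothetical illustration, not the code Dan's team wrote: the class and function names are invented, and it assumes that when the switch is off, checkout proceeds at full price without ever calling the promo validator.

```python
class PromoSwitch:
    """Manually operated breaker: one flag that turns all promo codes off."""

    def __init__(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

    def enable(self):
        self.enabled = True


def checkout_total(price_cents, promo_code, switch, validate):
    """Return the amount to charge, in cents.

    `validate` stands in for the (possibly slow) promo code service.
    It is only called while the switch is on, so a struggling validator
    cannot block checkout once the breaker has been tripped.
    """
    if promo_code and switch.enabled and validate(promo_code):
        return price_cents - price_cents // 10  # 10 percent off
    return price_cents
```

The key design choice is that the flag is checked before the validator runs. Tripping the breaker trades the discount for availability: legitimate codes stop working, but payments keep flowing.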
Everything turned out fine within the next couple of weeks. Dan brought in another developer, and they worked through the problems together to stabilize the service.
Still, it was a harrowing experience. Dan described how helpless he felt: “It was one of my worst moments being an engineer. Just that experience of being alone and there was nothing I could do,” he said. “I went into another room and called my wife and said, ‘I think I need a new job.’”
During these conversations, my interviewees’ faces reveal the strong emotions they felt during these worst moments of their careers. Luckily, teams today have help guiding them through incident management. Dan’s story wins the award for Most Paralyzing.