What’s the number one cause of alert fatigue besides problems with your infrastructure? Babies. Imagine being up all night with a screaming infant while also responding to alerts. This is our Developer Advocate Matthew Boeckman’s on-call horror story. He stepped into our confessional booth and told us his worst outage experience. Watch the video below, or read on to find out why we gave him the Most Debilitating award.
Matthew’s on-call horror story took place in 2012, when he was working at Craftsy, an online education and retail marketplace for passionate hobbyists in the crafting space. Their entire business was predicated on the ability of users to register and transact on their systems, so uptime and fast recovery from outages were absolutely key for the business to be successful.
They had built their entire infrastructure in Amazon Web Services, and being new to the cloud, they built multi-availability zone resiliency.
On July 2, 2012, Amazon experienced rolling outages in the Eastern US across all three availability zones. Each time an availability zone went down, Craftsy effectively lost their entire application.
On top of this already challenging issue, the incident was amplified for Matthew because his daughter had been born five days earlier, prematurely and with complications. Matthew was up all night with his daughter at the hospital. He says, “I was sitting with her in the Neonatal Intensive Care Unit for 43 hours while Amazon burned and my daughter recovered from being premature.” Throughout the night he would get alerts and respond, nurses would come in, AWS would have a different AZ failure, and he would respond to that. This cycle persisted for 43 hours, making it impossible for him to sleep.
AWS eventually fixed all of its problems, and Matthew finally got a chance to sleep. His team was just waking up, so he let them tie up loose ends and fix some other bugs.
Every horror story offers a learning opportunity. In this case, Matthew and his team learned a lot from an engineering standpoint, particularly the value of being N+1 in every AZ in a cloud infrastructure. They also learned they needed to hire smarter so there isn’t just one person on call. You never know, that one person could have a screaming infant in the NICU while trying to manage an entire infrastructure at the same time.