Being on-call is a necessary evil for DevOps and IT teams maintaining complex systems in an environment dedicated to CI/CD. Systems will break and services will experience outages. Software developers, operations support and IT teams need to take accountability for the services they build and maintain – which means taking on-call responsibilities. And, of course, those responsibilities come with their share of horror stories and sleepless nights.
While many teams use VictorOps to make on-call suck less, being on-call is inherently stressful. So, to figure out how to make on-call less stressful, we talked to a number of our own engineers and technical support teammates about their worst on-call stories and the things that bother them when they’re on-call.
Some on-call horror stories
Anyone who’s been on-call knows that it can disrupt any number of personal activities and events. So, we wanted to go over the things that suck about being on-call, along with a few anecdotes that show the value of on-call incident management software for improving employee morale while helping reduce mean time to acknowledge and mean time to resolve (MTTA/MTTR).
Check out our three highlighted on-call stories:
1) Erica Struthers, Technical Support Engineer
Let’s set the scene. It’s the Sunday before President’s Day. Our technical support team maintains a daily on-call schedule and rotations for weekend coverage, and on this particular weekend, Erica was on-call.
Until Sunday night, it was a typical weekend on-call. Nothing major had happened, and Erica fell asleep around 10 PM. Then, about an hour later, Erica received the first notification. A customer had written in about a critical issue and she had to get up to help troubleshoot the problem. After about an hour and a half of back-and-forth communication, Erica had helped the customer return to a stable state. But, just as Erica was about to rest her eyes and fall back asleep, she was paged a second time.
Once again, Erica communicated with the customer for about an hour and a half before coming to a resolution. And just as she thought that incident was nearly taken care of, Erica was notified by a third customer. By then, it was approximately 3 AM and Erica had slept for a little over an hour all night. After another hour or so of troubleshooting and back-and-forth communication with the customer, she was finally able to get some rest.
But, the story doesn’t end there. Erica woke up around 7 AM and was paged again multiple times throughout the morning: once around 8:15 AM, again around 9:30 AM and once more around 11:30 AM. The service hadn’t suffered an outage; Erica just happened to keep getting paged for critical customer incidents at very inconvenient times. Even though she was constantly engaged in technical support until nearly 6:30 PM on President’s Day, Erica kept her cool and did an amazing job of keeping our customers happy.
TL;DR – On-call sucks.
2) Mike Meredith, Principal Security Analyst
Next up is a not-so-merry Christmas story about being on-call. In 2017, Mike Meredith was on-call for a crisis involving one of our database cluster nodes. After further analysis, the team found that nearly half of the servers in one of our database clusters were critically low on storage, and one of them had actually gone offline due to insufficient storage.
Mike and a few team members worked throughout the night on Christmas Eve and throughout much of Christmas Day to stabilize the system. And, in this on-call scenario, there’s no sleeping on the job. Mike had to run commands that each took an hour or so to complete, and he had to be awake when each one finished so he could get started on the next step. This lasted throughout the night and into the morning of Christmas Day.
But, due to the hard work of Mike and the rest of the team, the incident never caused a customer-facing outage. (Big shoutout to these victors of on-call!)
TL;DR – On-call during Christmas sucks.
3) Chris Phelps, Principal Software Engineer
Chris worked at an organization that maintained 24/7 on-call support, and he was part of the second tier of support in case incidents needed to be escalated. In this specific case, there was an issue with a potentially backed-up data pipeline. The on-call responder (not Chris) received an alert with a runbook attached stating that a system component needed to be restarted.
But, the on-call responder was unsure of this solution and escalated the incident to Chris in the middle of the night. Chris picked up the phone and listened to the on-call responder’s problem. Basically, Chris ended up telling him to execute the instructions included in the runbook.
The moral of the story is that Chris was woken up for an incident that didn’t need him. With highly integrated monitoring, alerting and collaboration tools, teams can automate processes, create deeper visibility into incidents, build more self-healing architecture and alert on-call responders only when they need to be alerted.
TL;DR – Alert fatigue sucks. But, on-call automation can help mitigate it.
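To make that concrete, here’s a minimal Python sketch of the idea behind Chris’s story: if the runbook step attached to an alert can be automated, try it first and only page a human when it fails. Everything in it (the `Alert` class, the `runbook_action` field, the `route_alert` function and its return values) is a hypothetical illustration, not part of VictorOps or any other real tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Alert:
    """A monitoring alert. All fields are hypothetical, for illustration only."""
    service: str
    message: str
    # Optional automated version of the runbook step (e.g. "restart the
    # component"). It should return True if the fix worked.
    runbook_action: Optional[Callable[[], bool]] = None


def route_alert(alert: Alert) -> str:
    """Decide whether anyone needs to be woken up for this alert."""
    if alert.runbook_action is not None:
        try:
            if alert.runbook_action():
                # The automated runbook step fixed it, so nobody gets paged.
                return "auto-resolved"
        except Exception:
            # The automation itself broke; fall through to a human.
            pass
    # No automation available, or it didn't work: page the first tier.
    return "page-tier-1"


# A backed-up pipeline whose runbook says "restart the component".
restartable = Alert(
    service="data-pipeline",
    message="pipeline may be backed up",
    runbook_action=lambda: True,  # stand-in for a real restart call
)
print(route_alert(restartable))
print(route_alert(Alert(service="data-pipeline", message="unknown failure")))
```

In this sketch, the restartable alert never reaches a person, while the alert with no automated action still pages the first tier. A second-tier engineer like Chris would only hear about it if the first tier genuinely needed help.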
(Note: For more on-call nightmares, please check out our previous on-call horror story series.)
Things that suck about being on-call
In addition to specific on-call experiences that can ruin a person’s day, there’s a constant mental toll from simply being on-call. We polled engineers across our organization to compile a list of quotes about mildly frustrating things associated with being on-call:
“I’ve been on dates where my phone starts sounding obnoxious in the middle and I have to duck out early.”
“I’ll instinctively wake up a few times during the nights I’m on-call just to check my phone and make sure I haven’t missed anything, even if I get paged 0 times.”
“Ski weekends are difficult. You have to make sure you stay in coverage, make sure you can hear your phone through a jacket, and then, if bad stuff actually happens, frantically race back to the car and a destination with WiFi.”
“I’ve had sports practices where I have to awkwardly put my phone in a pocket and do the practice with it, or keep it on the sidelines and check every few minutes.”
“I’ve missed a page because I was in the shower and forgot to bring my phone in the bathroom with me.”
“Trail runs are hard. I’ve lost coverage in the middle and had to run as fast as I can back to a place where I’m good. The feeling of actually running for job security is unique. The same could be said for bike rides.”
“I can’t get my nails done because I wouldn’t be able to respond if I were paged.”
“You have to bring your computer everywhere with you – whether you’re going to dinner with friends, picking someone up at the airport, etc. Then, you have to be prepared to pull off the road and find a place with WiFi.”
Making on-call suck less
If you’ve been on-call, none of these stories are new to you. But, there are now ways to make on-call suck less. Whether it’s through customized personal paging policies, automated alert routing and escalation policies, flexible on-call schedules, manual on-call shift adjustments, or snooze functionality, on-call incident management software is purpose-built for on-call teams experiencing these problems. Reduce MTTA/MTTR while simultaneously keeping up employee morale.
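If you want to track whether any of this is working, here’s a rough Python sketch of how MTTA and MTTR could be computed from incident timestamps. It assumes MTTA means mean time to acknowledge and MTTR means mean time to resolve, both measured from the moment the alert fired. The incident records, field names and the `mean_minutes` helper are made up for illustration and don’t come from any real incident data or API.

```python
from datetime import datetime
from statistics import mean

# Made-up sample incidents; the field names are hypothetical.
incidents = [
    {
        "triggered": datetime(2024, 1, 1, 23, 0),
        "acknowledged": datetime(2024, 1, 1, 23, 4),
        "resolved": datetime(2024, 1, 2, 0, 30),
    },
    {
        "triggered": datetime(2024, 1, 2, 1, 0),
        "acknowledged": datetime(2024, 1, 2, 1, 10),
        "resolved": datetime(2024, 1, 2, 2, 30),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average gap in minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
# prints "MTTA: 7.0 min, MTTR: 90.0 min"
```

Watching these two numbers over time, alongside how often people get paged overnight, gives a team a simple way to see whether changes to routing and escalation are actually helping.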
Let us help you make on-call suck less. Register to see the recording of our free webinar, How to Make On-Call Suck Less, to learn how you can level-up your incident response while keeping on-call employees happier.