Aaron Atwell - August 17, 2016
Last night at three a.m., Dale was awake and sweating, frantically trying to fix a technical problem that broke his company’s online store. The shopping cart page had been down for four minutes, and many international customers were trying to place orders. How many of them gave up?
After waking Andy for information about the latest system updates, Dale identified and fixed the problem. The site was back up by four fifteen a.m. Heart pounding, cortisol rushing through his body, Dale got back in bed to try to sleep for a couple of hours before heading into work.
I have been thinking a lot about people like Dale, and about burnout. It was a big topic at the Monitorama conference, which I attended alongside the people who monitor, configure, and program the infrastructure that powers IT systems at companies like Google, Airbnb, Dropbox, and Twitter.
If you are not part of this industry, you may not realize that on-call IT teams are the watchdogs of the online world we all love, hate, and rely on. If you’re tweeting, booking a hotel room, or getting files from your Dropbox account, you expect services like these to work all the time. When they don’t work as expected, imagine the impact on customers, investor confidence, and the general public’s perception. Not good, to say the least.
On-call teams have the exhausting and often thankless job of answering alarms and fixing issues, often under extreme pressure. This can lead straight to burnout.
According to the article Minds Turned to Ash, by Josh Cohen, “The exhaustion experienced in burnout combines an intense yearning for [a] state of completion with the tormenting sense that it cannot be attained, that there is always some demand or anxiety or distraction which can’t be silenced.” In on-call support, we can take that statement literally.
But how do the healthy support teams avoid burnout? Here are three tactics that speakers shared. Perhaps they’ll be applicable to your life as well.
Since customers are taught to expect technology to work impeccably 24 hours a day, it makes sense that the health of a company’s infrastructure has a huge impact on the health of the people responsible for keeping those services running. If a company’s infrastructure isn’t maintained well, constant alerts could wear people down and significantly impact their productivity. More problems with the infrastructure? More problems to fix and more stress.
Given these demands, many Monitorama speakers discussed which activities to monitor, which information to visually track, and exactly when it is appropriate to wake people on-call to deal with issues.
Some alerts fire off, for example, when a system is restarting or giving a mild warning or status notification. When an alert goes off and there is nothing to do, this can become a huge problem for three reasons.
First, imagine you are focused, working on an important project, and making great progress. Suddenly an alarm goes off. You stop what you’re doing, focus on the alarm, and realize it’s a minor issue that doesn’t require your attention. Then you have to refocus on your important project and get your head back in the game. This is context switching, and there are many negative impacts. In sum, there is a huge loss in productivity.
Second, if an alert isn’t actionable, eventually it will become noise and be ignored. It will cease to help people solve issues. For a great description of this phenomenon, called “normalizing deviant behavior,” read Chris Gervais’s blog post. Better yet, watch the webinar recording that is attached to that post.
Third, sleep is one of the most healthy, important, delightful activities, right? If you’re fast asleep and an unactionable alert wakes you up, need I say more about the impact?
Not only should alerts be actionable, but they should also only wake someone up if the alert is pointing to a problem that impacts customers. Legitimate wake-up calls should result in fixes that speed up service, improve functionality, and ensure systems are up and running.
At Monitorama, many speakers stated that alerts should be dismantled, abolished, and even deleted if they don’t impact customers.
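For readers who think in code, here is a minimal sketch of that rule of thumb. It is not any speaker's actual tooling: the names `Alert`, `Action`, and `route` are made up for illustration, and the middle "ticket" bucket (actionable, but not hurting customers yet) is my own assumption about what to do with alerts that deserve attention during working hours rather than at three a.m.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE = "page"      # wake the on-call engineer now
    TICKET = "ticket"  # assumption: look at it during working hours
    DELETE = "delete"  # nobody can act on it, so it is only noise


@dataclass(frozen=True)
class Alert:
    # Hypothetical fields; real alerting tools model these differently.
    name: str
    actionable: bool          # is there something a human can actually do?
    customer_impacting: bool  # are customers affected right now?


def route(alert: Alert) -> Action:
    """Decide what an alert is allowed to do to a human being."""
    if not alert.actionable:
        return Action.DELETE
    if alert.customer_impacting:
        return Action.PAGE
    return Action.TICKET


if __name__ == "__main__":
    examples = [
        Alert("shopping cart page down", actionable=True, customer_impacting=True),
        Alert("disk filling up slowly", actionable=True, customer_impacting=False),
        Alert("service restarted normally", actionable=False, customer_impacting=False),
    ]
    for alert in examples:
        print(f"{alert.name}: {route(alert).value}")
```

The point of the sketch is the ordering of the checks: an alert has to earn the right to wake someone, first by being actionable and then by affecting customers.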
When people are on call, often for a week at a time, their daytime responsibilities don’t go away. That’s why it’s especially important that companies implement best practices for on-call scheduling, to help their people stay healthy and avoid burnout.
So the next time you’re watching Netflix, using Airbnb to book a weekend retreat, and packing the bag you ordered on Amazon, remember the people behind the scenes who keep all your favorite sites up and running. They’ve got your back.
If you want to learn more about this topic and listen to industry experts from the conference, here are links to the talks at Monitorama. And here are some of the sessions I especially enjoyed.
• Brian Smith: The Art of Performance Monitoring
• Joey Parsons: Monitoring and Health at Airbnb
• Nicole Forsgren: How Metrics Shape Your Culture