Your Guide to Surviving On-Call
Being on-call is often a nerve-wracking, time-consuming experience for engineers, but it doesn’t have to be. Using DevOps processes and tenets can save you and your team a world of hurt in an on-call firefight. Follow these tips and tricks to make your next on-call experience suck less.
What Happens When You’re On-Call?
Ever sworn your cellphone was vibrating, only to check and find no calls or alerts?
You’re not alone if you feel phantom vibrations when you’re on-call. It happens most often when you’re expecting (or dreading) an incoming call or text. In fact, it’s so common that it has a medical name: Phantom Vibration Syndrome.
“With the phantom vibrations, the brain sometimes misinterprets sensory input according to the preconceived hypothesis that a vibrating sensation will be coming from the phone. In other words, it seems smartphone users are just so primed for, and attentive to, the sensation of their phone going off that they simply experience the occasional false alarm.” (source)
What do people do to stop Phantom Vibrations?
Or, you could stop the dread and expectation at the source: create a playbook to deal with any and every on-call issue that arises.
On-call affects your sleep
According to a study from the National Institutes of Health, being on-call is bad for a good night’s rest in a lot of ways.
“Rated sleep quality was lower, and sleepiness was higher during the subsequent day. It was suggested that the effects were due to apprehension/uneasiness induced by the prospect of being awakened by an alarm.”
Specifically, being on-call results in less restorative sleep, less slow-wave sleep, and an overall higher heart rate, all of which add up to a more tired day after. On-call stress has been proven to affect on-shift performance, recovery, and long-term health and well-being…when executed in the traditional way.
So what can be done to make on-call suck less? Thankfully, there are steps you can take before, during, and after an event that will make the entire on-call experience much easier on your long-term health and well-being. Plus, you’ll improve your on-shift performance and recover faster as well.
What To Do Before You’re On-Call
Step 1: Set Up Alerts
The correct configuration of your alerts is crucial. A solid alert configuration covers three main aspects.
Personal contact methods
Be sure that the alerts are getting to you the way you prefer by customizing how you get notified, whether that’s SMS, email or phone. You can also use more than one way of communicating if you’re afraid of missing anything. (For example, first send a text and then call me if I don’t respond in 5 minutes.)
The feeling of having no back-up can push a person into a panicked state where they begin making bad decisions. You can easily avoid this by having an escalation plan in place ahead of time. This sort of escalation policy is easy to set up so that everyone on the team knows when more people need to be involved.
When setting up alerts, route them to the team most capable of solving the problem. There are also times when specific alerts should be seen by additional people, simply for the sake of knowing what’s happening with the infrastructure, as in the case of a CTO or SVP of Engineering.
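As an illustration, the “text first, call after 5 minutes, then escalate” pattern described above can be sketched as a simple data structure. Everything here is an assumption for the sake of the example (the contacts, methods, and timings are not from any particular alerting tool):

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    """One rung of a hypothetical escalation ladder."""
    contact: str          # who gets notified
    method: str           # "sms", "call", "email", ...
    fire_at_minute: int   # fires once the alert has gone unacknowledged this long

# Illustrative policy: text the on-call engineer immediately, call them
# if there's no acknowledgment in 5 minutes, then pull in the team lead.
POLICY = [
    EscalationStep("on-call engineer", "sms", fire_at_minute=0),
    EscalationStep("on-call engineer", "call", fire_at_minute=5),
    EscalationStep("team lead", "call", fire_at_minute=15),
]

def steps_due(minutes_unacknowledged: int) -> list:
    """Every step that should have fired by now."""
    return [s for s in POLICY if s.fire_at_minute <= minutes_unacknowledged]
```

Writing the policy down like this (in whatever tool your team uses) means nobody has to decide, mid-incident, when it’s time to wake someone else up.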
Step 2: Pick the right handoff day
Try doing the on-call handoff on Wednesday. Most everyone will be in the office, and you won’t have the distractions of the previous or the coming weekend interfering with an orderly handoff.
A day can make a big difference.
Mondays and Fridays are frequently interrupted by long weekends and holidays. If your handoff is a manual process or you’re passing a physical phone around, forgetting to do the switch on a Friday could mean someone’s stuck with an extra weekend on-call.
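If your rotation tooling lets you script the schedule, the mid-week rule is easy to automate. A minimal sketch (the Wednesday choice comes from the advice above; the function name is illustrative):

```python
import datetime

def next_handoff(today: datetime.date, handoff_weekday: int = 2) -> datetime.date:
    """Date of the next on-call handoff.

    Weekdays follow Python's convention (Monday=0 .. Sunday=6),
    so the default of 2 means Wednesday. If today is already the
    handoff day, hand off today.
    """
    days_ahead = (handoff_weekday - today.weekday()) % 7
    return today + datetime.timedelta(days=days_ahead)
```

A scheduled reminder built on something like this keeps a holiday Monday or a long-weekend Friday from silently extending someone’s shift.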
Step 3: Hold a Hand-Off Meeting
Most people don’t pay too much attention to the monitoring system when they’re not on-call. This is a good thing. We all need to regain our sanity and tune out when we can. But this means that when your turn does come up, you may not be prepared for everything that’s happening. New issues you haven’t heard about may have arisen in the past week.
Share the Knowledge
Getting everyone on the team together for a short weekly meeting on handoff day can greatly ease the transition. Whoever was on-call can talk about the big events of the previous week and if they’ve noticed anything trending in the wrong direction, they can provide some early warning. This is also a chance for the whole team to see and hear who is on-call, and to check in on how the platform is doing.
Step 4: Use the Buddy System
If you work in a large enough organization, consider having someone from each of the disciplines (development, infrastructure, database management) on-call at the same time. You can split up alerts based on who should be responding, and everyone on-call will know they can contact the others if they need help getting out of a jam.
Step 5: Go Mobile
By adopting technologies and policies that allow for remote problem solving you can reduce time-to-resolution and save marriages at the same time.
These days, you carry a computer in your pocket. Smart phones may not be an ideal way to manage complex systems, but apps exist that can enable remediation for many problems without having to go home or hunt for a wifi hotspot.
Crisis Management Tips
Remember these tips when you’re in the trenches.
Tip 1: Clear Heads Solve Problems
We’ve all experienced the situation where the solution to a problem is staring you in the face but you’re too distracted to see it. Jumping right into a problem when you’re still thinking about that jerk that cut you off in traffic will only increase your stress and decrease your focus.
Tip 2: Don’t Try to Take That Hill All By Yourself
Be wary of hero culture.
Techies love to tell stories of when they singlehandedly saved their Fortune-500 company from disaster using only their wits. Here’s a hint: most of these stories are nonsense. Taking sole responsibility for solving a major problem puts too much pressure on you, and makes it harder to focus.
Besides, you’re managing a big distributed platform, right? There’s a good chance that no one in your organization perfectly understands every moving part. Even if you are dealing with things you know, a second set of eyes might notice that minor typo in the config that you didn’t see.
Tip 3: Don’t Wake Up the Whole Team
Preserve the reserves!
When you reach hour 12 of the recovery process, you’ll be grateful that there are fresh, rested people ready to take over and give the first responders a break.
Also, work in shifts for extended crises.
Sometimes the problem is so serious that, even when the root cause is found, it’s clear that it will take hours or even days to get it fixed. Get enough people on the problem to fix the problem, and let everyone else keep on with whatever they’re doing.
Tip 4: Remember Your Team is Just as Stressed as You Are
When you’re in the thick of things, it’s easy to get angry. Maybe someone seems more distracted than you’d like. Maybe someone makes a joke to ease the tension and you don’t think it’s funny. Maybe someone gets short-tempered with you. Relax, and put things in perspective.
Some advice: a hundred years from now, nobody will care what happened today.
Remember that everyone who’s dealing with the issue is feeling the pressure, maybe even more than you are. Lashing out in anger will only escalate the stress on everyone, so don’t do it. Try to keep an even keel, and you’ll notice that your calm demeanor can help defuse other people’s panic and anger.
Tip 5: Put Off the Blame Discussion
Try to foster a culture where troubleshooting happens in a safe space.
Sometimes, a problem will arise because of a human error. It’s important to know what happened to cause the problem so that everyone can learn from the mistake. But that doesn’t mean that the troubleshooting and repair process should grind to a halt in favor of finger-pointing.
Your Post-Mortem Checklist
Tools for the next incident.
(Just in case.)
As a team:
• What happened?
• Who was affected?
• What was done to fix it?
• How was the business affected?
• What can be done to prevent this from happening again?
As an individual, team members should be able to account for the following (without retribution):
• What actions did they take, and when?
• What effects did they observe?
• What expectations did they have?
• What assumptions did they make?
• What is their understanding of the timeline of events as they occurred?
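One lightweight way to make this checklist stick is to turn it into a template that every incident report starts from. This sketch simply restates the questions above as a fill-in-the-blanks structure; the field names and shape are assumptions, not a prescribed format:

```python
# A minimal post-mortem template built from the checklist above.
TEAM_QUESTIONS = [
    "What happened?",
    "Who was affected?",
    "What was done to fix it?",
    "How was the business affected?",
    "What can be done to prevent this from happening again?",
]

INDIVIDUAL_QUESTIONS = [
    "What actions did you take, and when?",
    "What effects did you observe?",
    "What expectations did you have?",
    "What assumptions did you make?",
    "What is your understanding of the timeline of events?",
]

def blank_postmortem() -> dict:
    """Start a new incident report with every question unanswered."""
    return {
        "team": {q: None for q in TEAM_QUESTIONS},
        "individual": {q: None for q in INDIVIDUAL_QUESTIONS},
    }
```

Starting from a blank template for every incident keeps the discussion focused on the questions, not on who gets blamed.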