Mike Meredith - January 25, 2013
Time on-call is a fact of life working in a DevOps or TechOps environment, but for a lot of us it’s the worst part of the job. Working with a 24/7 platform, on-call means getting up in the middle of the night, interrupting weekend time, and putting personal life on hold. And it’s stressful! It’s easy to feel alone during a crisis, not wanting to bother coworkers but needing help, advice, or just another set of eyes.
Here are a few easy things you can do to lessen the pain your team feels during on-call duty and help speed up issue remediation:
1: Pick the right handoff day. Mondays and Fridays are frequently interrupted by long weekends and holidays. If your handoff is a manual process or you’re passing a physical pager around, forgetting to do the switch on a Friday could mean someone’s stuck with an extra weekend on-call. Try doing the on-call handoff on Wednesday. Most everyone will be in the office, and you won’t have the distractions of the previous or the coming weekend interfering with an orderly handoff.
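If your handoff is still a manual process, even a tiny script can take the guesswork out of who is up next. Here is a minimal sketch in Python, assuming a simple weekly round-robin; the team names and the `next_handoff` and `rotation` helpers are made up for illustration and don't come from any particular scheduling tool.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() counts Monday as 0, so Wednesday is 2


def next_handoff(today: date) -> date:
    """Return the first Wednesday on or after `today`."""
    days_ahead = (WEDNESDAY - today.weekday()) % 7
    return today + timedelta(days=days_ahead)


def rotation(team: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Pair each weekly Wednesday handoff with the person taking over."""
    first = next_handoff(start)
    return [
        (first + timedelta(weeks=i), team[i % len(team)])
        for i in range(weeks)
    ]


if __name__ == "__main__":
    # Hypothetical three-person team, starting from a Friday.
    for handoff, engineer in rotation(["alice", "bob", "carol"], date(2013, 1, 25), 4):
        print(handoff.isoformat(), engineer)
```

A real schedule would also need to handle vacations and swaps, which this sketch deliberately leaves out.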
2: Hold a handoff meeting. Most people don’t pay too much attention to the monitoring system when they’re not on-call. This is a good thing. We all need to regain our sanity and tune out when we can. But this means that when your turn does come up, you may not be prepared for everything that’s happening. New issues you haven’t heard about may have arisen in the past week.
I’ve found that getting everyone on the team together for a short weekly meeting on handoff day can greatly ease the transition. Whoever was on-call can talk about the big events of the previous week. If they’ve noticed anything trending in the wrong direction they can provide some early warning. This is also a chance for the whole team to see and hear who is on-call, and to check in on how the platform is doing.
3: Use the buddy system. If you work in a large enough organization, consider having someone from each of the disciplines (development, infrastructure, database management…) on-call at the same time. You can split up alerts based on who should be responding, and everyone on-call will know they can contact the others if they need help out of a jam.
This can lead to quicker remediation when problems arise. I know that sometimes at 3:00 in the morning, the solution to a problem can be staring me in the face but I’ll be too sleep-deprived to see it. A second set of eyes with a different perspective can be invaluable at a time like that.
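Splitting alerts by discipline can be as simple as a lookup table. The sketch below is one way it might look in Python, assuming each alert is already tagged with a discipline; the `ON_CALL` table, the names in it, and the `route_alert` function are hypothetical.

```python
# Hypothetical on-call roster for the current week: discipline -> person.
ON_CALL = {
    "development": "dana",
    "infrastructure": "ivan",
    "database": "dee",
}


def route_alert(discipline: str, on_call: dict[str, str]) -> tuple[str, list[str]]:
    """Return the primary responder and the buddies they can call for help."""
    if discipline not in on_call:
        raise KeyError(f"nobody is on-call for {discipline!r}")
    primary = on_call[discipline]
    # Everyone else on-call this week is a potential second set of eyes.
    buddies = [person for d, person in on_call.items() if d != discipline]
    return primary, buddies


if __name__ == "__main__":
    primary, buddies = route_alert("database", ON_CALL)
    print(f"page {primary}; buddies: {', '.join(buddies)}")
```

The point of returning the buddies along with the primary is that the person being paged sees right away who else is awake and reachable.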
4: Make an escalation plan. There’s no worse feeling than being on-call, confronting a problem you can’t fix on your own, and not knowing where to turn for help. That feeling of having no backup can bring a person to a panicked state where he or she starts making bad decisions.
Step one of preventing this is keeping a contact list for everyone in the ops crew both up to date and available. Publish it on a wiki or keep it in LDAP or Active Directory, whatever you like. But make sure the information is out there and everyone can get to it wherever they are.
Step two is setting an orderly process for getting more people involved. Make sure each team member’s areas of expertise and skill sets are known to everyone else. For each discipline, have a published list of who to call and in what order. Ensure that someone takes ultimate responsibility as the end point for escalation.
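To make the "who to call and in what order" idea concrete, here is a rough Python sketch of an ordered escalation list with a final owner. It assumes you have some way of finding out whether a person acknowledged a page; that check is passed in as a function, and the `ESCALATION` table, the names, and `escalate` itself are all hypothetical.

```python
from typing import Callable

# Hypothetical published call order per discipline. The last entry is the
# end point for escalation, the person who takes ultimate responsibility.
ESCALATION = {
    "database": ["dee", "dmitri", "ops-manager"],
    "infrastructure": ["ivan", "ingrid", "ops-manager"],
}


def escalate(chain: list[str], acknowledges: Callable[[str], bool]) -> str:
    """Walk the call list in order and return whoever ends up owning the issue."""
    if not chain:
        raise ValueError("escalation chain must not be empty")
    for contact in chain[:-1]:
        if acknowledges(contact):
            return contact
    # Nobody earlier in the list picked up, so the issue lands on the end point.
    return chain[-1]


if __name__ == "__main__":
    # Pretend only dmitri answers his phone tonight.
    owner = escalate(ESCALATION["database"], lambda person: person == "dmitri")
    print("issue owned by", owner)
```

However you store the list, the important design choice is that the chain never runs out: the last name on it always owns the problem.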
5: Go mobile. A lot of monitoring systems seem to be stuck in the dark ages, and the modes they use to get information out can be antiquated. Email, SMS, and even text pagers can be effective ways of hearing about a problem, but they’re generally lousy for doing something about it. For most of the time that I’ve been in the industry, actually doing something meant getting to a computer and/or getting to a data center, all too often at the expense of dinner, poker night, or some other attempt at having a normal social life.
These days, of course, most people carry a computer in their pocket. Smartphones may not be an ideal way to manage complex systems, but apps exist that can enable remediation for a lot of problems without having to go home or hunt for a Wi-Fi hotspot. By adopting technologies and policies that allow this potential to be realized, you can reduce time-to-remediation and save marriages at the same time.
Applying these ideas in your organization can help make on-call life easier. In a 24/7 shop this can be a key factor in job satisfaction. As a DevOps manager I love helping people do their jobs better so they can live their lives better. That’s one reason why I’m so excited to be working for VictorOps. In the coming months, we’ll have a lot more to say on this subject. Stay tuned!
(Photo credit: Sarah Nuese)