Disaster. That word gets used a lot in our circles; it’s a trigger for the deepest FUD argument a vendor or colleague can make. A disaster can be defined in any number of ways: the number of customers impacted, revenue loss, or the number of systems impacted. There are many metrics by which a disaster will be judged. For an on-call team, however, the tale of a disaster is told in the minutes and the hours.
Much like a security breach, the reality of a systems disaster is not so much “Can it happen to us?” as “It hasn’t happened to us… yet.” Backup, restore, failover, resilience, and scale are all terms that we focus on in disaster planning. We write Disaster Recovery Plans. We test failover patterns. We restore data from backups. But how much do we prepare our on-call teams for the long-haul disaster event?
Most Disaster Recovery Plans focus on what are perceived to be the highest-risk and highest-probability events. Highest risk? A comet strikes our office headquarters. Gosh, that would be bad, right? The odds are pretty low, so let’s not really spend time thinking about that particular eventuality. So many disaster planning conversations end with “Well, if that actually happens, we’ll be screwed anyway.” You are honestly admitting that you don’t know how you’ll react in that eventuality.
That’s probably a reasonable response to the myriad of factors that will create and define your disaster. While you can’t game every potential disaster, you can be thoughtful beforehand about how your team will react and perform during this most critical time.
Early in the event, even in the middle of the night, team dynamics are pretty straightforward. Everyone is keyed up from adrenaline and the excitement of the moment. Engineers love a good mystery, and will happily roll up their sleeves and dig in. Two hours later, with several areas of investigation coming up empty, business pressure to resolve a complex incident escalating, and the effects of adrenaline long worn off, things get more touchy. Ten hours in, the wheels are going to come off.
Ten hours is by no means the upper bound on how long a team may be embroiled in a problem. As a few recent events have highlighted, a disaster can occupy your team for 24 hours or more. My heartfelt thanks and empathy go out to the teams at GitLab and Instapaper, for example: thanks for being so transparent and sharing full postmortem reports with the community, and empathy because a 20+ hour battle is draining on every level: physical, emotional, and intellectual.
You don’t have to know much about human nature to appreciate the behavior that these kinds of conditions trigger. Interrupted plans, stress, exhaustion, and frustration can turn the most congenial colleague into a Joe Pesci-esque sparkplug of rage and angst. Interactions become tense, old wounds flare up, and the human beings working the problem start to devolve to the general emotional discipline of a six-year-old. We’ve all seen a colleague or friend bested by these kinds of conditions.
How you manage an incident should serve as a pattern for how you would respond to a Disaster (with a capital D). Many Disasters start life as a simple incident: a quiet, unassuming alert about disk space somewhere. A couple of webservers quit responding. A temperature sensor triggered. Before you know it, you are in the fight of your life, and good incident management practices can ease the path.
Know beforehand which people serve as key escalation points within your business. I’ve previously discussed the role of a communicator in incidents (The Scribe). Whether you adopt that idea or roll your own, plan out how your team will communicate with a wider audience. Ideally, you use the same channels, tools, and people that handle communication during regular incidents.
People have this unfortunate feature of requiring sleep. Dedication, adrenaline, and passion can go a long way, but our brains need rest. Cognitive function deteriorates rapidly as we pass 18 hours of wakefulness. The incident coordinator must be mindful of breaks, rest periods, and rotations to keep a team functioning.
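One way an incident coordinator might keep track of this is sketched below. It is a minimal illustration, not a prescribed tool: the `Responder` record, the `awake_since` field, and the two-hour warning margin are all assumptions made for the example. Only the 18-hour figure comes from the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Threshold taken from the 18-hour figure mentioned above.
MAX_AWAKE = timedelta(hours=18)


@dataclass
class Responder:
    name: str
    awake_since: datetime  # hypothetical field: when this person last woke up


def due_for_rest(responders, now, warn_margin=timedelta(hours=2)):
    """Return names of responders at or near the limit, longest awake first."""
    cutoff = MAX_AWAKE - warn_margin
    flagged = [r for r in responders if now - r.awake_since >= cutoff]
    flagged.sort(key=lambda r: r.awake_since)
    return [r.name for r in flagged]


# Illustrative data only.
now = datetime(2017, 3, 1, 4, 0)
team = [
    Responder("alice", datetime(2017, 2, 28, 7, 0)),    # awake 21 hours
    Responder("bob", datetime(2017, 2, 28, 20, 0)),     # awake 8 hours
    Responder("carol", datetime(2017, 2, 28, 11, 30)),  # awake 16.5 hours
]
print(due_for_rest(team, now))  # ['alice', 'carol']
```

The warning margin is there so the coordinator can line up a replacement before someone crosses the threshold, rather than after.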
During a Disaster, the team is going to try things. Make changes. Begin restore processes. Investigate potential remedies. All of this must be captured in a reasonably coherent way. The Disaster document is going to serve as the basis for your eventual post-mortem. It’s going to contain the talking points for your outward communication with the business. It can serve as an onboarding document as secondary escalations kick in, or as teams rotate off shifts.
Most importantly, after you’ve solved the specific event, you have to come back and clean all this up. Your digital environment is likely to closely resemble the physical space your team inhabited during the fight: crumpled piles of paper, half-eaten pizza crusts, whiteboards with nonsensical boxes hastily scribbled to convey an idea. You’ve got to clean all this up and return the system to normal operations. Your change document is your guide.
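A shared document or chat channel is usually enough for this, but the shape of the record matters more than the tool. As a rough sketch, assuming a hypothetical `DisasterLog` with invented field names, each entry might capture who did what and when, plus a flag for anything that has to be undone afterward:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Entry:
    timestamp: datetime
    author: str
    action: str
    needs_cleanup: bool = False  # must be reverted once the event is over
    cleaned_up: bool = False


@dataclass
class DisasterLog:
    entries: list = field(default_factory=list)

    def record(self, author, action, needs_cleanup=False, when=None):
        """Append one action to the timeline and return the new entry."""
        stamp = when or datetime.now(timezone.utc)
        entry = Entry(stamp, author, action, needs_cleanup)
        self.entries.append(entry)
        return entry

    def outstanding_cleanup(self):
        """Changes still waiting to be reverted after the event."""
        return [e for e in self.entries if e.needs_cleanup and not e.cleaned_up]

    def handoff_summary(self):
        """Plain-text timeline for people rotating onto the incident."""
        return "\n".join(
            f"{e.timestamp:%H:%M} {e.author}: {e.action}" for e in self.entries
        )


# Illustrative entries only; the hosts and actions are made up.
log = DisasterLog()
log.record("alice", "Paged database on-call", when=datetime(2017, 3, 1, 2, 5))
tmp = log.record("bob", "Disabled nightly batch job", needs_cleanup=True,
                 when=datetime(2017, 3, 1, 2, 40))
print(log.handoff_summary())
print(len(log.outstanding_cleanup()))  # 1
```

The same list serves three purposes named above: the post-mortem timeline, the handoff summary for a fresh shift, and the cleanup checklist once the system is stable.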
Distractions can come in many forms, and keeping your team focused on working the problem is your only path forward. Perhaps the most demoralizing distraction, and one certain to appear, is a leader, a manager, or an executive who routinely reminds the team how bad this is. The pressure from the business will be huge. It’s human nature to try to remind everyone how important it is to recover. You must block this kind of input to the team.
In all my years of managing teams and responding to Disasters, I’ve never seen a team member who somehow failed to grasp the gravity of the situation. I have never even heard of a team member who, facing a full-scale database failure, for example, decided that they should take a casual approach to remediation. Hammering home in insistent, menacing tones how bad this is will accomplish nothing more than to elevate already massively spiked stress and anxiety in the human beings on your team.
Fostering panic will not help.
We are all, ultimately, human beings. Each of us, whether leading the team, working the problem, or moderating communication, owes our colleagues our best. The natural inclination in a close-knit team is to rely on long-established bonds of trust to smooth over harsh words. We must not take those bonds of trust for granted. Help each other out. When someone inevitably proposes a terrible idea, don’t tear them down. They’ve been awake for 30 hours, after all. If you see team members struggling, pull them aside. A quiet reassurance, a compliment, or an expression of gratitude will go a long way toward rebuilding a human being’s state of mind.
Emotional health and balance are deeply connected with cognitive performance. The best response to a Disaster is calm, collaborative, and focused teamwork. Mindful of the needs of each other, we can meet these unpleasant eventualities with the best parts of our nature. I sincerely hope none of you have the opportunity to experience lots of iterations in Disaster management. In lieu of iterations, I hope that these suggestions can help ease your pain when the worst events strike.