Matthew Boeckman - December 13, 2017
The Readiness phase of the Incident Lifecycle is the time a team spends focused on learning—about incidents, systems, and themselves. In Readiness, we move the Analysis phase forward into actionable steps to improve. This may be inclusive of architecture and application changes, creating or updating runbooks, or Game Days.
Game Days represent a tantalizing opportunity for DevOps teams, while being fraught with danger. When one goes well, a Game Day can really move a team forward in its ability and confidence to manage failure scenarios. Done poorly, it can result in downtime and acrimony. In very large organizations, or teams with high engineering maturity, Game Days are a routine matter, but how can your team gain those same benefits?
While many examples have existed through the years, Jesse Robbins is widely credited with coining the phrase and popularizing the exercise of a Game Day while in his role as “Master of Disaster” at Amazon. Netflix has similarly popularized their Simian Army tools, which create some excitingly chaotic ways to break your cloud infrastructure. Another tool in this space to consider is Gremlin, which offers “Failure as a Service”.
The central idea behind a Game Day is to prepare systems and teams to be resilient and rehearsed. Readiness for systems is always inclusive of things like delivery automation, circuit breakers, and independent scaling; readiness for teams must include practical exercises in breaking and fixing things. In all other Incident Management professions, practitioners spend focused time planning and rehearsing responses to different scenarios. Why not DevOps?
I know, you don’t even have to say it. “Amazon and Netflix and Google and Facebook are truly incredible engineering organizations,” you were about to say, “but we’re just 25 people trying to get by. How do we carve off time and focus to execute something like this, without proverbially shooting our foot off?”
Game Days are a risky and tough proposition for teams growing their Incident Management practice, no question. Moreover, imagine the exciting conversation ahead of you while you sit down with the business leaders and slowly and calmly describe your plan to intentionally break things!
An important first step is determining where you are going to simulate failure. Ideally, a non-production environment can be used as your team starts practicing how to practice. Staging and testing environments provide a low-risk proxy learning environment for teams to experiment. If this is your first go at a Game Day, it’s probably best to start somewhere with a containable blast radius. As you get used to both simulating disaster and working together to resolve issues, move your Game Days into production.
I’ve often found myself down a rabbit hole of debate within a team weighing the merits of using a proxy versus production environment. The basic problem of proxy environments comes down to the value of learning in a simulation: how well does it approximate actual production behavior?
Conversely, running an exercise in production is surely real enough, but at what cost to the business? I think the exercise of error budgets here is helpful; if you can roughly quantify the effort required to simulate production, you can weigh it directly against the downtime you’re proposing in production.
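To make the error budget idea concrete, here is a minimal sketch of the arithmetic. The 99.9% availability target, the 30-day window, and the function names are assumptions chosen for illustration; substitute whatever your team has actually committed to.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target allows over the window.

    slo is a fraction such as 0.999 (an assumed target, not a recommendation).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_fraction_used(planned_downtime_minutes: float, slo: float,
                         window_days: int = 30) -> float:
    """Share of the window's error budget a planned exercise would consume."""
    return planned_downtime_minutes / error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    used = budget_fraction_used(10, 0.999)
    print(f"Budget: {budget:.1f} min; a 10 min exercise uses {used:.0%} of it")
```

Under these assumed numbers the monthly budget is roughly 43 minutes, so a ten-minute production exercise would spend about a quarter of it. That is the figure to put next to your estimate of what it would cost to build a faithful proxy environment.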
Wherever you end up simulating some breakage, you should devote some time up front to planning your event. I suggest treating your early efforts here more like dress rehearsals for your runbooks. I think for most teams, someone has penned a “what to do if the database fails” document, even if the team has never actually seen that occur. Have a few of those in your stable? They are great options for your first Game Day exercise.
What I like about scenarios like this, whether someone has devoted time to writing a document or not, is they are filled with untested assumptions. Assumptions about how systems are going to behave, how long particular operations will take, which metrics we should be mindful of. Testing and verifying those assumptions is a desired outcome of a Game Day. Along the way you get the added benefit of testing the procedure, or suggested actions present within the scenario. Left untested, you will never know until it is way too late how many of those assumptions are true—or badly misleading.
At Netflix, Game Days run regularly, with a wide variety of chaotic simulations in play. For your first run, don’t fire up Chaos Kong and expect things to go well. Look for some risky scenarios, things that are new or poorly tested. A good first start would be to develop two, maybe three scenarios as candidates. Some criteria to consider for scenario selections:
Surviving your first Game Day (and having enough to show for it to justify a second) is all about containing the actual amount of chaos you’re going to introduce. You can already imagine all the exciting ways something like this can go wrong; your job is to ensure it doesn’t.
Spread the exercise over two sprints: preparation and execution. Execution is the relatively easy step; really all you’re doing is blocking the day or afternoon as unavailable in that sprint. Preparation can and should take up a decent chunk of an iteration.
Having developed a couple of candidate scenarios, work through those details. If there are pre-conditions to trigger the incident, get those in place. If there are specific team members participating, ensure they will be scheduled for on-call during the planned time. If you need assistance from other teams to ensure you’ve successfully recovered, get that lined up for The Big Day.
Opinions vary, but I’d encourage a first-time team to discuss the candidate scenarios beforehand. Your goal here is to create a small event with valuable learnings, not put the team in trouble. You can retain surprise by varying which scenario is enacted, and the precise timing. On the day of, invoke your favorite random number generator (dice/code/hat), pick a scenario, and fire away!
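If you go the "code" route for your random number generator, a sketch along these lines would do. The function name, the scenario names, and the two-hour window are all hypothetical; the only idea taken from above is that both the scenario and the exact start time are drawn at random from options the team has already discussed.

```python
import random
from datetime import datetime, timedelta


def pick_game_day(scenarios, window_start, window_minutes, rng=None):
    """Pick one candidate scenario and a start time inside the reserved window.

    rng is an optional random.Random so a facilitator can reproduce a draw.
    """
    if not scenarios:
        raise ValueError("need at least one candidate scenario")
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    if rng is None:
        rng = random.Random()
    scenario = rng.choice(scenarios)
    start = window_start + timedelta(minutes=rng.randrange(window_minutes))
    return scenario, start


if __name__ == "__main__":
    # Hypothetical candidates; use the scenarios your team prepared.
    candidates = ["database failure", "cache cluster loss", "expired certificate"]
    scenario, start = pick_game_day(candidates, datetime(2018, 1, 15, 13, 0), 120)
    print(f"Scenario: {scenario}, starting at {start:%H:%M}")
```

Have whoever is facilitating run it, and keep the output to themselves until the moment arrives.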
Whatever your Game Day runtime, make sure to reserve some solid Post-Incident Review time for the group. Order pizza and give everyone time to relax and review and discuss openly how the Game Day went. Unlike real live-fire exercises, this time can be anticipated and accounted for in your planning, so make the most of it!
The clearest metric we can all agree to in Incident Management is Time to Resolve. Certainly teams who practice Game Day exercises should expect to see that metric decreasing as they practice increasingly complex failure scenarios. So too should the beginning team expect to see reductions in the time they spend resolving issues, but I would expect this to be focused in scenarios that are most like the ones you’ve exercised. So, don’t look to Game Days as a simple way to reduce MTTR; it’s a strategic play, one that takes time to impact all cases of Incident Management.
I think a different metric to watch, and expect good things from, is the expected vs. actual time spent resolving any specific scenario. Teams just starting out in this practice tend to deeply underestimate the time they collectively spend in Response and Remediation. Estimates for execution of infrequently used tools, like recovery systems, are always going to be off by a wide margin.
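One simple way to track this is the ratio of actual to estimated time per scenario. The sketch below is only an illustration: the function names, the record layout, and the sample numbers are made up, not measurements from any real Game Day.

```python
def estimate_ratio(estimated_minutes: float, actual_minutes: float) -> float:
    """Actual time divided by estimated time; above 1.0 means an underestimate."""
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    return actual_minutes / estimated_minutes


def summarize(runs):
    """Map each scenario name to its ratio.

    runs is a list of (scenario, estimated_minutes, actual_minutes) tuples.
    """
    return {name: estimate_ratio(est, act) for name, est, act in runs}


if __name__ == "__main__":
    # Illustrative numbers only.
    runs = [("database restore", 30, 90), ("cache rebuild", 20, 50)]
    for name, ratio in summarize(runs).items():
        print(f"{name}: took {ratio:.1f}x the estimate")
```

Recording this after each exercise gives you a trend to watch: as the team rehearses a scenario, the ratio should drift toward 1.0.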
Perhaps the most important outcome you should watch for early efforts here is one that’s a little tough to measure: teamwork. Collaboration is hard, as any DevOps team will tell you. Collaboration under stress and time constraints is really hard—late nights with lots on the line tend to bring out the worst in humans. This is why teams need to practice before things catch on fire.
Understanding your teammates’ approaches, biases, and methods is a key component of an effective Incident Management team. We develop trust and respect for each other by working together, helping each other, and sharing efforts to solve problems. This is the primary success metric to observe as you tackle Game Days: a foundation of collaboration and trust upon which to grow your Incident Management team.