Preparation is the key to effective on-call management and faster incident remediation. From our State of On-Call Report, we found that incident response, on average, accounts for 73% of an incident’s lifecycle. So, you can see how preparing people for incident response can reduce the costs of downtime and lead to highly available systems.
This is where game days come in. Game days are an offshoot of chaos engineering designed to test and improve the human experience of incident response. Through artificial scenarios, you can trigger some type of alert, and see how an on-call person would respond to the issue. By going through the actual steps of diagnosing, responding to, and remediating an incident, you’ll expose ways to better prepare for the future.
My First Game Day
Recently, I had the pleasure of joining in on my first game day. As you can imagine, as a content marketer, I'd never been put on-call for our own systems. Of course, I work inside the VictorOps platform quite often and spend lots of time researching on-call, DevOps, SRE, etc. to better empathize with our customers, but I'd never truly felt the pressure to remediate an incident in real-time.
So, for me, it was an enlightening experience to actually take on-call responsibilities. The scenario also showed how an on-call engineer with no exposure to the system involved would respond to a triggered alert. The game day allowed me to witness the difficulties of on-call and see firsthand how VictorOps can actually make on-call suck less. It was also a ton of fun.
Overview of the Game Day
In the game day, our web development team set up a scenario to simulate a broken content build: an error would be deliberately caused when attempting to deploy new content to the website through our content management system. The error would send an alert to VictorOps, and we would be notified through our personal paging policies. The scenario ran twice, once with me as the on-call responder and once with my colleague on-call for the same issue.
The game day was designed to show us a number of things:
- Ways to improve visibility for content deployments and errors
- Expectations held by new on-call responders
- The effectiveness of the associated runbooks
- The importance of system exposure when on-call
For the game day scenarios, we had three observers, an on-call responder, a dungeon master administering the chaos, and a subject matter expert–a developer from our web team–in case the game day’s incident required escalation. Our developer was working remotely, making it so we couldn’t simply ask him for help if we ran into trouble. It forced us to work completely within VictorOps and our content management system to fix the problem.
What Happened in the Game Day?
When the game day came around, I stepped into the world of on-call. The content build failed and I received an SMS message saying that an alert had been triggered. I quickly acknowledged the alert, logged into VictorOps, and started to look at what happened. Even in a fake scenario, I felt pressure to rapidly respond to and remediate the incident.
Right off the bat, I saw how important it is to receive context with your alert. As I opened VictorOps to start addressing the alert, I clicked into the alert payload and looked to see if there were any annotations with the alert. Along with the alert payload, I saw an annotation with a link to a runbook wiki. I clicked the link and immediately realized I didn’t have access to the runbook.
Already, after less than a minute, I was looking for a new way to respond to the problem. With no step-by-step instructions and little understanding of what could have caused this alert, I ventured back to the incident details in VictorOps. In the alert payload, I saw a link to a page in our content management system. I clicked the link and was taken to a familiar page. (Note the importance of a recognizable system–I’ll discuss this later in further detail.)
The page held a graphic asset that was archived and nested inside of a blog post. I immediately recognized this as the issue. Because of my previous work in the content management system, I knew builds would fail if a nested asset were unpublished, archived, or in draft mode. I unarchived the asset, tested the build again, and it appeared to work.
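The rule I relied on here (a build fails when a nested asset is unpublished, archived, or in draft mode) is the kind of check that could run before a deploy. What follows is only a minimal sketch of that idea in Python, not how our content management system or build actually works. The `Asset` class, its `status` values, and the `find_blocking_assets` helper are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Statuses that, per the rule described above, cause a build to fail
# when they appear on a nested asset. The string values are assumptions.
BLOCKING_STATUSES = {"unpublished", "archived", "draft"}


@dataclass
class Asset:
    """Hypothetical stand-in for a piece of content in a CMS."""
    name: str
    status: str  # e.g. "published", "unpublished", "archived", "draft"
    nested: List["Asset"] = field(default_factory=list)


def find_blocking_assets(entry: Asset, trail: Tuple[str, ...] = ()) -> List[Tuple[str, str]]:
    """Walk the nested assets of an entry and return (path, status)
    for each one whose status would block a build."""
    trail = trail + (entry.name,)
    blocking = []
    for child in entry.nested:
        if child.status in BLOCKING_STATUSES:
            blocking.append((" > ".join(trail + (child.name,)), child.status))
        # Keep walking: assets can be nested more than one level deep.
        blocking.extend(find_blocking_assets(child, trail))
    return blocking


# Recreate the game day scenario: an archived graphic nested in a blog post.
graphic = Asset("hero-graphic", "archived")
post = Asset("blog-post", "published", nested=[graphic])

for path, status in find_blocking_assets(post):
    print(f"Build would fail: {path} is {status}")
```

A check like this could also put the offending asset's name directly into the alert, which would have pointed me at the archived graphic without any prior knowledge of the system.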
My Colleague’s Game Day Experience
Much of my colleague’s experience with the game day was the same as mine, but there were still a couple of key differences that taught us a lot. For one thing, my colleague was able to get access to the runbook when looking into the incident detail pane. But one of the most interesting findings of the day was that runbook access actually slowed down her response.
Runbooks are beneficial when they are organized, clear, and succinct. But in this case, my colleague would’ve been able to diagnose the problem and resolve the incident more quickly if a runbook were not available. She immediately unarchived the asset to fix the problem, but then became confused by the runbook, and rearchived the asset. Without a runbook, she would have trusted her instincts and historical knowledge of the system in order to fix the problem.
Eventually, my colleague solved the problem. But incident remediation was slowed by a convoluted runbook and by uncertainty about how to tell when the incident was fully resolved. Game days such as this can show weaknesses in your runbooks and surface ways to improve incident visibility.
What I Learned
There was a lot of fun (and stress–even in a fake scenario) involved with the game day. Over the course of my time at VictorOps, I’ve loved learning about chaos engineering and the usefulness of DevOps. So for me, it was an awesome experience to take on-call responsibilities and see why what we do is so important. I want to close this post with a few key things I learned from the experience:
When it comes to incident response, I can’t stress enough the importance of exposure to the systems you’re working to fix. Because of my understanding of our content management system, I was able to address the issue quickly–even without access to a runbook.
Runbooks need to be clear and well-formatted. In our game day, our runbook was actually too detailed. It contained too much information for my colleague to sift through, which kept her from remediating the problem easily.
Alert context is a requirement. If I had simply received an alert without any kind of alert payload or annotations, I would have had no idea where to start. I’d have been stressed out and confused, and I would’ve involved colleagues in issues I could fix myself.
Real-time chat is extremely useful, whether it’s native or through an integrated chat tool. My biggest problem throughout the whole game day was finding a clear indicator that the incident had been resolved. Luckily, I could simply tag my subject matter expert on the alert and he was able to get back to me almost immediately.
Post-incident reviews are essential. After the game day, we all had a roundtable discussion about how we could improve the on-call process–many of my learnings came from this discussion. We identified weaknesses in the runbook, areas of confusion in the incident response process, and the resources someone would need to resolve this issue if they’d never been exposed to the system they’re looking at.
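To make the point about alert context concrete, here is a rough Python sketch of a check that flags alerts missing the context a responder needs. It is an illustration under assumptions, not the VictorOps alert format: the field names (`state_message`, `runbook_url`, `resource_url`), the example URL, and the `missing_context` helper are all hypothetical.

```python
# Context fields a responder would want on every alert.
# These field names are assumptions made for this sketch.
REQUIRED_CONTEXT = ("state_message", "runbook_url", "resource_url")


def missing_context(alert: dict) -> list:
    """Return the names of required context fields that are absent or blank."""
    return [
        key
        for key in REQUIRED_CONTEXT
        if not str(alert.get(key) or "").strip()
    ]


# A hypothetical alert for the failed content build, with no runbook link.
alert = {
    "entity_id": "content-build",
    "state_message": "Content build failed during deploy",
    "resource_url": "https://cms.example.com/entries/blog-post",
}

gaps = missing_context(alert)
if gaps:
    print("Alert is missing context:", ", ".join(gaps))
```

Note that a check like this only confirms a link is present. It can't tell you whether the responder is actually able to open the runbook, which is exactly the gap our game day exposed.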
On-call sucks–but we make it suck less. In VictorOps, you can leverage our rules engine to automatically transform alerts, set on-call schedules, automatically route and escalate issues, chat, and collaborate in-line with incident details. Sign up for a 14-day free trial to see for yourself how you can start improving incident management.