Mike Meredith - May 12, 2015
At VictorOps, we’re always thinking about ways to improve not just on-call management, but the whole incident lifecycle. Our platform gives you a great set of tools for managing incidents, and the right practices can multiply the value of those tools. I’m going to talk more about the role of the Incident Commander, and how it should be a key part of your Incident Response strategy.
In Emergency Services, the “Incident Commander” is typically the first qualified person on the scene of an incident (say, a house fire). That person is responsible for organizing the incident response, maintaining communication with their chain of command and other authorities, and ensuring that operations are conducted safely and to plan. Sometimes the role of Incident Commander is handed off to a higher-ranking officer or someone from a different agency, but often an Incident Commander will manage the entire response, even if higher-ranking officers are on the scene.
Now, we’re big believers in spreading the on-call responsibility around to the whole DevOps team here at VictorOps. It gives everyone a deeper level of engagement with the platform we’re building, and in our case, it’s also an exercise in using our own product as a customer would use it. Of course, there are many specializations in the world of DevOps, and no one who goes on-call is equipped to solve every problem. That’s where the Incident Commander role comes in. Like the first responder at a house fire, the person who responds to the incident becomes the Incident Commander. Even if the problem lies outside of that person’s expertise, he or she has a critical role to play in the incident lifecycle.
We’ve all probably seen situations where the effort to fix a problem stalls out, because no one thinks it’s their responsibility. Or maybe two people both believe they’re waiting on the other to complete some task. If there are multiple people involved with fixing a problem, an Incident Commander manages the effort, and keeps these logjams from forming. That might mean generating a list of actions to be taken and who needs to take them, and ensuring they get done in order. It might mean facilitating communication between teams, or just helping everyone remember who’s doing what. Anyone on the team can fulfill this role, and in doing so, he or she will get a front-line view of what’s happening on the platform, and how all of the pieces work together. It’s a great way to broaden your horizons beyond the scope of your daily work.
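As a rough illustration of the "who's doing what" list described above, here is a minimal Python sketch. It is not part of the VictorOps platform; the `Action` class, the `unowned` and `next_up` helpers, and the example tasks and names are all hypothetical, and it assumes the list is kept in the order the work needs to happen.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Action:
    description: str
    owner: Optional[str] = None  # None means nobody has claimed it yet
    done: bool = False


def unowned(actions: List[Action]) -> List[Action]:
    """Open actions nobody has claimed: a common source of stalls."""
    return [a for a in actions if not a.done and a.owner is None]


def next_up(actions: List[Action]) -> Optional[Action]:
    """First open action, assuming the list is kept in working order."""
    for action in actions:
        if not action.done:
            return action
    return None


# Hypothetical incident: the tasks and names are made up for illustration.
plan = [
    Action("Fail over to the standby database", owner="alice", done=True),
    Action("Restart the API workers", owner="bob"),
    Action("Update the status page"),
]

for action in unowned(plan):
    print(f"Needs an owner: {action.description}")

upcoming = next_up(plan)
if upcoming is not None:
    print(f"Next: {upcoming.description} ({upcoming.owner or 'unassigned'})")
```

Even a list this simple makes the two failure modes visible: tasks that no one owns, and the task everyone is waiting on.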
One of the most important things to do during an incident is to capture data. That philosophy is built into the core of the VictorOps platform. Recording and remembering what happened, what was done, and what the state looked like when something breaks has critical benefits. But human intervention is required to make sure the data that gets captured is coherent and useful. And not every interaction gets captured with technology.
Control calls and in-office discussions are required when several people need to make a decision quickly, but verbal conversations don’t end up in the VictorOps timeline. The Incident Commander takes notes on what’s happening and being discussed, and records decisions that are made and actions that are taken. Afterward, he or she builds a post-mortem report that has a clear narrative of what happened, and what was done to resolve the incident.
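The note-taking described above can be as simple as a timestamped list that is later sorted into a narrative. The following Python sketch shows one way to do that, under stated assumptions: `IncidentLog`, `Entry`, the three entry kinds, and the sample incident are hypothetical and do not represent any VictorOps API or report format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Assumed categories, taken from the text: notes, decisions, and actions.
KINDS = ("note", "decision", "action")


@dataclass
class Entry:
    when: datetime
    kind: str
    text: str
    who: str = ""


@dataclass
class IncidentLog:
    title: str
    entries: List[Entry] = field(default_factory=list)

    def record(self, when: datetime, kind: str, text: str, who: str = "") -> None:
        """Add one entry; entries may be recorded out of order."""
        if kind not in KINDS:
            raise ValueError(f"unknown entry kind: {kind!r}")
        self.entries.append(Entry(when, kind, text, who))

    def post_mortem(self) -> str:
        """Render the entries as a chronological plain-text narrative."""
        lines = [f"Post-mortem: {self.title}"]
        for e in sorted(self.entries, key=lambda e: e.when):
            who = f" ({e.who})" if e.who else ""
            lines.append(f"{e.when:%H:%M} [{e.kind}]{who} {e.text}")
        return "\n".join(lines)


# Hypothetical incident, with entries written down as they were remembered.
log = IncidentLog("API latency spike")
log.record(datetime(2015, 5, 12, 14, 10), "decision",
           "Roll back the latest deploy", who="alice")
log.record(datetime(2015, 5, 12, 14, 2), "note",
           "Call started; error rate climbing")
log.record(datetime(2015, 5, 12, 14, 25), "action",
           "Rollback complete", who="bob")
print(log.post_mortem())
```

Sorting at report time rather than at entry time lets the Incident Commander jot things down in whatever order they come up during a fast-moving call.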
So, you can see, it’s not necessary to be an expert in the problem at hand in order to be an effective Incident Commander. Taking on the Incident Commander role allows the people who do have the expertise to concentrate on diagnostics and repair, knowing that someone else is watching the big picture. In fact, it may make sense to hand off the Incident Commander role to a more junior teammate if the first responder has his or her hands full doing technical troubleshooting. On the opposite end of the spectrum, if an incident is going to take hours to resolve but involves low-intensity work, it may make sense for the engineer doing the work to take over as Incident Commander and let others get their rest.
When you’re bringing developers into the on-call pool, you’ll frequently hear one concern from them: since they don’t work in operations day-to-day, making them first responders just delays notifying the person who will actually fix the problem, without providing any other benefit. Defining the first responder as the Incident Commander answers this concern by putting them in a critical contributing role that helps the response instead of hindering it. It also ensures that incidents are worked through to resolution, shortening MTTR and improving the quality of your product.