VictorOps is now Splunk On-Call.
Taking on-call responsibilities, responding to incidents, and remediating those incidents all take organization and preparation. Without a plan for handling an alert when it comes in, your team will be left in the dark. That’s where the incident commander steps in. An incident commander is the first qualified responder to an incident, with the expertise required to understand the situation and take the proper steps toward quickly fixing the problem.
The incident commander should essentially act as a project manager for incident management. But, in a DevOps culture of collaboration and transparency, the incident commander need not be limited to one individual role. In DevOps, with shared on-call responsibility between developers and ops teams, there should be a number of people on your team who can act as an incident commander.
Really, the incident commander will always be the first person taking the first action toward incident remediation. By creating a system for intelligent alert routing that provides incident context quickly, nearly anyone on your team should be capable of acting as the incident commander.
So, let’s break down the responsibilities of an incident commander.
Incident commanders need to be quick on their feet, and they need integrated tools for on-call scheduling, alert routing, communication, and monitoring data. Anything that adds context to alerts, delivers that context to your team faster, or automates tedious manual processes will help an incident commander.
An incident commander’s ultimate responsibility is to understand an alert, assemble the proper team, and figure out how to remediate an incident as quickly as possible.
Through continuous improvement over time, life as an incident commander should get easier. With each new incident you’ve responded to, you’ll continue to learn about your system and identify areas for improvement. But, knowing an incident commander’s responsibilities beforehand will help you clearly define a process for speeding up incident response and remediation.
The first responsibility in the incident commander role is actually deciding who the incident commander will be. At first, this may seem odd. But the first on-call responder may not have the expertise to understand the specific incident, in which case they are not best suited to act as the incident commander. Ideally, the first person to receive an alert becomes the incident commander, but this won’t always be the case. At a minimum, the initial on-call engineer should be able to route the alert to the person who should become the incident commander.
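The hand-off described above can be sketched in a few lines of Python. This is only an illustration, not how any particular tool works: the `Responder` class, the `expertise` tags, and the `assign_commander` function are all hypothetical names, and the sketch assumes each alert is labeled with the service it concerns.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Responder:
    """A hypothetical on-call team member and the services they know well."""
    name: str
    expertise: set[str] = field(default_factory=set)


def assign_commander(service: str, first_responder: Responder,
                     roster: list[Responder]) -> Responder:
    """Pick an incident commander for an alert about `service`."""
    # Ideal case: the first person paged already has the right expertise.
    if service in first_responder.expertise:
        return first_responder
    # Otherwise, route to the first teammate on the roster who does.
    for responder in roster:
        if service in responder.expertise:
            return responder
    # Nobody matches: the first responder keeps command by default.
    return first_responder
```

For example, if the first responder only knows the `search` service and a teammate knows `billing`, an alert tagged `billing` would be handed to that teammate, while an alert for a service nobody knows stays with the first responder until someone better suited is found.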
Identifying the problem is possibly the most important responsibility of an incident commander. Remember, the incident commander isn’t always going to be the person who’s actually remediating the incident, but they will be the first point of contact. So, the faster the incident commander can understand the context of an alert, the faster they can begin working on a resolution strategy.
Think of it this way: an incident commander without alert context is like an architect without building plans, or without an architecture degree. The incident commander not only needs to understand why an incident is affecting the system, but also needs to be capable of determining the next course of action.
Through automation, collaboration tools, and constant upkeep of your monitoring systems, you can arm the incident commander with the information they need to understand the incident. Then, with an understanding of what’s wrong, they can either escalate the issue, route the alert to the right person or team, or begin resolving the incident themselves.
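As a rough sketch of what that could look like in practice, the Python below attaches context to an alert and then picks one of the three next steps. Everything here is an assumption made for illustration: the `SERVICE_CONTEXT` table, the service names, the runbook paths, and the `enrich_alert` and `next_action` functions are hypothetical, and a real setup would pull this data from your monitoring and on-call tools.

```python
from __future__ import annotations

# Hypothetical lookup table: which team owns each service and where its runbook lives.
SERVICE_CONTEXT = {
    "checkout-api": {"team": "payments", "runbook": "runbooks/checkout-api.md"},
    "search": {"team": "discovery", "runbook": "runbooks/search.md"},
}


def enrich_alert(alert: dict) -> dict:
    """Return a copy of the alert with owning team and runbook added, if known."""
    context = SERVICE_CONTEXT.get(alert["service"], {})
    return {**alert, **context}


def next_action(alert: dict, commander_team: str) -> str:
    """Decide whether the commander should resolve, route, or escalate."""
    owning_team = alert.get("team")
    if owning_team is None:
        # No known owner: nobody obvious to route to, so escalate.
        return "escalate"
    if owning_team == commander_team:
        # The commander's own team owns the service: start resolving.
        return "resolve"
    # Another team owns the service: route the alert to them.
    return "route"
```

An alert such as `{"service": "search", "message": "latency high"}` would come back from `enrich_alert` with the owning team and runbook attached, so the commander can decide on a next step without hunting for that information first.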
Once the incident commander identifies what might be happening in the system, they can start building a course of action. Typically, the first step is to open lines of communication with the team members who need to get involved. This responsibility is why we preach the importance of a centralized incident management tool. By bringing alert context in line with communication tools, you can loop in teammates and get them up to speed on the incident quickly.
Now the floodgates are open. You’ve got the right people working on the issue, but what other information does your team need to speed up incident resolution? The incident commander should remain heavily involved in chat and continue to actively assist in resolving the issue until it’s finished. The incident commander needs to have visibility into an incident’s status and act as a resource for providing their team with additional incident details or alert information as they need it.
Because the incident commander is privy to everything happening during the incident, they can update key stakeholders. The incident commander can help keep the status page updated and let end users know what’s going on. At the same time, upper-level management has a vested interest in maintaining uptime and will likely want visibility into incidents and how they affect the bottom line. Incident commanders can help create transparency in your incident management while helping you build trust with both internal teams and end users.
Depending on your organizational structure, the incident commander doesn’t necessarily have to lead the post-incident review, but they should attend any post-incident review of an incident in which they were involved. The incident commander has the most visibility into an incident from beginning to end, so their perspective is essential to the review.
Conducting detailed post-incident reviews will prepare your team for the next time a similar incident happens–helping you mitigate downtime and alert fatigue, and allowing you to build more reliable systems.
Don’t let being an incident commander scare you. The responsibilities outlined above will make on-call suck less and help you determine a process for making each incident just a little bit easier than the last. Although you can’t prepare for every possible incident or scenario, understanding an incident commander’s role in response will help you navigate stressful on-call situations.
Arm your on-call teams with the tools and confidence they need to handle incidents on their own terms. Incident commanders can help you quickly organize incident response and keep the team informed and productive throughout the entire incident management process.
VictorOps is purpose-built to centralize incident management workflows and data to make on-call suck less. Try a 14-day free trial to see for yourself why incident commanders love VictorOps.