Runbooks and playbooks are essential to driving collaborative on-call response and incident management. Especially as teams scale, new on-call teammates need to have resources that make on-call suck less. Runbooks are one of the best ways to provide incident response and remediation instructions to on-call responders. When you automate those runbooks to appear in-line with alert context and chat tools, your team can collaborate more quickly and find the information they need, when they need it.
A runbook or a playbook should provide the on-call responder or incident commander with the information they need to start working toward resolving an incident. For most simple resolutions, the runbook can simply show the exact steps required for fixing the problem. If the issue is a little more complex, sometimes a runbook showing where to escalate the incident may be more efficient.
Determining the what, how, who, and where of the incident and then feeding that information into the corresponding documentation should be required for any runbook or playbook. In previous posts, we’ve referred to the bare minimum for creating a runbook as a Minimum Viable Runbook.
The what of a runbook should help the on-call responder know exactly what the issue is. The faster you can help identify the exact problem, the faster the team can jump on it. Is it a network error? Is server space maxing out? Any well-built runbook will almost immediately address what’s likely going on and help the on-call engineer start navigating the incident management process right out of the gate.
The how will help the first responder determine the first action to take toward resolving the incident. What are the first actionable steps an on-call team or individual can take toward a resolution? Instead of diving straight into logs and investigation, an intelligently structured runbook may be able to simply tell you exactly how to remediate an incident. Because the runbook quickly identifies the what and the how, you can start taking action sooner.
Now, the what and the how should always be the first two things included in the runbook. But the on-call responder may not always be the correct person to simply hop on a problem and fix it. The who tells you who needs to be involved with the incident. Is it a team? Is it a specific individual? This way, you can quickly escalate the incident to the right person or team instead of spending time working out who should be responding to the issue.
The where can show you exactly where the problem is and where the resolution needs to take place. What specific applications, tools, servers, etc. need to be looked at for the problem at hand? Where will the team be communicating around the incident? The entire team should know exactly where every part of the incident workflow will take place for a given incident.
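To make the what, how, who, and where concrete, here is a minimal sketch of how a Minimum Viable Runbook could be modeled as a data structure. The `Runbook` class, its field names, and the example values are all hypothetical and used only for illustration; they are not a VictorOps format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    """Hypothetical minimum viable runbook: one field per question."""

    what: str                 # what the issue most likely is
    how: List[str]            # ordered first steps toward remediation
    who: str                  # team or person to escalate to
    where: List[str] = field(default_factory=list)  # affected systems, chat channel

    def missing_sections(self) -> List[str]:
        """Return the names of any sections that are still empty."""
        sections = {
            "what": self.what,
            "how": self.how,
            "who": self.who,
            "where": self.where,
        }
        return [name for name, value in sections.items() if not value]

    def render(self) -> str:
        """Format the runbook as plain text for an alert or chat message."""
        steps = "\n".join(
            f"  {i}. {step}" for i, step in enumerate(self.how, start=1)
        )
        return (
            f"WHAT:  {self.what}\n"
            f"HOW:\n{steps}\n"
            f"WHO:   {self.who}\n"
            f"WHERE: {', '.join(self.where)}"
        )


# Example values are made up for illustration.
disk_full = Runbook(
    what="Disk usage above 90% on a database host",
    how=["Check which directory is growing", "Rotate or archive old logs"],
    who="database on-call team",
    where=["db hosts", "#incident-db chat channel"],
)
print(disk_full.render())
```

A check like `missing_sections()` is one way to stop a runbook from being published before all four questions are answered.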
While having a centralized location full of manually stored runbooks is a great start, automatically serving runbooks to the on-call engineer is even better. The faster you can surface context and provide instructions to an incident’s first responder, the faster you can actually start fixing the problem.
Our State of On-Call Report found that, on average, 73% of an incident’s lifecycle is spent in the incident response phase. So, finding ways to automate processes, improve collaboration, and shorten the incident lifecycle should be your first priority. Runbooks, especially automated ones, are among the most effective ways to improve a team’s overall incident management productivity.
In association with on-call schedules, runbook automation can be used to automatically serve an on-call engineer with runbooks based on specific alert identifiers. This way, the correct runbook is served to the correct responder at the same time as the alert context. So, the on-call engineer doesn’t need to work in different locations or spend time finding the data or instructions they need for remediating the incident. Runbook automation drives operational efficiency and helps on-call teams better understand any issue placed in front of them.
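As a rough illustration of serving runbooks based on alert identifiers, the sketch below matches an incoming alert against a small rule table and attaches a runbook link to it. The `alert_id` field, the rule patterns, and the `wiki.example.com` URLs are assumptions made up for this example; a real alerting tool will have its own rule syntax and payload fields.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (glob pattern on the alert identifier, runbook link).
# Rules are checked in order and the first match wins.
RULES = [
    ("db-*.disk_usage", "https://wiki.example.com/runbooks/disk-full"),
    ("*.network.*", "https://wiki.example.com/runbooks/network-errors"),
]

# Fallback for alerts that no rule covers.
DEFAULT_RUNBOOK = "https://wiki.example.com/runbooks/general-triage"


def attach_runbook(alert: dict) -> dict:
    """Return a copy of the alert with a 'runbook_url' chosen by the rules."""
    alert_id = alert.get("alert_id", "")
    for pattern, url in RULES:
        if fnmatchcase(alert_id, pattern):
            return {**alert, "runbook_url": url}
    return {**alert, "runbook_url": DEFAULT_RUNBOOK}


# The responder receives the alert context and the runbook link together.
enriched = attach_runbook({"alert_id": "db-7.disk_usage", "severity": "critical"})
print(enriched["runbook_url"])
```

Keeping the rules in an ordered list means more specific patterns can be placed ahead of broad ones, and the default link ensures every alert arrives with at least a general triage runbook.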
A solution centralizing runbook automation, chat, and alert data is the most efficient way to help create a better human-centric on-call experience. Without context-switching, people are better at quickly identifying a problem, diagnosing what’s wrong, and working toward a resolution. A centralized timeline of chat and alert data, in association with runbook automation, leads to higher visibility across multiple teams and makes incident workflows easier.
Of course, the first step of using runbooks is creating them. Then, you can set rules in order to automate runbooks for specific alerts and incidents. But, your work isn’t done there. With the complexity of systems and the ever-increasing speed of CI/CD, you need to continuously update current runbooks and create new ones. So, creating a company culture dedicated to collaboration and continuous improvement is also highly important for maintaining actionable runbooks.
As your team continues to talk and offers more visibility into workflows, you’ll expose weaknesses in your service or infrastructure, helping you find areas for improvement. This logic also applies to runbook creation and automation. You can’t simply create a runbook, set up automation rules one single time and walk away. Monitoring thresholds need to be constantly readdressed, runbooks need to be updated in association with updates to your infrastructure or application, and new automation rules need to be set up.
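One way to keep runbooks from drifting out of date is a periodic check that flags any runbook not reviewed recently, or not reviewed since its service last changed. The sketch below assumes each runbook record carries `name`, `last_reviewed`, and `service_changed` fields and uses a 90-day review window; those names and the window are illustrative assumptions, not a recommendation from the report.

```python
from datetime import date, timedelta


def stale_runbooks(runbooks, today, max_age_days=90):
    """Return names of runbooks that are due for review.

    A runbook is flagged when its last review is older than max_age_days,
    or when the service it covers changed after the last review.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for runbook in runbooks:
        too_old = runbook["last_reviewed"] < cutoff
        outdated = runbook["last_reviewed"] < runbook["service_changed"]
        if too_old or outdated:
            stale.append(runbook["name"])
    return stale


# Example records with made-up dates.
catalog = [
    {"name": "disk-full", "last_reviewed": date(2024, 5, 1),
     "service_changed": date(2024, 4, 1)},
    {"name": "network-errors", "last_reviewed": date(2024, 1, 10),
     "service_changed": date(2023, 12, 1)},
    {"name": "queue-backlog", "last_reviewed": date(2024, 5, 1),
     "service_changed": date(2024, 5, 20)},
]
print(stale_runbooks(catalog, today=date(2024, 6, 1)))
```

Running a check like this on a schedule, or as part of a deploy pipeline, gives the team a concrete list of runbooks to revisit.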
Don’t get complacent with runbook automation and updates. Runbooks are only as useful as the team using them and the information contained inside of them. Collaborative teams will leverage runbooks more effectively and work together to keep making runbooks better. Continuously improving the content of your runbooks and the automation rules around serving them is essential for collaboration, operational efficiency, and speedy incident management.
VictorOps provides a centralized dashboard for alert context, communication, and actionable runbook automation. Sign up for a 14-day free trial to start collaborating around incidents and automating runbooks, alert routing, and on-call scheduling, all in one place.