The Guide to Troubleshooting With Runbooks

Dan Holloran July 31, 2019

Runbooks and playbooks are maintained as a standardized set of instructions for identifying and resolving incidents in IT service management (ITSM) and DevOps. On-call responders in both traditional IT and software development teams can leverage automation and runbooks to improve the speed of incident response and remediation. By surfacing useful directions and wiki pages with context earlier in the incident lifecycle, on-call teams are instantly ready to jump into action.

Most DevOps and IT teams maintain some form of incident response documentation, but the level of detail, automation and integration with alerting tools varies. Without runbooks, on-call teams spend more time figuring out why an incident is happening and where to go to remediate it. Alert-based runbook automation and continuous updates to the documentation put helpful instructions in front of on-call responders when they need them – lowering mean time to acknowledge and mean time to resolve (MTTA/MTTR) and making on-call suck less.

So, we’ll go over the basics of runbooks and some steps for using them to improve incident troubleshooting.

What are runbooks?

For recurring issues in DevOps and IT, runbooks are the instructions for resolving those incidents. Runbooks are used to spread organizational knowledge as companies scale and more people need to take on-call responsibilities for services they didn’t write. They’re a way to surface instructions faster and give context to alert data in real-time. At 3 AM, the last thing you want is an on-call team shuffling through numerous tools and data, simply trying to identify the cause of an issue.

Runbooks served automatically alongside an alert can drastically reduce the time spent identifying and responding to incidents. Thorough runbook documentation can help on-call responders fix issues in systems they’ve never seen, or at least show them which services or tools they’ll need access to in order to fix the problem. And, if they don’t have access to those applications or services, they can easily reroute the issue to the right person.

While it can seem like a hassle to build out a runbook repository and keep runbooks up-to-date, it’ll pay large dividends toward service reliability and incident remediation speed over time.

Incident management and real-time response

Incident management in DevOps and IT needs to be approached holistically. Incident management is more than the detection and remediation of an incident – what happens afterward? How are you learning from previous incidents and giving on-call responders the resources they need to become better at their jobs? Conducting post-incident reviews and using those insights to improve alert routing rules and build or update runbooks will continuously improve a team’s on-call operations.

Then, when an incident strikes, the team has the information they need to handle the issue in real-time. This leads to fewer unnecessary escalations and a faster incident response time – driving less downtime and more positive customer experiences. Constant improvement to incident response will also improve the efficiency of the other steps in the incident lifecycle. As you become better at the response phase, you learn more about what you need to improve in incident detection, remediation, analysis and preparation.

Runbooks are just one small portion of an efficient incident management and real-time response strategy. In association with intelligent alert automation, collaboration tools and other useful alert annotations, runbooks make on-call responsibilities bearable. So, let’s take a look at some key steps for effective troubleshooting with runbooks.

Steps for effective troubleshooting with runbooks

1) Standardized runbook format and storage

Everyone on the team needs to know where runbooks are stored and the specific format of the runbook. That way, the on-call responder can quickly scan a runbook and find the specific information they need. If runbooks aren’t automatically attached to alerts in your incident management system, then your teammates will at least know where all of the documentation lives. Standardizing the format and location of runbooks gets on-call responders the information they need faster.
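As a minimal sketch of a standardized location, the snippet below derives a predictable path for every runbook from the service and the alert name. The `runbooks` root folder, the `.md` extension and the function names are assumptions made for illustration, not a convention prescribed by any particular tool.

```python
import re

RUNBOOK_ROOT = "runbooks"  # assumed top-level folder in a docs repo or wiki


def slugify(text: str) -> str:
    """Lowercase the text and collapse anything non-alphanumeric into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def runbook_path(service: str, alert_name: str, root: str = RUNBOOK_ROOT) -> str:
    """Build the one place a responder should look for this alert's runbook."""
    return f"{root}/{slugify(service)}/{slugify(alert_name)}.md"


print(runbook_path("Checkout API", "High Latency (p99)"))
```

With a rule like this, a responder who knows the service and the alert name can guess the runbook location even when no link is attached to the alert.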

2) What, how, who and where?

The answers to these four common questions should be in any runbook – what, how, who and where. What’s the issue? What systems are affected by the incident? Runbooks need to address the scope of the issue and the common reasons why an on-call engineer would be seeing a specific type of alert. Then, the runbook needs to give the engineer instructions for how to fix the problem. What actions have been taken in the past to quickly remediate this issue?

The who section can show which person or team should be involved with the incident and why they’re best equipped to handle the issue. Then, last but not least, runbooks always need to include where the issue is, the access required in order to fix a given problem and how teams normally communicate about this specific incident.
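One way to keep the four questions consistent across runbooks is to give each one a fixed slot. The sketch below is only an illustration: the `Runbook` class, its field names and the sample values are hypothetical and do not come from any VictorOps API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    """Hypothetical runbook record with one slot per question."""
    title: str
    what: str        # scope of the issue and the systems affected
    how: List[str]   # ordered remediation steps
    who: str         # person or team best equipped to handle it
    where: str       # where the issue lives and the access required
    channels: List[str] = field(default_factory=list)  # where the team communicates

    def render(self) -> str:
        """Return the runbook as plain text in a fixed section order."""
        steps = "\n".join(f"  {i}. {step}" for i, step in enumerate(self.how, start=1))
        lines = [
            self.title,
            f"WHAT: {self.what}",
            "HOW:",
            steps,
            f"WHO: {self.who}",
            f"WHERE: {self.where}",
        ]
        if self.channels:
            lines.append("CHANNELS: " + ", ".join(self.channels))
        return "\n".join(lines)


example = Runbook(
    title="High API latency",
    what="p99 latency on the checkout API exceeds its alert threshold.",
    how=["Check the latest deploy.", "Roll back if latency rose after it."],
    who="Payments on-call team",
    where="Checkout service; requires deploy-tool access.",
    channels=["#payments-incidents"],
)
print(example.render())
```

Because every runbook renders its sections in the same order, a responder always knows where on the page to look for the fix and for the owner.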

3) Requirements for each runbook

The what, how, who and where should be included in every runbook. But, what are the specific requirements for all of the runbooks built by your DevOps or IT team? Are there limitations to the length? Succinct, up-to-date runbooks are actually used by on-call teams and will lead to a more efficient incident response workflow. Laying out the basic requirements for every runbook will ensure runbooks remain useful and don’t become a redundant step in the incident management process.
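Requirements like these can be checked automatically before a runbook is published. The following is a rough sketch under stated assumptions: the section names mirror the four questions above, and the 400-word limit is an arbitrary placeholder for whatever length your team agrees on.

```python
from typing import Dict, List

REQUIRED_SECTIONS = ("what", "how", "who", "where")
MAX_WORDS = 400  # assumed length limit, not a recommended value


def check_runbook(sections: Dict[str, str], max_words: int = MAX_WORDS) -> List[str]:
    """Return a list of problems; an empty list means the runbook passes."""
    problems = []
    for name in REQUIRED_SECTIONS:
        if not sections.get(name, "").strip():
            problems.append(f"missing section: {name}")
    total_words = sum(len(text.split()) for text in sections.values())
    if total_words > max_words:
        problems.append(f"too long: {total_words} words (limit {max_words})")
    return problems


draft = {
    "what": "Disk usage above 90% on database hosts.",
    "how": "Rotate logs, then expand the volume if usage stays high.",
    "who": "",
    "where": "Database cluster; requires SSH access.",
}
print(check_runbook(draft))
```

A check like this could run in review or in a scheduled job, so that incomplete or bloated runbooks are caught before someone needs them during an incident.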

4) Integrated with chat and alert data

Once you’ve started to build out actionable runbooks and standardized the format and location of internal response documentation, you can start to leverage automation in your runbooks. Automatically appending runbooks to specific alerts and integrating those alerts with your chat tools can make real-time incident response much faster. Runbooks can be used alongside ChatOps tools like Hubot and Slack to surface incident context faster and provide runbooks to on-call responders as soon as they get the alert. Instead of spending time digging around monitoring tools and searching for the proper documentation, automation can immediately surface the correct instructions to the correct person.
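To make the idea concrete, here is a small sketch of appending a runbook link to an alert and formatting the text a chat integration might post. The `routing_key` field, the index of URLs and the message layout are all hypothetical, and the example stops short of calling any real chat or alerting API.

```python
from typing import Dict, Optional

# Hypothetical mapping from an alert's routing key to its runbook location.
RUNBOOK_INDEX: Dict[str, str] = {
    "checkout.latency": "https://wiki.example.com/runbooks/checkout-latency",
    "db.disk_usage": "https://wiki.example.com/runbooks/db-disk-usage",
}


def annotate_alert(alert: Dict[str, str], index: Dict[str, str] = RUNBOOK_INDEX) -> Dict[str, str]:
    """Return a copy of the alert with a runbook link added when one is known."""
    annotated = dict(alert)
    url: Optional[str] = index.get(alert.get("routing_key", ""))
    if url is not None:
        annotated["runbook_url"] = url
    return annotated


def chat_message(alert: Dict[str, str]) -> str:
    """Format the text a chat integration might post for this alert."""
    runbook = alert.get("runbook_url", "no runbook on file")
    return f"[{alert['severity']}] {alert['summary']} | runbook: {runbook}"


alert = {
    "routing_key": "db.disk_usage",
    "severity": "critical",
    "summary": "Disk at 93% on db-01",
}
print(chat_message(annotate_alert(alert)))
```

Alerts without a matching entry still go out, marked as having no runbook on file, which also gives the team a simple signal for which runbooks to write next.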

5) Update the documentation

Runbooks are no good if you don’t keep them updated. Every post-incident review process should include takeaway action items. If runbooks were a hindrance to quickly remediating an incident, they need to be updated or removed. What’s missing from your runbooks that could make on-call responders better at their jobs? Runbooks are only as useful as the information inside of them – so make sure you leverage past incident insights to make runbooks as helpful as possible.
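Alongside post-incident reviews, a simple staleness check can flag runbooks that nobody has looked at in a while. This is a sketch only: the 90-day review window, the runbook names and the review dates are made-up values for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List

REVIEW_INTERVAL = timedelta(days=90)  # assumed review window


def stale_runbooks(
    last_reviewed: Dict[str, date],
    today: date,
    interval: timedelta = REVIEW_INTERVAL,
) -> List[str]:
    """Return runbook names whose last review is older than the interval, oldest first."""
    overdue = [name for name, reviewed in last_reviewed.items() if today - reviewed > interval]
    return sorted(overdue, key=lambda name: last_reviewed[name])


reviews = {
    "checkout-latency": date(2019, 7, 1),
    "db-disk-usage": date(2019, 1, 15),
    "queue-backlog": date(2018, 11, 3),
}
print(stale_runbooks(reviews, today=date(2019, 7, 31)))
```

The resulting list can feed the action items of the next review, so that outdated runbooks are updated or removed rather than discovered mid-incident.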

On-call automation and making runbooks more available

Documentation and incident response instructions should never be hidden from on-call responders. In fact, transparency around communication and monitoring data is essential to continuous improvement. Automation around runbooks and the on-call process can surface context faster and improve the way people collaborate.

Automation can be put into nearly any process – from on-call scheduling to the execution of commands leading to incident resolution. Automated runbooks are a core component of any DevOps culture focused on collaboration, transparency and speed across the delivery and incident lifecycles. In a modern world of rapid software delivery, a prepared incident response plan is the only surefire way to mitigate major downtime. Automatically surfacing runbooks and incident context in a collaborative forum will drive efficient incident management and, in turn, more resilient services.

It’s true – VictorOps does make on-call suck less! Try a 14-day free trial or sign up for a personalized demo to see how DevOps and IT teams are reducing MTTA/MTTR through a centralized tool for on-call scheduling, collaboration and alert automation.
