Imagine. It’s 2 AM, you’re on-call, and an alert comes in for a part of the system you don’t normally work on. Naturally, you’re feeling stressed out and a little lost. This is where minimum viable runbooks (MVRs) come in. Minimum viable runbooks include need-to-know, easy-to-navigate instructions for remediating incidents.
Generally, runbooks provide step-by-step directions for diagnosing an issue, responding to and escalating the alert properly, and ultimately resolving the problem. Minimum viable runbooks make this process as efficient as possible by balancing the time spent creating the runbook against the need for clear, accurate instructions for incident response and resolution.
Developers and IT operations teams are working around the clock. While building and maintaining services 24/7, teams rarely have much time to create and update runbooks. But actionable runbooks make a big difference when it comes to maintaining uptime and making on-call suck less. Runbooks give context to on-call engineers and help teams become less reactive, creating a forward-thinking, proactive approach to incident management.
All in all, a minimum viable runbook is a set of instructions that returns the most valuable information for the time spent creating it. To create effective MVRs, you’ll need to learn from previous incidents and alerts, continuously improving the runbooks and making the process easier for on-call teams.
By centralizing incident data and collaboration in one place, you get visibility into incidents from start to finish. You can then leverage past incident data to adjust alert rules and escalations and to edit your runbooks. Use automation to put alert data and runbooks right in front of your on-call team when they need them.
A deep understanding of your monitoring and alerting setup will help you define when automation can be implemented to better surface runbooks and other appropriate incident context. Bringing monitoring, alerting, and incident response together in one place allows you to better understand incident workflows and optimize runbooks for the applicable situations.
Simply put, what’s the problem? The runbook should show the on-call person exactly where to look to figure out what’s going on. What monitoring tool should they look at? Is there a status page they can check to see what might be going on? The what focuses on the first question the responder needs to answer before they can start solving the problem.
How can you quickly fix the problem? The how is focused on finding the first action that needs to be taken to move toward remediating the problem. Should you reset a server? Do you need to find a way to increase capacity? The immediate context provided in a runbook should provide the on-call engineer with the necessary information to quickly start working on the resolution or escalate/route the alert to the proper person or team.
Which leads us to the who. Who needs to get involved in this issue? Can you remediate the problem yourself or do you need to bring in a number of other people to help you find a solution? The runbook should clearly state who needs to be notified of the problem and how you can bring them into the fold.
Where do you need to be to fix the problem? The where is a summary of the digital tools and locations necessary for the specific issue. Where is communication taking place? How are you recording notes, statuses, questions, etc.?
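The four questions above (what, how, who, where) can be captured in a small, consistent structure. The following is a minimal sketch in Python, not a prescribed format: the class name `MinimumViableRunbook`, its fields, and the example entries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MinimumViableRunbook:
    """Hypothetical container for the four questions a runbook answers."""

    title: str
    what: list[str]   # where to look to diagnose the problem
    how: list[str]    # first actions toward remediation
    who: list[str]    # people or teams to notify or escalate to
    where: list[str]  # tools and channels used during the incident

    def missing_sections(self) -> list[str]:
        """Return the names of sections that were left empty."""
        sections = {
            "what": self.what,
            "how": self.how,
            "who": self.who,
            "where": self.where,
        }
        return [name for name, items in sections.items() if not items]

    def render(self) -> str:
        """Render the runbook as plain text, always in the same section order."""
        lines = [self.title]
        for name in ("what", "how", "who", "where"):
            lines.append(f"{name.upper()}:")
            lines.extend(f"  - {item}" for item in getattr(self, name))
        return "\n".join(lines)


# Example runbook with made-up content; the "who" section is left empty on purpose.
runbook = MinimumViableRunbook(
    title="Checkout API latency (example)",
    what=["Check the latency dashboard in your monitoring tool", "Check the status page"],
    how=["Restart the affected server", "Add capacity if load is high"],
    who=[],
    where=["Incident chat channel", "Shared incident notes"],
)

print(runbook.render())
print("Missing sections:", runbook.missing_sections())
```

A check like `missing_sections()` is one cheap way to catch a runbook that skips one of the four questions before anyone has to rely on it at 2 AM.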
A standardized, succinct runbook is essential for an efficient on-call response. According to our State of On-Call Report, incident response accounts for 73% of an incident’s lifecycle on average. So, you can see the importance of shortening the time spent in the incident response phase of the lifecycle.
By standardizing the format of your runbooks, on-call engineers know exactly where to look to find the information they need. And when other people or teams get looped into a problem, they also know exactly how to navigate the runbook, even if it’s an issue they’ve never responded to. Something as simple as a standardized runbook format will greatly shorten the time your team spends responding to incidents.
This leads to the importance of intelligent runbook visibility. By centralizing your incident data in-line with chat and runbooks, you’ll have a record of the entire incident response and remediation process. By using an incident management tool that automates alerts and consolidates all of this information into one place, you can improve overall incident visibility.
Also, you can use a rules engine to automatically surface runbooks when an incident occurs. Through automation, on-call responders can immediately see the runbook and the incident context they need to quickly remediate the issue.
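To make the idea of rule-based runbook surfacing concrete, here is a minimal sketch in Python. It is not how any particular incident management tool works: the `Rule` class, the `surface_runbooks` function, the alert fields, and the `wiki.example.com` URLs are all assumptions made up for illustration. Each rule simply checks whether a substring appears in one field of an incoming alert.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical matching rule: attach a runbook when an alert field contains a substring."""

    alert_field: str  # alert field to inspect, e.g. "service"
    contains: str     # substring that must appear in that field
    runbook_url: str  # runbook link to surface when the rule matches


def surface_runbooks(alert: dict[str, str], rules: list[Rule]) -> list[str]:
    """Return the runbook links for every rule that matches the alert."""
    matches = []
    for rule in rules:
        value = alert.get(rule.alert_field, "")
        # Case-insensitive substring match; a missing field never matches.
        if rule.contains.lower() in value.lower():
            matches.append(rule.runbook_url)
    return matches


# Placeholder rules and alert payload, for illustration only.
rules = [
    Rule("service", "checkout", "https://wiki.example.com/runbooks/checkout"),
    Rule("message", "disk", "https://wiki.example.com/runbooks/disk-full"),
]

alert = {"service": "checkout-api", "message": "p99 latency above threshold"}
annotated = surface_runbooks(alert, rules)
print(annotated)
```

Keeping the rules as plain data means they can be adjusted after a post-incident review without touching the matching logic.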
After incident resolution, your job’s not done. Every team should conduct thorough post-incident reviews to identify what went well and find ways to improve workflows. With a centralized platform for on-call schedules, alerting, and incident response, you’ll have a detailed end-to-end history for every issue that occurs. You can then use this information to deepen the reliability of your infrastructure and ensure your team is more prepared for future incidents.
Don’t forget to take the learnings from your post-incident reviews (PIRs) and inject them into your runbooks. What was helpful in the runbook? What hindered the on-call engineer from figuring out the issue sooner? Was there anything in the runbook that was confusing to the person on-call? If so, you might want to look for some on-call tools that would make incident response easier.
It’s important to take just a bit of time after a post-incident review to improve your runbooks.
Minimum viable runbooks require a small investment of time compared to the amount of time they save teams during incident response and remediation. On-call teams will thank you for providing the contextual information they need to avoid firefighting in the dark.
Runbooks improve collaboration and spread system knowledge across multiple teams and people in a highly transparent way. Build minimum viable runbooks to champion incident management and start making on-call suck less.
For the deep dive on minimum viable runbooks, you can download our free guide here. Learn exactly how and why runbooks make on-call suck less and lead to faster incident response and remediation.