Runbooks, sometimes referred to as playbooks, are standardized documents containing information and procedures for resolving common IT or DevOps incidents. Runbooks walk through the steps for resolving recurring issues, showing exactly what needs to be done to fix the problem. Through runbooks, people unfamiliar with an incident get the instructions and context they need to diagnose and resolve it.
There are two types of runbooks: specialized and generalized. Both types provide documentation and processes for checking systems and resolving incidents as they occur. However, specialized runbooks focus on fixing specific issues with a server or application, whereas generalized runbooks focus on day-to-day checks to make sure things are running smoothly.
To make runbooks easier for you, we’ve created a checklist for building your runbooks, both specialized and generalized:
Your team should meet and discuss what’s necessary in your runbooks and determine ways to standardize documentation. Designing runbooks for the on-call engineer is important for rapid incident remediation and workflows. The runbook checklist goes over some great ways to standardize runbooks and keep them actionable:
Keep all of your runbook documentation in one place, easily accessible by any parties who may need it. That way, in case of an incident, everyone knows exactly where to look for their runbooks.
Formatting should be consistent across all of your runbooks. This way, SysAdmins, IT professionals, DevOps practitioners, or anyone on-call can quickly navigate the documentation and find the information they need.
Standardize the language used in each runbook and for the issue it covers. Consistent terminology helps avoid confusion when approaching a problem or an incident.
Runbooks should follow a naming convention. When every runbook is named according to a consistent convention, people can easily find the documentation they need, when they need it, which speeds up incident diagnosis and resolution.
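A naming convention is easiest to keep when it can be checked mechanically. The sketch below assumes a hypothetical convention of team, service, and symptom in lowercase, separated by hyphens (for example, payments-api-high-latency). The pattern and the function names are illustrative assumptions, not a prescribed standard.

```python
import re

# Hypothetical convention: <team>-<service>-<symptom>, lowercase and
# hyphen-separated, e.g. "payments-api-high-latency". Adjust to your own standard.
RUNBOOK_NAME = re.compile(r"^[a-z0-9]+(-[a-z0-9]+){2,}$")


def is_valid_runbook_name(name: str) -> bool:
    """Return True if the name has at least three lowercase hyphen-separated parts."""
    return RUNBOOK_NAME.fullmatch(name) is not None


def find_misnamed(names: list[str]) -> list[str]:
    """Return the runbook names that break the convention, in input order."""
    return [name for name in names if not is_valid_runbook_name(name)]
```

Running a check like this when a runbook is added or renamed is one way to catch a name that drifts from the convention before anyone has to search for it during an incident.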
Outline the goals and requirements of each runbook. In each runbook, specifically lay out the process that needs to be undertaken to resolve the underlying issue. If the runbook isn’t actively helping someone remediate an incident, then it needs to be redeveloped or deleted. Make sure the runbook is achieving goals that align with key business metrics.
Adhering to the runbook style guide checklist will allow you to craft more actionable runbooks. We’ve also put together a basic checklist of things you’ll most likely want to include in your runbooks, whether they’re generalized or specialized:
Check all logs for security threats, application errors, database failures, and similar problems. Someone needs to regularly go through the checklist and make sure that the security, application, and system logs show no urgent errors. If there is an issue, it needs to be prioritized and resolved quickly to stabilize the system.
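Part of a routine log check can be scripted. The following is a minimal sketch that counts urgent lines by severity keyword. The keywords ERROR, CRITICAL, and FATAL and the threshold parameter are assumptions; real log formats vary, so the pattern would need adjusting.

```python
import re
from collections import Counter

# Hypothetical severity keywords; change these to match your log format.
URGENT = re.compile(r"\b(ERROR|CRITICAL|FATAL)\b")


def count_urgent(lines):
    """Count log lines containing an urgent severity keyword, grouped by keyword."""
    counts = Counter()
    for line in lines:
        match = URGENT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def needs_attention(lines, threshold=1):
    """Return True when the number of urgent lines reaches the threshold."""
    return sum(count_urgent(lines).values()) >= threshold
```

For example, count_urgent can be given an open log file directly, since iterating over a file yields its lines.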
Regularly back up system files and ensure that all files and active directories are being properly backed up. Make sure that backups are secure, that data is retrievable, and that backups run often enough. Clean up and get rid of any unnecessary data, but make sure information is properly backed up before deleting it.
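A backup check can confirm both that a backup is recent and that its contents are intact. The sketch below assumes a SHA-256 checksum was recorded when the backup was written and that a backup older than 24 hours counts as stale. Both the recorded checksum and the age limit are assumptions made for illustration.

```python
import hashlib
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_healthy(backup: Path, expected_sha256: str,
                      max_age_hours: float = 24.0) -> bool:
    """A backup is healthy if it exists, is recent, and matches its recorded checksum."""
    if not backup.is_file():
        return False
    age_hours = (time.time() - backup.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        return False
    return sha256_of(backup) == expected_sha256
```

A matching checksum only shows the file is unchanged since it was written. A periodic test restore is still the way to confirm the data is actually retrievable.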
Double-check your monitoring tools and the system’s overall performance. Is CPU usage spiking? Is ETL lagging? Review your alert history, prioritize any high-severity incidents, and ensure that everything is functioning normally. Monitoring and alerting on system performance issues will help you create more reliable software and expose areas for improvement in your infrastructure.
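Reviewing alert history by severity can be as simple as a sort. This sketch assumes a hypothetical four-level severity scale and a minimal Alert record. Real alerting tools expose their own fields and levels.

```python
from dataclasses import dataclass

# Hypothetical severity scale, most urgent first.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Alert:
    name: str
    severity: str


def prioritize(alerts):
    """Sort alerts most severe first; unknown severities go last.

    sorted() is stable, so alerts of equal severity keep their original order.
    """
    return sorted(
        alerts,
        key=lambda alert: SEVERITY_ORDER.get(alert.severity, len(SEVERITY_ORDER)),
    )
```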
Check how much spare hard drive space you have in case servers begin to fill up. In the age of cloud computing and distributed systems, assessing the state of physical equipment is less often a problem for teams. Still, it’s always important for server hosts to make sure that hardware is in a good state. Double-check monitoring thresholds and make sure alerting tools are working the way you’d like them to. Then, last but not least, go over any action items from a recent post-incident review and work on implementing those changes.
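The disk space check can be sketched with Python’s standard library. The 80 percent warning and 90 percent critical thresholds below are assumed values chosen for illustration; pick thresholds that match your own alerting policy.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(percent_used: float, warn: float = 80.0, critical: float = 90.0) -> str:
    """Map a usage percentage to a hypothetical alert level."""
    if percent_used >= critical:
        return "critical"
    if percent_used >= warn:
        return "warning"
    return "ok"
```

Calling classify(disk_usage_percent("/")) gives the level for the root filesystem. Comparing the result with what your monitoring tool reports is one way to confirm that its thresholds fire as intended.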
At the top of a specialized runbook, give a brief description of the problem addressed by the documentation. That way, someone can tell at a glance whether they’re working in the correct runbook.
Give a brief outline of everything affected by the issue. Whoever is responding to the incident can quickly make sure they have the access or expertise required to look at the problem. If not, they can loop in the proper people and work collaboratively around the runbook.
A very quick list of the tools and applications you’ll need to remediate the issue is always helpful. Again, the on-call person can easily determine whether or not they have the means to resolve the issue on their own.
Now you can dive into the nitty-gritty. The majority of the runbook should consist of specific step-by-step instructions for fixing the issue, including screenshots when helpful. By following this format, the on-call engineer can first identify whether they have the means to respond to the incident and then follow the steps to resolve it. Adding some sort of conclusion to your runbook helps the responder understand when an issue is fully resolved.
At the bottom of your runbook, you should include a link to fill out post-incident review documentation. From there, you can start working on a post-incident review and learn from the problem.
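The sections of a specialized runbook described above can be captured as a simple template, which makes it possible to check that none are missing. The class and field names in this sketch are hypothetical and chosen only to mirror those sections.

```python
from dataclasses import dataclass


@dataclass
class SpecializedRunbook:
    description: str               # brief statement of the problem addressed
    affected_systems: list[str]    # everything affected by the issue
    required_tools: list[str]      # tools and applications needed to remediate
    steps: list[str]               # step-by-step resolution instructions
    resolution_check: str          # how to tell the issue is fully resolved
    post_incident_review_url: str  # link to post-incident review documentation

    def missing_sections(self) -> list[str]:
        """Return the names of sections left empty, in template order."""
        return [name for name, value in vars(self).items() if not value]
```

A runbook whose missing_sections() result is not empty is a candidate for revision before the next incident relies on it.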
After resolving the incident and filling out the post-incident review, you need to determine action items. These action items can cover improving monitoring or alerting processes, speeding up incident response, or improving the runbook itself. Based on previous incident data, figure out what you need to add to or remove from the runbook in order to make it more actionable.
Runbooks are just one important aspect of incident response and remediation. Download our free Incident Management Buyer’s Guide to learn about other necessary functionality for creating a holistic incident management process.