Nick Isaacs - November 13, 2014
Here at VictorOps, our mission is simple: make on-call suck less. We seek to enable those who develop and maintain great software to keep systems running, and decrease the Time To Resolution for you and your customers. To help achieve that end, I wanted to take some time to discuss at length the glamorous, sexy, high-stakes world of Runbook best practices and how they relate to the on-call experience. This post will discuss how good runbooks are created, used, and updated to help fix problems.
We believe that receiving an alert is the start of the on-call experience. Fixing the problem, documenting the solution, and eliminating the root cause are where the real work begins. A great way to close the loop on alerts is to have an on-call handoff meeting with the next on-call person. Things to cover in a handoff might include:
- Discussing each alert received
- Were these alerts actionable?
  - If so, append your solution to the system runbook if not already present
  - If not, make sure to take note of the source of the alerts and seek to fix.
New code should receive the same attention, so that it is ready to be maintained the day it ships. It’s no secret that we developers hate documenting our own code (“Documentation? Why don’t you just read the code!”), but biting the bullet and writing some explanation goes a long way toward helping our brothers (and sisters) in arms fix problems that arise in production. Some easy ways to help reduce the Time To Resolution for new features/fixes are:
- Having a runbook in place when new application code is released
- Having well-versioned documentation
- Including what has changed (organized by version)
- Documenting which pieces of the system are monitored
A great use of the new VictorOps annotations is to link to runbooks. We have built our runbooks with GitHub Wikis. We create runbook pages to reflect releases (by date), as well as entries to address common alerts that we receive (high volume alerts, system slowdown alerts, disk space alerts, etc.). The clearer the instructions on how to solve a problem, the better. Whether or not these alerts should be addressed from an engineering perspective is the realm of product management; for this post, we will assume that all common alerts, features, and releases should have associated runbooks.
To keep our runbooks useful and actionable, we strive to cover a few key topics when writing them. These topics are:
- How to triage the problem
- A desired outcome
- An escalation path if you get stuck
Triage of the problem mainly consists of answering the question “Is this a customer-facing issue?” If it is, we must drop everything and get to fixing. The desired outcome may be different for every issue (aside from the general “make the alert stop”). Desired outcomes speak toward identifying the contributing factors of the alert. A DoS attack, for example, would have a very different desired outcome (identifying the source of the attack and blacklisting it) than a high latency alert. Lastly, we want to make sure that we never leave those on-call without an escape route. Not everyone who takes up the gauntlet of on-call will have the knowledge, access privileges, or experience to fix every problem they encounter.
Having great runbooks is only half of the story; the other half is how easy they are to access when they are needed. We want to get meaningful, helpful information into your hands as fast as possible. To do this, we created Annotations. Annotations allow you to transform and append fields of alerts and display relevant data.
There are two basic features that Annotations perform: changing the data of alerts, and adding custom fields to alerts. In addition to this, we have added the ability to preview this important data natively, without ever having to leave the VictorOps application. Need to link to a runbook for a particular problem? Done. Interested in adding a Graphite graph based on host? No problem. What about rick-rolling the guy on call? Hilarious, but karma will not forget. Annotations help get data to those who need it.
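To make the two operations concrete, here is a minimal Python sketch that treats an alert as a plain dictionary. It is an illustration only, not how VictorOps implements Annotations: the function names, the state_message and runbook field names, and the wiki.example.com address are all hypothetical.

```python
def transform_field(alert, field, func):
    """Return a copy of the alert with one existing field rewritten by func."""
    updated = dict(alert)
    if field in updated:
        updated[field] = func(updated[field])
    return updated


def add_field(alert, name, value):
    """Return a copy of the alert with one extra custom field attached."""
    return {**alert, name: value}


# Hypothetical alert payload; the field names are assumptions.
alert = {"HOST": "db01", "state_message": "disk at 91%"}

# Operation 1: change the data of an alert.
shouted = transform_field(alert, "HOST", str.upper)

# Operation 2: add a custom field, here a runbook link.
linked = add_field(alert, "runbook", "https://wiki.example.com/runbooks/disk-space")
```

Both helpers return a new dictionary rather than editing the alert in place, so the original payload stays available for any later rule.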
Let us assume we have the following annotation structure:
This would indicate that if we get an alert with a HOST field that matches the regex db*, we would get a link to the annotation_url if it exists (annotation_url is a common Nagios field for appending notes to alerts). If we instead got an alert with a SERVICE field that matches sms, we would receive an annotation that links to the sms page of our wiki. The limit of annotations is that there is no extra processing that can happen outside of direct substitutions. There is nothing in place to prevent the following (contrived) annotation rule:
from generating the following output: https://www.githhub.com/our_org/our_project.wiki.com/https://app.domain.com/foo/bar/baz .
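The matching and direct substitution described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the VictorOps rule format: the rule layout, the ${...} placeholder syntax (borrowed from Python's string.Template), the wiki.example.com address, and the app pattern used for the contrived rule are all hypothetical.

```python
import re
from string import Template

# Hypothetical rule layout: (alert field to test, regex, link template).
# The ${name} placeholders are string.Template syntax, an assumption of
# this sketch rather than the real annotation syntax.
RULES = [
    ("HOST", r"db*", "${annotation_url}"),
    ("SERVICE", r"sms", "https://wiki.example.com/runbooks/sms"),
    # Contrived rule: a fixed prefix glued onto a field that is already a URL.
    ("HOST", r"app",
     "https://www.githhub.com/our_org/our_project.wiki.com/${annotation_url}"),
]


def annotate(alert):
    """Return the links produced by every rule whose pattern matches the alert."""
    links = []
    for field, pattern, template in RULES:
        value = alert.get(field)
        if value is None or re.match(pattern, value) is None:
            continue
        try:
            # Direct substitution only: field values are pasted in verbatim.
            links.append(Template(template).substitute(alert))
        except KeyError:
            # The template names a field this alert does not carry; skip it.
            continue
    return links


db_alert = {"HOST": "db01", "annotation_url": "https://app.domain.com/foo/bar/baz"}
app_alert = {"HOST": "app01", "annotation_url": "https://app.domain.com/foo/bar/baz"}

db_links = annotate(db_alert)    # the annotation_url, unchanged
app_links = annotate(app_alert)  # the doubled URL from the contrived rule
```

Because nothing inspects the value before it is pasted in, the contrived rule happily produces a URL inside a URL. One more caveat of this sketch: read strictly as a regex, db* means a d followed by zero or more b characters, so re.match would also accept a host named dev01; a stricter pattern may be worth considering in a real rule.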
With this limitation in mind, we are free to get as creative as we want with our annotation scheme. How about using alert fields as query string parameters? Wouldn’t that be clever?
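As a rough sketch of the query string idea, the Python below builds a link whose parameters come straight from alert fields. The runbook_link helper, the search address on wiki.example.com, and the choice of fields are assumptions made for illustration; none of them come from VictorOps.

```python
from urllib.parse import urlencode


def runbook_link(base_url, alert, fields):
    """Build a link whose query string carries the selected alert fields."""
    # Only include fields the alert actually has, in the order requested.
    params = {name.lower(): alert[name] for name in fields if name in alert}
    return f"{base_url}?{urlencode(params)}" if params else base_url


# Hypothetical alert and wiki search endpoint.
alert = {"HOST": "db01", "SERVICE": "disk space"}
link = runbook_link("https://wiki.example.com/runbooks/search", alert, ["HOST", "SERVICE"])
# link == "https://wiki.example.com/runbooks/search?host=db01&service=disk+space"
```

Note that urlencode escapes the space in the SERVICE value here. Since annotations only perform direct substitution, a real rule would paste field values in unescaped, so this approach works best with fields whose values are already URL-safe.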
We love runbooks here at VictorOps; they serve to make the lives of those on-call easier and less stressful. If your team is thinking about starting a runbook discipline, start small: try adding a runbook for the two or three most common alerts and expand from there.
If your team already has a great selection of runbooks, make sure these files are not write-only. Update, or even delete, runbooks that no longer reflect the true state of your systems. Thanks for joining us, and as always, we would love to hear feedback about our Annotations and any other insight you would like to share. What did you like? What didn’t you like? Feel free to reach out to us at any time.
Until next time, happy on-call! :)