Matthew Boeckman - October 12, 2017
The Remediation phase of the Incident Lifecycle is the action-packed firefight. The main event. The diagnosis. The fix. The attempted fix. The workaround. In this phase we address whatever pattern created an incident in the Detection phase of the lifecycle.
Teams who focus on reducing Mean Time To Repair (MTTR) rightly focus heavily on the Remediation phase. While improvements to Detection and Response can be meaningful, the majority of most incidents burn through that TTR metric in Remediation.
Remediation is a tricky place to be. Certainly, simple incidents can and should be resolved quickly. However, as incidents uncover unexplored areas of our stack, or cascade into full-blown events, things get complicated.
It’s all about the Data
Understanding a behavior is the first step to changing it – true with people, true with systems. Incident Management spends time and capital building robust systems to collect and analyze metrics. Time-series systems like Prometheus or InfluxDB are incredible at getting a team settled in the reality of certain metrics over time, letting you know that yesterday this was 37, yet today it is 148. Is that normal? Is that expected? Does it help us remedy the incident we’re working?
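One way to approach the "is that normal?" question is to compare the current reading with recent history. The following is a minimal sketch, assuming you have already pulled a handful of recent samples out of your time-series system; the function name and the sample values are hypothetical, and a simple standard-deviation check is only one of many possible baselines.

```python
from statistics import mean, stdev

def deviation_from_baseline(history, current):
    """Return how many standard deviations `current` sits from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two samples
    if sigma == 0:
        # A perfectly flat history: any change at all is infinitely unusual.
        if current == mu:
            return 0.0
        return float("inf") if current > mu else float("-inf")
    return (current - mu) / sigma

# Hypothetical samples: the metric hovered around 37, and today it reads 148.
history = [35, 37, 36, 39, 38, 37, 36]
z = deviation_from_baseline(history, 148)
print(f"current value is {z:.1f} standard deviations from the recent baseline")
```

A large result does not say the metric caused the incident, only that it is worth a responder's attention.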
Log analytics systems like Splunk or Sumo Logic provide a single point of entry to view and search through logs and events generated by our stacks. Error 472 occurs at low volume frequently but has a small spike correlating with this incident timing. Is it significant?
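A quick way to judge whether a spike like that is significant is to count occurrences just before the incident and just after it began. This is a rough sketch, assuming the event timestamps have already been exported from your log system; the function name, the 15-minute window, and the sample timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta

def counts_around(timestamps, incident_start, window):
    """Count events in the window before the incident and the window after it began."""
    before = sum(1 for t in timestamps if incident_start - window <= t < incident_start)
    during = sum(1 for t in timestamps if incident_start <= t < incident_start + window)
    return before, during

# Hypothetical data: a trickle of errors beforehand, more once the incident starts.
incident_start = datetime(2017, 10, 12, 3, 0)
window = timedelta(minutes=15)
error_472_times = (
    [incident_start - timedelta(minutes=m) for m in (12, 4)]
    + [incident_start + timedelta(minutes=m) for m in (1, 2, 2, 3, 5, 6, 8, 9, 11)]
)

before, during = counts_around(error_472_times, incident_start, window)
print(f"Error 472: {before} in the window before, {during} in the window after")
```

Comparing equal-length windows keeps the two counts directly comparable without having to compute a rate.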
Whatever your approach to metrics, helping an incident responder quickly get to the most relevant data is key. Just as important is helping them understand that data within the context of the system or service under investigation.
Runbooks pave the way
Wherever your team is on the path to reduced MTTR, creating, updating, improving, and enhancing runbooks is always a valuable activity. Runbooks can be particularly helpful as DevOps adoption grows because developers may be unfamiliar with infrastructure, and operations may be unfamiliar with applications. A solid runbook provides the first breadcrumbs for any responder, whether they personally wrote the service, or they joined the team on Tuesday.
Working on runbooks in this way can certainly aid your team in faster remediation of incidents. When arguing for a focus on runbooks, remember that runbooks can have a much broader impact on a team. As new members join an engineering group, or an on-call rotation, getting up to speed is tough, and runbooks can mitigate onboarding issues. For seasoned veterans, this kind of brain dump helps to free them from the “only one who knows how that thing works, call them!” mentality.
Every team has a slightly different approach and template to their runbooks. I’ll provide here a basic template that you can either steal entirely or riff on for your own designs.
The overview is typically a short but meaningful description of the system or service. The description should include all the basics: language, environment, and an architecture diagram. Assume the reader has little familiarity, and paint a broad picture.
List out all the known operational procedures for the system. Including, but not limited to:
What does this system rely on? What, in turn, relies on it? Dependency tracking is as important for Incident Management as it is in development. If a system or service is failing, it almost certainly created a pileup of errors upstream and downstream in its workflow. Remediation of an incident must include a remedy of consequences, so helping a responder know where to look for cascading problems is important.
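If the dependency section is kept in a machine-readable form, a responder can ask both questions directly. The sketch below assumes a hand-maintained map from each service to the services it relies on; the service names and function names are hypothetical.

```python
from collections import deque

# Hypothetical dependency map: service -> services it relies on.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["postgres"],
    "inventory": ["postgres", "cache"],
    "postgres": [],
    "cache": [],
}

def reachable(graph, start):
    """Breadth-first walk returning every node reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def invert(graph):
    """Flip the edges so each service maps to the services that rely on it."""
    inverted = {name: [] for name in graph}
    for name, deps in graph.items():
        for dep in deps:
            inverted.setdefault(dep, []).append(name)
    return inverted

def blast_radius(graph, service):
    """Answer both runbook questions for one service, including indirect links."""
    return {
        "relies_on": sorted(reachable(graph, service)),
        "relied_on_by": sorted(reachable(invert(graph), service)),
    }

print(blast_radius(DEPENDS_ON, "postgres"))
```

The walk follows indirect links too, so a database failure surfaces the services two hops away that may need attention once the database is healthy again.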
The status section may include embedded images from dashboards or known key metrics, along with links out to the relevant metric and detection systems associated with this system. Some people may push back on this idea; however, if you’re serious about reducing MTTR, making it easy for someone to find relevant data is a good thing. Hunting through Grafana sounds like fun in concept, but when it’s 3:00 am and the world burns, you should have all the information you need in a brainless click.
Who to call is an essential, often overlooked, topic in runbooks. I have previously advocated for Blameless Escalations; success in Incident Management requires a clear escalation path. If things are down, don’t fritter away precious minutes scrolling through the company directory. Know who to contact when you’ve exhausted your options, and know how to do it quickly.
Assuming that all responders in a team are familiar with all alerts generated from all detection systems is an absurd bar. It’s equally ridiculous to assume that a responder is familiar with all the alerts in a single detection system. Certainly some alerts need little explanation – “Disk Full” means, quite obviously, the disk is full. Nevertheless, in this section of a runbook you should provide some helpful context for the other, more ambiguous, alerts. ‘Process_max_fds’ may be perfectly clear to you, but don’t assume it is to your teammates.
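Alert context can be as simple as a lookup keyed by alert name. The following is a small sketch; the explanatory text for each alert is illustrative wording rather than an authoritative definition, and the function name is hypothetical.

```python
# Illustrative context only; replace with what each alert means in your stack.
ALERT_CONTEXT = {
    "Disk Full": "The disk is full. Check for runaway logs or temp files.",
    "Process_max_fds": (
        "Assumed meaning: the process is near its open file descriptor limit. "
        "Look for leaked connections or unclosed files."
    ),
}

def describe_alert(name):
    """Return runbook context for an alert, or a nudge to document it."""
    fallback = f"No runbook context for {name!r} yet; add some after the incident."
    return ALERT_CONTEXT.get(name, fallback)

print(describe_alert("Process_max_fds"))
```

The fallback message doubles as a reminder that an undocumented alert is a gap in the runbook.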
Does this application hang frequently, requiring a restart? When a system upstream from it fails, does this application require manual queue grooming? Are we aware of a bug that causes this to error when connection count is north of 200? This is a dynamic section of your runbooks, and should be updated as new patterns assert themselves.
Another way to accomplish “Known Failure Conditions” is to integrate with your ticketing system and display the incident history associated with a system or service. Responders who know the story are more likely to remediate the current condition faster.
Encourage your incident responders to update runbooks as part of their response to an incident. Begin by ensuring your service has returned to a healthy state, then take 10 minutes to update the runbook. Did you discover an undocumented dependency? Was a particular metric helpful? Write it down in the moment. Don’t wait and hope your tomorrow self will remember!
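The template described above can also be captured as a small data structure, which makes it easy to see which sections of a runbook are still empty. This is only a sketch: the field names are my own shorthand for the sections discussed, not a standard schema, and the example service is hypothetical.

```python
from dataclasses import dataclass, field, fields

@dataclass
class Runbook:
    """Hypothetical field names, one per template section."""
    service: str
    overview: str = ""
    procedures: list = field(default_factory=list)
    dependencies: list = field(default_factory=list)
    status_links: list = field(default_factory=list)
    escalation_contacts: list = field(default_factory=list)
    alert_notes: dict = field(default_factory=dict)
    known_failures: list = field(default_factory=list)

def missing_sections(runbook):
    """List the sections that are still empty, in template order."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]

# A freshly started runbook with only an overview filled in.
draft = Runbook(service="checkout", overview="Order checkout API.")
print(missing_sections(draft))
```

Running a check like this across your services is one way to decide where the next 10 minutes of runbook work should go.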
Many readers will look at this template, and their own runbooks, and get a little bewildered as they imagine the effort necessary to implement a great runbook. This is why every iteration needs dedicated time for Incident Management. You almost certainly cannot, and should not, try to write runbooks for everything, everywhere, all at once. Start with the most critical systems, or the noisiest, and build from there. With focus and effort, you’ll soon find your team not only resolving issues faster but also building a massive knowledge base behind them to grow their success.