Matthew Boeckman - February 17, 2017
Organizing an on-call roster is tough. Aligning skills, experience, and availability with specific application technologies is difficult. In most cases you settle for “close enough” and hope smart people make good decisions. Skills and scheduling are really only the beginning; effective incident management requires focus on how your on-call team operates.
**I like to think of team dynamics on two dimensions. **
First, there is the structural organization of the team–people playing roles, workflows, and escalation paths. This is important to on-call teams because it impacts the procedural behavior of incident management.
Second, there are the cultural aspects. This dynamic is important to on-call teams because we’re all likely to interact in high pressure moments in the middle of the night. How we treat each other matters.
Many responses in Incident Management can be managed down to simple runbooks or automated completely. These simple, low diagnostic threshold incidents are the bread and butter of an on-call team. Nearly anyone skilled enough to be in rotation can meaningfully respond to the incident. Java crash? Restart the app.
Sometimes, the cause of an alert or the necessary diagnosis exceeds a specific responders’ abilities. While (hopefully) less frequent, these kinds of alerts tend to indicate something is wrong in a larger sense. Whether this is just a tricky problem, or Rome is actually burning, the first responder has to know how to get help.
It’s important to provide that escalation path in the simplest, easiest method possible. At 3am, while a database burns, you don’t want a Unix admin breaking out a call-down tree. Every moment the response team spends notifying others, or escalating, is a moment of effective diagnostic or remediation wasted. How you set up your on-call team for this moment can turn the tide of a firefight.
We can all imagine the pain suffered by a first responder when they call for help and no one comes. Certainly that’s a problem that must be addressed.
Paradoxically it can actually be worse when the first responder calls for help, and everyone shows up. Without direction, a dozen engineers trying to crowd solve a problem is a recipe for disaster. This is where two new escalation roles begin to show value: an Incident Commander (or Quarterback), and an Incident Communicator (or Scribe).
The Quarterback is responsible primarily for directing and de-duplicating the efforts of the team. They provide ad-hoc organization to the swarm, and ensure that some of the team is focused on resolution of the proximate cause, and some of the team focuses on remediation or mitigation efforts if a timely resolution is not likely. The Quarterback likely also plays the role of risk advisor and assessor. If a fix is risky, they’re making the call.
The Scribe manages inbound and outbound communication with others in the engineering team, or others in the business. A Scribe can also manage internal documentation of the event itself–capturing investigatory threads, who’s looked at what, and changes made “in the heat of battle” in a simple, easy-to-digest document. As new team members join the response party, scrolling back the last 200 lines of chat is not practical, but having a running synopsis of what’s going on is desirable.
To further illustrate the idea, let’s take an incident pattern and walk through the firefight.
04:07-04:15 - Series of alerts are triggered indicating slow response times on application servers. Alerts routed to an individual on the platform on-call team.
04:09-04:23 - Series of alerts are triggered indicating high CPU and Memory utilization on application servers, high CPU and high Disk I/O on a database system. Alerts routed to an individual on the ops on-call team.
Question: What’s going on? Clearly something bad.
04:30-04:35 Both teams declare an emergency and escalate to the Quarterback and Scribe on-call. They join the chat and begin getting up to speed.
Question: What (if any) queries are responsible for the high DB load? Is the DB healthy? What (if any) application calls are churning through that CPU? Are the App servers healthy? Any recent changes going out? At a high level, is traffic to the application within expected norms for the time period?
Scribe begins manual escalation and notification of SME’s or others deemed necessary to help. Scribe begins the running summary document, or otherwise documenting the incident. Public facing status pages are updated. They notify Customer Support, and key business stakeholders of the event, and invite them to monitor the summary document.
Backend team isolates high CPU calls to a few specific operations, each hammering the DB, and all triggered from a few common URL patterns.
Database team confirms the database is generally healthy, but high volume of expensive queries coming from aforementioned application operations.
Quarterback directs new level of inquiry focused on the application; while directing the database team to investigate reducing query execution time.
Quarterback requests additional escalations to network team to investigate potential traffic rerouting, Scribe works through getting them up to speed.
Question: Where is the traffic coming from? What is it triggering this condition now? Did new code go out recently? Any system changes? Any security concerns with the traffic?
I won’t continue the tedious play-by-play in this example, you get the idea. Lots of people, running in different directions, quickly. Multiple independent investigative threads are going, avenues explored and abandoned, individuals are focused and refocused against new questions. All of which must be captured to prevent duplicative effort, and provide a clear narrative for future postmortem analysis, and to new team members joining the fight.
As this example plays out, each team is focused on their functional area of expertise and responsibility. Transparency to stakeholders is maintained without distracting the responders, and a strong narrative is built enabling handoffs, efficient triage, and an eventual postmortem analysis.
While a Quarterback is almost always a senior member of the team, the Scribe role is a perfect place to introduce junior or new team members to the on-call process. Scribes can come from any background, and get an opportunity to observe the diagnostic and remediation work without bearing the technical burden personally. Lastly, a Scribe is an active, meaningful participant, who will both add more value in, and extract more out, than passive observers or ride-a-longs.
I won’t keep you waiting–the proximal cause of this incident was a cache tuning exercise gone wrong. The network team shunts traffic to the offending URLs off to a holding page while the developers prepare, test, and deploy a change to make the application less database-clobbery.
Formal team structure is critical to the success of your firefighting efforts. It sets the team up for success because they won’t be hampered by purely systemic failings during an emergency. How we all behave in the moment, however, is far more impactful to the outcome of a specific incident, and a team’s long-term success.
While much has been said about the importance of keeping after-action analysis blameless, I think it is doubly important to keep escalations blameless. A lone wolf toiling away in solitude makes for a great comic book, but rarely leads to effective resolution of incidents in complex systems.
This basic tenet is often missing in on-call team culture. Individuals want others to respect their abilities. They’re concerned about how they’ll be viewed if they “need help” too often. Previous acrimony creates a reluctance to engage other teams. Managers frequently highlight the work of “Heroes”, establishing a patterned expectation of individual action.
This theme of blameless escalation becomes doubly meaningful to teams expanding on-call to non-traditional teams. Developers, unfamiliar with organizational or industry norms, will be pinned between a reluctance to get help and a fear that they lack the skills to adequately respond by themselves.
Whether you like the ideas of a Quarterback or Scribe, or whether they spur other ideas on how to more effectively manage a complex incident, make sure to keep team composition and the structure itself a data point in the postmortem. Evaluating the efficacy and value of the team is often missed, as teams narrow the focus of a postmortem to the proximate causes and remedies of the incident itself. Always reserve some time during the postmortem to ask how the process went, how the team dynamics played out, and use those answers to inform the next iteration.