Matthew Boeckman - January 20, 2017
Managing on-call teams has always been a challenge in complex environments. With the continued adoption of Continuous Delivery, the challenges are squared. Now, not only do you have to manage a complex environment, the environment is changing dozens of times per day.
On-call today has to be less about a strict execution of predefined procedures, and more about adaptability.
Smart people, acting with good situational context, tend to make the best decisions. Those same smart people must be empowered with the necessary skills and tools, but let’s accept for now that this is already true for your team. The challenge, then, is context, especially for an on-call team that hasn’t been in rotation for four weeks!
It becomes critical that you’re able to effectively and quickly dial them into the current reality of your environment. I like to think of this in three parts:
What unexpected things happened? Those are Incidents.
What expected things happened that have changed the environment? Those are Deploys.
What expected things will be happening during this rotation? Those are the Plans (new deployments, sales promotions, audits, penetration tests, etc).
Handoff sessions have long been a mainstay of team rotations. Whether it is a few Slack messages or a formal postmortem of every incident in the preceding period, a handoff is key. Keeping those meetings focused and efficient is where the VictorOps Timeline feature really adds value. The Timeline is a quick dashboard view into all incidents occurring in your environment for a given time period. It enables a team to quickly review incidents and easily create a postmortem report associated with a single incident, or with a multitude of incidents happening over a period of time.
When you sit down for that handoff session, reviewing postmortems is a required practice. A live discussion with both shifts (the one leaving rotation and the one coming online) can be meaningful in many ways. Certainly, the color of a situation is communicated far better verbally than through a postmortem report, but the discussion should also be a collegial critique of the postmortem itself. Were the events covered in sufficient detail? Too much detail? Were the appropriate runbooks updated? Were post-action tasks completed?
Postmortem review gives you a basis for situational context, but the reality is that it provides only a portion of the picture. Only changes introduced by incidents are covered there, and you’re left to find some other process to bring a team up to speed on code, system, or architectural changes that may have occurred in the intervening time. This can be accomplished in a variety of ways, to be sure. Changelogs, ticketing systems, deployment pipelines, and more are all effective at detailing intentional changes introduced to an environment. However, I prefer to keep team tool-switching limited, and to provide as much of the necessary information as possible in the same system.
A simple approach
VictorOps has provided a solution enabling you to adopt this practice as well. A dummy system group, “deployments,” can be created for all “incidents” that are, in fact, deployments. To get started, set up an escalation policy for that group that does, in effect, nothing: point it at a dummy webhook or a null email address, for example. With that setup in place, you can then add a trigger to any deployment job that you want to create deploy incidents in VictorOps. The practical upshot of this: you get a running history in VictorOps of all incidents (accidental or intentional), providing extremely rich contextual information to your on-call teams.
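As a rough illustration of what such a trigger might look like, here is a minimal Python sketch that a deployment job could run as its last step. It assumes you have a VictorOps REST integration URL whose routing key points at the dummy “deployments” group; the URL is passed in rather than hard-coded. The function names, the `deploy/<service>/<version>` naming scheme, and the choice of `WARNING` as the message type are my own assumptions for the example, and the field names reflect the generic REST integration as I understand it, so check them against the current VictorOps documentation before relying on them.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Optional


def build_deploy_alert(service: str, version: str, deployed_by: str,
                       when: Optional[datetime] = None) -> dict:
    """Build the JSON body for a 'deploy incident'.

    The naming scheme below is hypothetical; adapt it to your own pipeline.
    """
    when = when or datetime.now(timezone.utc)
    return {
        # Assumption: WARNING opens an incident; the "deployments" escalation
        # policy then does nothing with it, so nobody gets paged.
        "message_type": "WARNING",
        "entity_id": f"deploy/{service}/{version}",
        "entity_display_name": f"Deploy: {service} {version}",
        "state_message": f"{service} {version} deployed by {deployed_by}",
        "state_start_time": int(when.timestamp()),
    }


def send_alert(endpoint_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A deploy job would then call something like `send_alert(url, build_deploy_alert("checkout", "1.4.2", "jenkins"))`, where `url` comes from the job’s secret store. Keeping the payload builder separate from the network call makes it easy to test the naming scheme without touching VictorOps.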
I’ve left the most difficult for last. Predicting the future with any certainty is an intractable problem. That said, approaches exist that can be effective at empowering your teams with excellent situational context:
Invite members of Product teams to the handoff meetings to discuss big or risky projects going live in-period.
Invite members of Marketing or Sales to similarly discuss planned promotional activities, sends, or events.
The power of effective handoffs in combination with VictorOps really comes to bear once you’re able to record those planned events in a system. Whether the source is a calendar invite, a cron job, or a tag on a deploy, you can use the same approach described above to update the VictorOps Timeline with reminders.
Incident #527 - Sales Kick Off begins 09:00AM
Incident #603 - Holiday Sale is Live 06:00AM
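If the reminders come from a cron job, the job needs some way to decide which planned events are due. The sketch below is one hedged way to do that in Python: it keeps a list of planned events and returns those starting within a lookahead window, soonest first. `PlannedEvent`, `due_reminders`, and `reminder_message` are hypothetical names invented for this example, not part of VictorOps; the resulting messages would be sent to the timeline with whatever trigger you already use for deploys.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class PlannedEvent(NamedTuple):
    """A planned event worth surfacing to the on-call team (hypothetical type)."""
    title: str
    starts_at: datetime


def due_reminders(events: List[PlannedEvent], now: datetime,
                  lookahead: timedelta = timedelta(hours=1)) -> List[PlannedEvent]:
    """Return events starting in [now, now + lookahead), soonest first."""
    window_end = now + lookahead
    due = [event for event in events if now <= event.starts_at < window_end]
    return sorted(due, key=lambda event: event.starts_at)


def reminder_message(event: PlannedEvent) -> str:
    """Format a reminder as the event title followed by its start time."""
    return f"{event.title} {event.starts_at.strftime('%I:%M%p')}"
```

Run from cron once per lookahead interval (hourly, with the default above), each event is picked up exactly once, provided the job’s runs are evenly spaced.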
While ancillary to the specific act of responding to incidents, this kind of information keeps your teams dialed into the reality of your environment. Heightened awareness empowers those teams to make good decisions, and adapt!
Like anything that happens in a DevOps or Agile environment, iteration is key. Whether you implement the ideas I’ve laid out here or they spur ideas of your own, the advice is the same: implement, test, and iterate! A monthly retro on how handoffs are working, with a willingness to implement change, is how to make handoffs most effective for your team.