The Dev and Ops Guide for Incident Management


A modern approach to creating an on-call process, defining team roles, and getting more sleep.

Ask members of a traditional DevOps team what it’s like to be on-call and you’ll likely hear a variety of answers such as, “it’s part of the job,” “it’s stressful,” or the very direct, “it sucks.” As innovative teams bring non-traditional Ops folks, like developers, into the fold, they need to bring a modern approach with them. This modern incident management framework leans on automation and seeks to accelerate and streamline the often slow-developing on-call process, while also keeping everyone on the same…

Available is the new On-Call

As teams look to grow their DevOps practice, they face many fundamental challenges. Integrating Developer and Ops workflows provides massive lifts in efficiency, but requires focused work. Continuous Deployment offers a step-function in development velocity, while requiring a sea-change in the way Ops manages systems. Sharing responsibility for Applications and Infrastructure across a wider team brings experiential benefits and integrates teams with historical silos. While sharing responsibility for infrastructure is great, the ugly truth of DevOps is that most people don’t want to be on-call.…

U mad bro? Disaster planning for on-call

Disaster. That word gets used a lot in our circles–it’s a trigger to the deepest FUD argument a vendor or colleague can make. A disaster can be defined in any number of ways: the number of customers impacted, revenue loss, or the number of systems impacted. There are many metrics by which a disaster will be judged. For an on-call team, however, the tale of a disaster is told in the minutes and the hours. Much like a security breach, the reality of a systems disaster…

Context through coupling: JIRA for on-call teams

Some of the most visible artifacts of organizational silos in engineering are tools. Visibility into dashboards, workflows, or documentation is cordoned off in separate and often redundant systems. Conway’s Law manifests in tool choices and integrations as much as in application development. As a team matures an Incident Management or DevOps practice, breaking these tool walls is necessary. In this post I’ll explore a low-effort, high-value way that you can extend integration between VictorOps and JIRA to break down those silos, and empower your…

On-Call Ways and Means: A Developer’s Guide

Bringing non-traditional Ops folks, including developers, on-call can be a tricky process. Initial reactions tend to be highly polarized: either total pushback and refusal, or a meek acceptance coupled with fear of the unknown. For the former, understanding the root of the refusal is useful. For the latter, providing clarity and training is important. For those unfamiliar with Incident Management, there are some common misconceptions that fuel a fear of accepting on-call responsibilities. Chief among those are: – I’m going to be woken up for…

On-Call Handoffs: Empowering Adaptability in Incident Response

Managing on-call teams has always been a challenge in complex environments. With the continued adoption of Continuous Delivery, the challenges are squared. Now, not only do you have to manage a complex environment, but that environment is also changing dozens of times per day. On-call today has to be less about a strict execution of predefined procedures, and more about adaptability. Smart people, acting with good situational context, tend to make the best decisions. Those same smart people must be empowered with necessary skills and tools, but…

The State of On-Call Report: This is the Top Takeaway

Let’s get right to the point. On-call people fall into three equal categories: happy, neutral, and miserable. There are specific, consistent reasons why. By reading the State of On-Call 2016-2017 report, you will be armed with methods to reduce the misery and make on-call suck less. Introducing the State of On-Call 2016-2017 report, in which over 800 respondents shared insights about life on-call, infrastructure, culture, costs of downtime, incident management maturity, and DevOps practices.

The Miserable Third: Extremely unhappy on-call respondents suffer from powerlessness to solve problems,…

The State of On-Call 2016-2017 — Kicking off Results Season

We collected the results, crunched the numbers, and are on the verge of launching the State of On-Call 2016-2017 Report. Big thanks to the 800+ people who participated. This Thursday, you’ll get a first look at the findings in a webinar we’re conducting with Alan Shimel and DevOps.com. Please join us. Todd Vernon, Joni Klippert, and I will discuss the survey results, including:
• The factors that correlate with on-call satisfaction versus on-call misery
• Structural and tooling trends
• How DevOps practices impact the on-call

Success Stories for Engineers On-Call

Real-time monitoring and alerting are critical to maintaining the performance and security of your infrastructure. But, with today’s astounding access to data, it is important to use the right technology to manage alerting in a way that’s customized to your environment. If not, alert fatigue will take hold and your teams will lose their ability to respond to incidents quickly and effectively.


Bringing Your Sales Team On-Call

When most people hear the phrase “on-call”, they likely think of doctors, or folks in the medical profession. But those of us in IT know there’s another on-call world: one where Operations & Development Teams alike are awakened at 2 am to a notification telling them their servers are crashing, where Sys Admins are working away at midnight because they got a page saying their website was down, and hopefully, a world where remediation can take place faster.

The Why: Time is of the…