Matthew Boeckman - April 10, 2017
As teams look to grow their DevOps practice, they face many fundamental challenges. Integrating Developer and Ops workflows provides massive lifts in efficiency, but requires focused work. Continuous Deployment offers a step-function increase in development velocity, while requiring a sea change in the way Ops manages systems. Sharing responsibility for Applications and Infrastructure across a wider team brings experiential benefits and integrates teams that have historically been siloed. While sharing responsibility for infrastructure is great, the ugly truth of DevOps is that most people don’t want to be on-call.
Frequently, when we speak with developers there is wide and passionate interest in making things better. Engineers can’t abide broken things, especially things that they helped create. Both developers and administrators recognize that “the application” is really the sum of the Infrastructure and the Software. It’s clear to everyone that solving challenges with the application requires an integrated team.
That’s all well and good, but historically Developers shy away from formally joining on-call rotations. In the past, as I’ve interviewed Operations candidates, they have universally acknowledged that Responsibility extends past conventional working hours. On the other hand, bringing up Responsibility with Developer candidates frequently dead-ends. They are happy to be Responsible as heck between 8 and 5, but that enthusiasm drops off massively past that. The best developers always want to own their code, but are wary of joining a formal on-call rotation.
To most outside observers, Operations jobs are not sexy. “Ops is a service org.” “Admins just clean up the mess.” “Ops is a cost center.” The most visible manifestation of Operations in most organizations is the Helpdesk tech crawling around on the floor plugging things in. The next most visible manifestation is the emails we all receive explaining an outage, apologizing for the downtime, and excusing the team from meetings as they catch up on sleep.
Developers and other non-traditional on-call folks mostly hear the horror stories. They hear about the pain. An entire night’s sleep lost thanks to a failed drive. A three-hour outage to the accounting system thanks to a blown upgrade. Six alerts this week fighting disk space issues. A new critical-path security update from a vendor that broke session management on the load balancer.
Given that glamorous perception, who wouldn’t want to do more Ops work? It’s little wonder that people in non-traditional Ops roles are reluctant to join this exciting career! Throw 24x7x365 on-call responsibility into the mix and the whole package sells itself.
Many of us (myself included) are guilty of an all-or-nothing approach to this. DevOps means Developers are On-call. On-call means 24x7x365 Responsibility for the software and application our team produces. Responsibility means long hours, countless wakeups, having to learn an entirely new skillset, and limited personal freedom. Right?
What if we focus instead on whittling away at the communication barrier? What if we just make developers reachable? Can you DevOps if your Devs are not at parity with their Ops friends’ on-call rotations?
With that context, we here at VictorOps would like to encourage you to consider a new cultural conversation with your teams: A Culture of Availability. No, not that kind of Availability - not uptime, not resilience - but a culture that makes all members of a team available. Available to contact. Available for escalation. Available to help.
During an outage, every second counts, and for the teams in first responder status there are benefits as well. Standard Operations teams lose countless minutes trying to locate an escalation resource. Scrolling through company directories, reading email signatures, or texting friends of friends trying to locate a Subject Matter Expert’s contact information all detract from timely and efficient Incident response. Adopting this idea of Availability in your extended teams puts the power to resolve events in the hands of those on the front lines.
Encouraging your developers, QA, or product teams to join an escalation rotation can make a massive impact on your incident management efforts. By reducing or entirely removing the fear that on-call equals first responder status, a Culture of Availability paves the way for teams to dip their toe in. Along this path, new on-call members can interact with Incidents, observe how rotations work, and participate in firefights and postmortems, all without the stigma or fear of accepting Capital R Responsibility.