Working for VictorOps, and now Splunk, has allowed me to experience the on-call process from several distinct angles. Working in a customer-facing role, I’ve witnessed the full spectrum of DevOps maturity – from downright DevOps mastery to the kinds of nightmare scenarios that haunt the dreams of on-call professionals. As an individual contributor to an on-call team, I helped lead the arduous transition from a noisy, burnout-inducing experience to a far more humane and effective system. I’ve also been a part of building an on-call culture from the ground up, introducing an engineering team to on-call processes for the first time.
What has emerged from these experiences is an overwhelming belief that on-call doesn’t have to suck. Done right, a great on-call culture will have a net positive effect on the entire software lifecycle, simultaneously improving the lives of your on-call engineers and making both production systems and your internal business culture more resilient in the process.
Nailing down a precise definition of DevOps is no easy task. So, translating core DevOps ideas into practical terms and actually applying them is a real challenge for any organization. The good news is, the act of putting developers on-call naturally creates the right incentives for improving the system. As DevOps-minded engineers increasingly provide long-term support for their production code, they have a vested interest in the quality of monitoring, alerting, documentation and tooling. After all, they’re the primary beneficiaries of any improvements that reduce noise and empower on-call engineers to respond more effectively. However, this doesn’t mean that the mere existence of such incentives will produce the desired outcomes.
Think of building your on-call culture like cultivating a garden, in the sense that gardening is a continual quest to reduce or eliminate any environmental factors preventing your plants from thriving. The seeds you choose come with a pre-built genetic blueprint for growth and survival – but you can’t simply scatter them in the wind and expect nature to spontaneously deliver a world-class botanical garden. For the garden to flourish in the first place, the environmental conditions must be favorable, and the garden must be endlessly tended with forethought and care. There is no set-it-and-forget-it product you can buy that will do it all for you – on-call simply doesn’t work that way. If you want a garden, you’re going to have to take up gardening.
Just like gardening, the optimal approach to building an on-call culture depends greatly on a myriad of individual variables. But there is enough commonality between on-call responsibilities to allow some broad generalizations. What follows is a set of core principles that I believe are integral to an effective and resilient on-call culture. Each is a loose grouping of ideas sharing a common theme, all of which roll up into the grand ideal of an on-call culture that’s both highly effective and humane.
Being on-call outside of normal work hours sucks, and we shouldn’t act otherwise. On-call time encroaches on the personal lives of engineers, causing stress and shifting their work-life balance toward the less healthy, less gratifying end of the spectrum. Bad experiences with on-call will reduce job satisfaction and contribute to costly turnover. It’s possible, even common, for on-call to be handled poorly. For this reason, you can anticipate a certain level of cynicism, or at least a healthy skepticism, from engineers being asked to go on-call. Many of them may have direct experience with toxic, burnout-inducing on-call scenarios. Here’s the good news: simply acknowledging these facts and charting a course to something better can help, provided it’s not just lip service.
Having an honest conversation about those potential negative impacts paves the way for the continuous improvement and empowerment that follow. Here are some talking points to consider:
While concern for the health and wellbeing of your core infrastructure is likely an important issue for your engineers, it probably comes second to more personal concerns about the impact on their private lives – and that’s OK. These concerns will only abate with time and positive experiences. But, a frank conversation reassuring everyone about the business’ understanding of this fact is a good first step in the right direction.
One-size-fits-all solutions generally don’t work in this space. Discuss the opportunities for team-level autonomy and creative problem solving. Think laboratories of on-call, where individual teams are free to negotiate their own scheduling, monitoring and alerting – so long as they meet their objectives. Knowing the team has a green light to tackle their own problems is a psychological boost that makes unavoidably tough on-call experiences much more bearable in the short-term.
On-call cultures evolve over time and rarely look anything like their original blueprint. There simply isn’t a way to know exactly how things will eventually come together. Some cherished ideas will, and should, die in the process. Make a commitment to be flexible and roll with the changes.
A good on-call culture requires resilient, well-built systems. But this fact isn’t helpful when you’re tasked with introducing an on-call culture to an engineering team whose production environment isn’t quite there yet. So, identify the areas that are likely to be problematic and solicit ideas for addressing them. After all, building out a great on-call culture is not a detour on the road to greater reliability; it’s the vehicle that takes you there.
Problems will arise, and rarely at a time that’s convenient to deal with. However, it’s possible to build an on-call landscape that reduces the business impact of those problems while still minimizing the impact on the lives of on-call engineers. But this change requires ongoing effort and meaningful investment. There’s no single product you can buy that circumvents this reality – be wary of anything that sounds like a quick fix to all of your on-call woes.
Knowledge silos are extremely common in the tech world. But they can impede a great on-call culture because they concentrate the burden of being on-call on a few individuals. So, you need to acknowledge the gaps and formulate a plan for eliminating them. Then, make it an explicit goal of the team to spread that burden more evenly. The hero culture that silos encourage is toxic – don’t let anyone on your team be a martyr, and don’t celebrate on-call martyrdom.
Burnout is real, and it has very real costs for the organization and for the people involved. Create a space for on-call engineers to talk about work/life balance. The division between work and life tends to shrink as an individual’s financial stake in the success of the company grows. For founders, CEOs etc., there’s virtually no division – work is often your life. Don’t expect salaried employees with little financial stake in the company to feel the same. For them, life is everything that happens outside of work and they’re not going to appreciate the balance being involuntarily shifted in the wrong direction.
Every organization needs greater transparency across all on-call operations, release management, QA, technical support, marketing, etc. When nothing happens in the dark, DevOps-minded engineers as well as all of their business counterparts can make better decisions. The following elements are crucial to an efficient, transparent on-call process:
Establish clear expectations
Procedures and norms should be clearly identified and publicly available
Publicly recognize and praise people for responding to events during their on-call shifts. This is an easy, effective, and entirely free way of improving on-call morale; there’s no excuse for not doing this
Never let anyone struggle on an island
Make a public commitment to reducing the burden that on-call has on employees’ lives
Post-incident reviews matter; do them often
Adding new alerts to the mix has to be discussed and agreed upon by affected individuals before going into production
Formally recognize and celebrate the sacrifices made by on-call team members – public recognition staves off negative impacts on morale
Teams must have full ownership of their on-call process
It can’t be management who controls the alerts while users deal with them (that’s toxic)
Ownership includes the full monitoring and alerting workflow
Ownership also includes escalation processes within the team
Have very high standards for what can and should notify users outside of work hours
There shouldn’t be more than 3 to 5 things that fall into this category
Take the inclusive approach (start with nothing paging and add only the alerts that meet the standard) rather than the exclusive approach (“Everything pages and then we’ll pare it down later” = “We don’t care about our people and their livelihood”)
Get bottom-up commitment from those who will own their solutions and work to make their own lives better
Set up regular meetings to stay accountable and maintain consensus
Action items are items that require action
Be willing to adapt
On-call work doesn’t end – it’s like brushing your teeth, you have to keep doing it until you die, or at least until you don’t have any teeth
Measure your performance, both as a team and in total downtime / overall response effectiveness
Set ambitious goals for on-call and incentivize them
Admit where you are and know where you want to go. On-call engineers are more likely to be supportive if they all understand and share the vision of a better on-call culture
Try to avoid shadow on-call (an under-qualified person being on-call who has to immediately escalate to the “real” on-call engineer who’s “off”)
Don’t accept noise as a part of the landscape. Chase down unactionable alerts and eliminate them, now.
You will never think you have time for on-call improvements – you either make it a priority or you don’t, full stop.
Make on-call related work official by rolling it into your normal planning process and treating it just like any other work. The less you treat operational and on-call improvements like a separate category, the better.
Standardize around a common format or template for runbooks and wikis.
Have a central repository for action items and track how you assign them to teammates
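The points above about measuring response effectiveness and chasing down unactionable alerts can be made concrete with very little tooling. What follows is a minimal sketch, not a definitive implementation: the `Incident` record, its field names, and the `summarize` helper are all hypothetical, and it assumes MTTA and MTTR are both measured from the moment the alert triggered. Your own incident data will likely look different.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    alert_name: str
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime
    action_taken: bool  # False when the responder had nothing to do


def mean_minutes(deltas):
    """Average a non-empty list of timedeltas and return minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents):
    """Return MTTA, MTTR, and a per-alert count of unactionable pages.

    Assumes at least one incident; statistics.mean raises on empty input.
    """
    mtta = mean_minutes([i.acknowledged_at - i.triggered_at for i in incidents])
    mttr = mean_minutes([i.resolved_at - i.triggered_at for i in incidents])
    noise = Counter(i.alert_name for i in incidents if not i.action_taken)
    return {"mtta_minutes": mtta, "mttr_minutes": mttr, "unactionable": noise}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    start = datetime(2024, 1, 1, 2, 0)
    sample = [
        Incident("disk_full", start, start + timedelta(minutes=4),
                 start + timedelta(minutes=30), True),
        Incident("cpu_spike", start, start + timedelta(minutes=6),
                 start + timedelta(minutes=10), False),
    ]
    report = summarize(sample)
    print(report["mtta_minutes"], report["mttr_minutes"])
    print(report["unactionable"].most_common())
```

Reviewing a report like this in the team's regular meetings gives the group a shared baseline, and the alerts that most often page without requiring action are natural candidates for elimination.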
Creating a powerful on-call culture isn’t as much about technologies and techniques as it is about thoughtful approaches. Every team is built differently and there isn’t a one-size-fits-all on-call process that works for everyone. Hopefully these central tenets of a great on-call culture will help you think strategically about the way on-call affects not only the reliability of your systems, but also the livelihood of your teammates.
Try out VictorOps with a free, 14-day trial to learn more about making on-call suck less. See how you can leverage a collaborative incident response solution to improve on-call transparency and create more humane workflows while reducing MTTA and MTTR.