It’s 6 PM on a Friday and your database service just failed. You, the on-call engineer, are the sole team member left in the office. Your colleagues may or may not be paying close attention to Slack channels and email as they make their way to weekend destinations.
How do you figure out who can help you solve the database issue? If you don’t know who wrote or maintains the code that powers it, you’ve found yourself in a tough spot indeed.
This lack of visibility is a common challenge for modern organizations. Although, in theory, DevOps means everyone “owns” everything within the realm of software delivery, the reality is that few organizations can achieve this exact arrangement. It just isn’t practical for every engineer to master every part of the codebase or deployment.
That’s why DevOps teams need to establish policies for determining who takes service ownership of which parts of the application. Visibility into service ownership is critical for efficient, reliable software delivery. Below, we’ll go into detailed tips and tricks for approaching this cultural and organizational challenge of DevOps.
The DevOps engineer or engineers who own a particular component of an application have two primary responsibilities:
1) Understanding how the code and configurations associated with that component work, and being able to explain them to others, when necessary
2) Participating in incident response operations, either as the primary respondent or as a reference point for the on-call team
The above doesn’t mean the owner of a given component needs to be permanently on-call for supporting it, of course. That’s not realistic. But, when something does go wrong with a component and the on-call team needs help troubleshooting or resolving the issue, they need to know who to contact.
Unless you plan deliberately for this challenge, knowing who to contact isn’t always easy. At many organizations, having developers take on-call responsibilities isn’t part of the culture. Even where it is, developers don’t typically sign their code so that others can easily determine who wrote it. And even if they do, the person who wrote a given piece of code may no longer be working at the company. There’s no button you can press or spreadsheet you can open to figure out who understands how a given piece of code is supposed to work or how it should interact with other parts of the system.
Thus, when something goes wrong with a specific part of an application, it’s often difficult to quickly determine who has the expertise necessary to fix the issue.
This isn’t a problem you can solve reliably in an ad hoc fashion. Sure, if you have a small enough company, perhaps the engineers know each other and your code well enough for the on-call team to quickly identify the owner of an application component. Or, you could always try to figure out who wrote certain code by looking through revision history data. But, these approaches don’t scale; nor are they an efficient use of the on-call team’s time. When a service goes down, the last thing you want to have to do is scroll through commit logs trying to figure out who wrote it.
However, with the right tools and strategies, this challenge can be mitigated.
As a best practice, try to avoid a policy that requires whoever writes code to be solely responsible for owning it. As noted above, sole ownership is a bad idea because the original author of code could move on from your company or might not be available when a problem strikes.
Instead, establish a system of “code buddies”: every time one engineer makes a code commit, ensure that at least one other team member is aware of the change and understands what it does. Designating multiple buddies for each commit is even better.
This approach spreads service ownership across a group, rather than consigning it to an individual. It thus maximizes the chance that someone with the requisite knowledge can be reached when something goes wrong.
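One way to make the buddy system usable during an incident is to record it somewhere the on-call team can query. The following is a minimal sketch, not a prescribed tool: the `Ownership` class, the `reachable_owners` function, and the names in `REGISTRY` are all hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Ownership:
    """Who knows a component: the author plus one or more code buddies."""
    primary: str
    buddies: list[str] = field(default_factory=list)

    def contacts(self) -> list[str]:
        # Primary first, then buddies, skipping duplicates.
        seen: list[str] = []
        for person in [self.primary, *self.buddies]:
            if person not in seen:
                seen.append(person)
        return seen


def reachable_owners(registry, component, unavailable):
    """Return the people to contact for a component, skipping anyone unavailable."""
    ownership = registry.get(component)
    if ownership is None:
        return []  # Nobody is recorded as owning this component.
    return [p for p in ownership.contacts() if p not in unavailable]


# Hypothetical data for illustration only.
REGISTRY = {
    "database": Ownership(primary="alice", buddies=["bob", "carol"]),
    "billing-api": Ownership(primary="dave", buddies=["alice"]),
}

if __name__ == "__main__":
    # The original author is away, so the buddies are returned instead.
    print(reachable_owners(REGISTRY, "database", unavailable={"alice"}))
    # prints ['bob', 'carol']
```

Because each component lists more than one person, the lookup still returns someone when the original author has left the company or is unreachable.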
If, despite the buddy system, you still have trouble tracking code ownership, the git blame command is your next best recourse (assuming you use git, of course). Git blame shows who last modified each line of a file, which is a faster way to track down who has worked with a given snippet of code than trying to piece it together using logs.
The main limitation of git blame is that it can be used only on individual files. If the application or service you are troubleshooting doesn’t map neatly onto a single source file, git blame won’t get you very far.
So, while git blame can be a handy way to track ownership in a pinch, it’s not a complete ownership management solution on its own.
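If a service spans several files, one workaround is to run git blame on each file and add up the results. The sketch below assumes git is installed and uses the `--line-porcelain` output format, which repeats an `author` header for every line of the file. The function names `count_authors` and `blame_files` are hypothetical helpers, not part of git.

```python
import subprocess
from collections import Counter


def count_authors(porcelain: str) -> Counter:
    """Tally lines per author from `git blame --line-porcelain` output."""
    counts = Counter()
    for line in porcelain.splitlines():
        # Header fields start at column 0; file content lines start with a
        # tab, so source code containing "author " is never counted.
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts


def blame_files(paths, repo="."):
    """Run git blame on each path and merge the per-author line counts."""
    total = Counter()
    for path in paths:
        result = subprocess.run(
            ["git", "-C", repo, "blame", "--line-porcelain", "--", path],
            capture_output=True,
            text=True,
            check=True,  # Raise if git fails, e.g. the path is not tracked.
        )
        total += count_authors(result.stdout)
    return total
```

Calling `blame_files(["db/pool.py", "db/schema.py"]).most_common(3)` (the paths here are made up) would list the three people who last touched the most lines across those files. That is only a rough signal of who to ask, since the last person to edit a line is not necessarily the person who understands it best.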
Although git bisect isn’t designed to track ownership of code commits, it’s a great way of narrowing down which application change introduced a problem. You tell it one commit that is known to be good and one that is known to be bad, and it runs a binary search through the commits in between until it finds the first bad one. If you encourage your team to use git bisect, and to record what it finds in post-incident documentation, you can track down problematic application changes quickly.
Like git blame, git bisect isn’t a complete ownership management solution. But it’s a good way to quickly separate working code from broken code, even if you can’t reach the owner of the broken code.
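To illustrate the idea behind git bisect, here is a simplified sketch of the binary search it performs. This is not git's actual implementation: it assumes a strictly linear history, and `first_bad_commit` and `is_good` are hypothetical names. In practice, `is_good` would be whatever check tells you the service works, such as a test script.

```python
def first_bad_commit(commits, is_good):
    """Binary-search a linear history for the first commit where is_good fails.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid  # The breakage happened after mid.
        else:
            hi = mid  # The breakage happened at or before mid.
    return commits[hi]
```

With eight commits, this needs at most three checks instead of up to six, which is why bisecting is much faster than testing commits one by one. The real workflow uses `git bisect start`, `git bisect good`, and `git bisect bad`, or `git bisect run` with a script that exits non-zero on failure.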
Slack (and similar communication tools) can help you quickly figure out who knows what. Consider creating a Slack channel that includes all your engineers, whether they’re on-call or not. In the event the on-call team can’t figure out who owns something that’s causing a problem, they can reach out in Slack and (hopefully) track down the owner using your engineers’ collective mind.
This isn’t a systematic approach to documenting ownership, but it will help solve the incident response challenge in a pinch.
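If your alerting tooling can post to Slack, the "who owns this?" question can be sent automatically when an incident opens. The sketch below assumes a Slack incoming webhook has been configured for the shared engineering channel; the webhook URL is something you would supply, and `build_ownership_question` and `post_to_slack` are hypothetical helper names.

```python
import json
import urllib.request


def build_ownership_question(component, symptom):
    """Build an incoming-webhook payload asking who knows a component."""
    text = (
        f"Does anyone own or know the `{component}` service? "
        f"On-call needs help: {symptom}"
    )
    return {"text": text}


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON to a Slack incoming webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the message builder separate from the network call makes it easy to adjust the wording without touching the code that sends it.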
It’s difficult to find the person or team with the expertise you need to troubleshoot an application problem, especially given that not everyone knows everything about your systems. There’s no perfect solution to this challenge, but the right tools and strategies can help.
Chris Tozzi has worked as a journalist and Linux systems administrator. He has particular interests in open source, agile infrastructure, and networking. He is a Senior Editor of content and DevOps Analyst at Fixate IO. His latest book, For Fun and Profit: A History of the Free and Open Source Software Revolution, was published in 2017.