Dan Hopkins - April 10, 2018
VictorOps, like many startups, has gone through major growth in the last couple of years. New teammates, new customers, and a maturing organization have all demanded that we continue to raise the level of our service, which, in turn, requires improving the reliability of our system. Starting at the beginning of 2017, we focused this effort on building an SRE team.
Initially, this team was formed with some of our top-performing, self-starting engineers and was given a simple mission: figure out how to build a scalable way for us to understand the impact our deployments were having on customers.
We bought books, gave the team dedicated time to go and learn, and were even able to bring in SRE leaders from other organizations. We interviewed people from Twitter, Google, GitHub, and Netflix who had successfully built SRE teams in the past to figure out how they worked. In addition to learning quite a bit about Game Days, we found two schools of thought: centralized SRE and decentralized SRE.
The benefit of centralized SRE is that dedicated, specialized resources can achieve a high degree of skill. This expertise is likely higher than what’s achieved while also working on other activities such as product development. A dedicated team also has the bandwidth to expand the scope of the practice, research tools and best practices, and improve its ability to watch the system for misbehavior.
The downside of centralized SRE is that responsibility for the system’s reliability begins to migrate solely to the SRE team. The people creating the monitored system or application are no longer responsible for producing quality metrics that allow observability. Of course, this division of responsibilities is not impossible to overcome, but it does concentrate accountability into a single team.
A decentralized approach has some of the same benefits and problems in reverse. The team as a whole continues to own reliability and builds new features with it in mind, but its overall SRE skill suffers because that skill competes with product work for attention.
We decided on a hybrid of the two models. We broke the work into two components: leadership, which we centralized, and execution, which we decentralized. We moved one engineer who was originally part of the SRE research into a coaching and facilitation role. This person fosters a council of members from different areas of the engineering team who get together and talk about what’s going well and what’s not with the monitoring, observability, and general reliability of the system. The council is in charge of designing experiments to be run during our Chaos Days. Its members also discuss tooling and specific stories to be worked on in future increments (an artifact of our scaled Agile process).
We decentralized the work and operational tasks into the teams. Our product engineering teams know better than a centralized SRE team how their code will break in production. They know the best places to add metrics and how to produce dashboards that are meaningful for the entire team.
You can see the whole story by downloading our free eBook, “Build the Resilient Future Faster: Creating a Culture of Reliability”.
Overall, this team has been extremely successful at building a shared vision for SRE and executing on it. The biggest win I’ve noticed so far is how much the level of reliability has improved since we built this team. Moreover, the level of responsibility and skill across the entire engineering team has also risen. Conversations about adding new metrics now happen much earlier in the development cycle.
I believe this collaborative approach to building and running an SRE practice has created huge wins for our customers, our company, and our teams.