What is SRE to Me?


What’s SRE to you? It’s hard to define because SRE looks different to every team, company, and individual. So, we wanted to chat with a number of people on our SRE council to find out what SRE means to them.

We previously wrote that SRE is not a dedicated role; it’s an organizational behavior. For us, much of SRE boils down to a willingness to learn, experiment, and continuously improve our process in order to consistently deliver reliability.

What is SRE to Us?

Let me preface this by saying that SRE continues to evolve within VictorOps. An effective SRE structure at another organization may look completely different, but there are typically a number of similarities between organizations that proactively approach reliability.

At VictorOps

Over time, VictorOps has shifted how SRE is structured internally. After a couple of iterations of initial SRE infrastructure, we created the SRE council, which is the structure we use today. The council consists of members from different areas of the engineering team (data, middle tier, platform, web, mobile, etc.). These members convene once a week to discuss baking reliability into the product and how to improve SRE cross-functionally.

When building our own SRE strategy, we brought in SREs from Netflix (Dave Hahn), Twitter (Matt Getty), and Craftsy (Matthew Boeckman) to better understand what SRE looked like in their organizations and teams. Then we presented to management, engineering, non-engineering, and product teams to get full organizational buy-in for SRE efforts. Now, SRE is a full-time effort at VictorOps.

Because SRE is an integral part of our DevOps culture dedicated to accountability, reliability, and collaboration, I sat down with three members of our SRE council to discuss what SRE means to them.


What is SRE to Me? (Interviews with the Team)

Jonathan Schwietert (Platform Engineer & SRE Team Lead)


Q: What do you like about the SRE council?

A: It prevents throwing reliability over the wall and protects our DevOps mentality. The people on the SRE council are developers who’ll be writing the next feature, so it’s important that we’re thinking concurrently about reliability while building features.

Q: So what does SRE mean to you, personally?

A: SRE is a way to improve cross-functional collaboration and visibility to help add reliability into everything we build. SRE provides an avenue to direct the hunger that we have for reliability.

Q: What is the ideal SRE team structure to you?

A: There’s no ideal structure. Implementing an SRE structure depends on a large number of factors, including the size of the organization, the product, the people, and the culture of the organization. SRE will look vastly different for a young, fairly unstable product than for a mature, more stable one.

If your product is fairly unreliable, you’ll have a more retroactive SRE structure, whereas a more reliable product can support a more proactive SRE approach. There’s always a ratio to balance between proactive and retroactive SRE, depending on where you are in your product’s development and life cycle. Because of this, there simply isn’t a one-size-fits-all approach to SRE.

Q: What’s your favorite part of SRE?

A: SRE shines a light on the unknowns in a running system. It eliminates fear associated with the unknown. SRE brings awareness to the maintainers of a system and builds confidence toward handling unknown issues.

Q: How do you measure SRE efforts and effectiveness?

A: SRE is a tough thing to measure. But I’d measure it as the stability of a system over time, along with the frequency and severity of incidents. Also, if you think of retroactive SRE as the bottom of a scale and fully proactive SRE as the top, where you fall along that scale is a good measurement of SRE’s effectiveness. You can also measure SRE based on lower-level monitoring metrics and SLOs.
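As a rough illustration of the measurements Jonathan mentions (incident frequency, severity, and SLOs), here is a minimal Python sketch. Everything in it is an assumption made for the example, not a description of VictorOps tooling: the `Incident` record, the severity labels, and the idea of tracking downtime in minutes are all hypothetical, and it assumes incidents do not overlap in time.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    severity: str            # hypothetical labels, e.g. "critical" or "warning"
    downtime_minutes: float  # assumed: user-facing downtime caused by the incident


def incident_counts(incidents):
    """Count incidents per severity label (frequency and severity together)."""
    return Counter(i.severity for i in incidents)


def availability(incidents, window_minutes):
    """Fraction of the window without recorded downtime.

    Assumes incidents do not overlap, so downtime can simply be summed.
    """
    down = sum(i.downtime_minutes for i in incidents)
    return 1 - down / window_minutes


def error_budget_remaining(incidents, window_minutes, slo_target):
    """Share of the SLO's allowed downtime still unspent (negative if overspent)."""
    allowed = (1 - slo_target) * window_minutes
    spent = sum(i.downtime_minutes for i in incidents)
    return 1 - spent / allowed


if __name__ == "__main__":
    month = 30 * 24 * 60  # a 30-day window, in minutes
    history = [
        Incident("critical", 12.0),
        Incident("warning", 0.0),
        Incident("critical", 9.6),
    ]
    print(incident_counts(history))
    print(availability(history, month))
    print(error_budget_remaining(history, month, slo_target=0.999))
```

Tracked over successive windows, numbers like these give a trend for "stability over time" rather than a single score.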

Q: Do you think chaos engineering and game days are a necessary part of SRE?

A: Yes, it’s the proactive side of SRE. It’s important to get exposure to multiple parts of the system before a real incident happens. Everyone on the team is more prepared for an issue this way.

Q: Any additional comments?

A: I just want to reiterate that what we’ve done for SRE at VictorOps is what fits our DevOps culture best. This system works for us, but that doesn’t necessarily mean it will work for everyone else. And, of course, we’re continuously improving and adapting our SRE practices as we learn and grow.

DeAndre Carroll (Software Engineer - Middle Tier)


Q: How long have you worked in SRE? Is VictorOps your first time?

A: I’ve worked in SRE before VictorOps, but not in a formalized setting. Part of my previous responsibilities involved hack-testing: trying to figure out how to exploit the system and use it in ways it wasn’t designed for. I was working to stop automation and exploitation from taking advantage of our platform. So, I spent some time looking at the site outside of the rules to check for robustness.

Q: How does your role specifically influence reliability in the system?

A: Well, middle tier is the first point of contact for a lot of external systems. We’re the gateway to a lot of systems: APIs, integrations, etc. Middle tier needs to build reliability in an inherently chaotic part of the system. Any server-side change that modifies the user experience and comes through the web client or mobile client goes through us. We’re influencing reliability on both the server side and the client side. Middle tier is always working to transfer data reliably, implement proper server-side checks, and optimize the client side for good UI and experience.

Q: If you could, what would you like to do or change about SRE at VictorOps?

A: Getting more buy-in would be great. That’s why we’re adding rotations to our SRE Council to give more people exposure to SRE. If the SRE Council stays static for too long, it can become a silo, which is not conducive to our DevOps nature.

Q: In addition to reliability, do you believe SRE improves collaboration and development speed?

A: Collaboration? Yes. Development speed? Maybe. Over time I believe that SRE will improve development speed, but it’s a highly front-loaded process. It takes time to set up the monitoring tools, alerts, and processes to make SRE more proactive. But once you’re in a good place, it allows for more time to be spent on development.

Q: What’s your favorite part of SRE?

A: My favorite part is being able to share information across all our teams and improve visibility across the entire system.

Q: Do you think chaos engineering and game days are a necessary part of SRE?

A: Yes. Having formalized chaos engineering processes lets teams break away from everything else they’re doing and focus in on making the system robust. You can get real results in real time.

Q: Any favorite tools for SRE?

A: JMeter. JMeter is a good automated testing and load testing tool.

Q: So, what does the term SRE mean to you?

A: SRE means challenging ourselves to make the system as reliable as we can. SRE is about getting out of your comfort zone, changing your thinking, and approaching problems from a different angle to make the system more reliable and robust. By finding the strengths and weaknesses of your system, SRE helps you make it more reliable for customers and easier to manage for the team.

Andrew Fager (Data Engineering Team Lead)


Q: So, what’s your role on the SRE Council?

A: I represent the data team. I work on building persistence, reliability, and resilience within our data engineering pipelines.

Q: How do you think your role on the SRE Council specifically influences reliability in the overall system?

A: Well, customers rely on our data to remediate incidents. My team is different from other teams in the sense that we need to ensure that our data is in a reliable and persistent place so that our customers can analyze it after the fact. We can work to find ways to break the data pipeline, but ensure that the data is retained. This way, we can build more robust data pipelines for ourselves and our customers.

Q: What do you like about the SRE Council?

A: Different perspectives. Everyone on the Council has a different expertise and way of approaching issues. It’s interesting to learn how our whole system is interconnected, how we can improve cross-functional communication, and the way that the reliability of one team can affect another.

Q: What is the ideal SRE team structure to you?

A: Blue team, red team type game days are great. Your team will learn a lot through simulated chaos on the system. Malicious actors on one team can take things down while the other team responds to the issue in a realistic on-call scenario.

Q: How do you measure SRE effectiveness?

A: Well, with the data engineering team, there are usually more lagging metrics and fewer actionable ways to measure SRE’s effectiveness. We can essentially measure the effectiveness of SRE based on the speed and consistency of data in ETL. If ETL lags, how can we bring it up to speed quickly? How can we build a more resilient ETL pipeline?
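To make the ETL lag idea concrete, here is a small Python sketch of one way such a measurement could look. It is an illustration under stated assumptions, not the VictorOps pipeline: the function names, the use of event timestamps to define lag, and the simple rate-based catch-up estimate are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def etl_lag(latest_source_event: datetime, latest_loaded_event: datetime) -> timedelta:
    """How far the loaded data trails the source; zero when fully caught up."""
    return max(latest_source_event - latest_loaded_event, timedelta(0))


def is_lagging(lag: timedelta, threshold: timedelta) -> bool:
    """True when the lag exceeds the (assumed) freshness threshold."""
    return lag > threshold


def catch_up_seconds(backlog_records: int,
                     process_rate: float,
                     ingest_rate: float) -> Optional[float]:
    """Estimate seconds until the backlog drains, given records-per-second rates.

    Returns None when processing is not faster than ingestion, because the
    backlog would then never shrink.
    """
    net_rate = process_rate - ingest_rate
    if net_rate <= 0:
        return None
    return backlog_records / net_rate


if __name__ == "__main__":
    source = datetime(2024, 1, 1, 12, 0)
    loaded = datetime(2024, 1, 1, 11, 45)
    lag = etl_lag(source, loaded)
    print(lag, is_lagging(lag, timedelta(minutes=10)))
    print(catch_up_seconds(6000, process_rate=150.0, ingest_rate=100.0))
```

The catch-up estimate speaks to the "how can we bring it up to speed quickly" question: if the net rate is small or negative, adding processing capacity matters more than waiting.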

Q: What’s your favorite part of SRE?

A: I love the collaboration it brings. SRE has great data-driven aspects to it. You need visibility into the system and its performance metrics in order to allow people to look at the data and take action on it. Also, game days are a fun part of SRE that can give cross-functional knowledge to team members so incident resolution becomes easier.

Q: So what does the term SRE mean to you?

A: Whole system reliability. A lot of it is about working to mitigate and prepare for unknown unknowns. There’s no way you can know everything that could happen on your site, so it’s important to be able to react quickly in a non-fatal way.

