Site reliability engineering (SRE) has gained traction as an effective engineering practice over the last few years. In IT operations and software development, DevOps and SRE are often treated as if they were the same thing, but they aren’t. While the job responsibilities can look similar from one organization to the next, there are a few fundamental ways in which site reliability engineering differs from DevOps.
DevOps and SRE are best viewed as complementary disciplines. Keep this in mind when assessing some of the basic differences between them:
DevOps is more of a mindset focused on speeding up the software development lifecycle (SDLC) and tightening collaboration between IT operations and software engineering teams. DevOps deepens developer exposure to production systems and allows operations teams to more easily escalate major problems to the development team. In fact, SRE teams are an integral part of building proactive testing, observability, service reliability and speed into a DevOps-centric organization.
SRE is a way of identifying system weaknesses, testing production environments and resolving problems before they become major incidents. As part of a DevOps-focused team, SRE improves the reliability of technical services through deeper collaboration and proactive optimization of redundancy, monitoring and alerting practices.
Google’s SRE eBook states that Ben Treynor, Senior VP overseeing technical operations at Google, originally came up with the term “site reliability engineering.” And, in Ben Treynor’s words, “[SRE is] what happens when a software engineer is tasked with what used to be called operations.” SRE teams are constantly balancing delivery speed with the reliability of the underlying system. It’s a way to bring software development expertise into operations roles so teams can proactively write code and develop services to improve the reliability of the system.
Building an SRE structure will alleviate much of the siloing that exists between developers and IT professionals and help reduce the stress put on operations teams. So, as you can see, DevOps isn’t replacing SRE – it’s making it better. Let’s dive into the ways DevOps and SRE differ and how they work together.
Really, this section shouldn’t be titled “SRE vs. DevOps” but “SRE in DevOps.” Where does an SRE team fit into a DevOps culture of continuous improvement? Because there isn’t one single way to implement DevOps, you can think of DevOps as a mindset, whereas SRE is more like a role.
Let’s review some of the core tenets of DevOps and some SRE job responsibilities that align with those principles:
An SRE team will be exposed to development, deployment, configuration, orchestration and everything in between. Because of the SRE team’s exposure to numerous services and their understanding of both developer and IT operations responsibilities, they can help spread system knowledge across the broader team and improve visibility of the entire system. Over time, site reliability engineering helps spread a deeper understanding of systems both in production and in development to everyone across the team – helping speed up software delivery and incident management workflows.
SREs sit right in the middle of developers and operations teams – helping bridge the gap between the job roles. They implement new tools and techniques to help developers and operations teammates communicate better and understand each other’s roles. Many times, an SRE can be an excellent point of communication because they normally understand much of what’s happening in the system – both in development and in production.
Site reliability engineers can (and should) share on-call responsibilities with developers and IT operations teams. SRE teams will take ownership of the code they write and the services they maintain or build. Then, when things go wrong, everyone shares responsibility for the service they’ve created.
SREs use the same technology stack as everyone else on the DevOps team and help build new services to improve the observability of the system. The best SRE teams are open to criticism and collaboration from the broader team, which leads to further insight into system and process problems. SRE teams also measure as much as they can, tracking SLIs, SLOs, SLAs and other important incident management KPIs over time to make sure service reliability is improving.
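To make the SLI and SLO tracking described above more concrete, here is a minimal sketch of an availability SLI and the error budget left against an SLO. The function names, the 99.9% target and the request counts are illustrative assumptions, not figures from any real service or tool.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (the SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int, slo_target: float) -> float:
    """Share of the error budget still unspent: 1.0 is untouched, below 0 is overspent."""
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% SLO leaves no budget at all; any failure overspends it.
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Hypothetical 30-day window: 1,000,000 requests, 400 failures, 99.9% SLO.
sli = availability_sli(999_600, 1_000_000)
budget = error_budget_remaining(999_600, 1_000_000, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {budget:.0%}")
```

With these assumed numbers, the SLO allows roughly 1,000 failed requests in the window, so 400 failures leave about 60% of the budget unspent. Tracking that figure over time is one simple way to see whether reliability is trending in the right direction.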
SREs should constantly look for ways to improve the system and automate manual processes. As automation increases and fewer steps depend on error-prone manual work, service reliability goes up. So SRE teams that strive to automate workflows across the entire software delivery lifecycle will improve the speed and resilience of your DevOps and IT teams, along with their quality of life.
The SRE team’s entire job is to constantly improve the reliability of the system and the resilience of the team building and maintaining the system. So, continuous improvement is inherent to SRE operations in a DevOps-centric organization.
As you can see, SRE and DevOps certainly overlap, but they’re not the same. Because DevOps-oriented teams are never structured the same way, SRE needs to be customized to the specific situation. In a previous post, we took a deep dive into how our team developed the SRE council, but thought a quick recap here would be good food for thought.
In order to optimize the time of our software developers, IT and operations teammates, we created the SRE council. The SRE council is made up of individual members from different areas of the engineering and support team (data, middle tier, platform, web, mobile, etc.). By bringing multiple viewpoints and expertise into one room, the SRE council can conduct post-incident reviews, collaborate quickly and assess ways to improve the reliability of the system.
It was important to us that SRE operations were not siloed. Integrating SRE into normal DevOps workflows helps you increase the reliability of your systems without reducing speed. At least, that’s what works for our team. Make sure you think comprehensively about your tools, processes and people when developing your SRE team. You need to plan out exactly how SRE will integrate with developers and IT operations teammates in order to implement SRE successfully.
When you first start implementing SRE, you’ll need to be patient. It can take time for teams to buy into the process, stop thinking of SRE projects as additional work, and see the value of SRE for employee morale and service resilience. But, over time, a DevOps team dedicated to SRE will reduce downtime, decrease application errors, improve customer experiences and make on-call better for incident responders.
A group of people dedicated to running chaos engineering projects and stress tests is essential to the proactive development of reliable applications and infrastructure. SRE teams can improve cross-functional visibility into system health and identify problems before they turn into incidents. Through synthetic and real-user monitoring, SRE teams can run custom tests through their entire architecture to find bugs and errors before they affect real users.
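As a rough illustration of the synthetic monitoring mentioned above, the sketch below runs a scripted probe and records whether it succeeded and how long it took. The `run_synthetic_check` helper, the `CheckResult` type, the fake probe and the latency threshold are all hypothetical names and values chosen for this example; a real setup would point the probe at an actual endpoint and feed the results into your monitoring and alerting tools.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    ok: bool
    latency_seconds: float
    error: str = ""


def run_synthetic_check(name: str, probe: Callable[[], None], max_latency_seconds: float) -> CheckResult:
    """Run one probe; it passes only if it raises nothing and finishes within the latency limit."""
    start = time.perf_counter()
    try:
        probe()
    except Exception as exc:  # any failure inside the probe counts as a failed check
        return CheckResult(name, False, time.perf_counter() - start, repr(exc))
    latency = time.perf_counter() - start
    if latency > max_latency_seconds:
        return CheckResult(name, False, latency, "latency threshold exceeded")
    return CheckResult(name, True, latency)


def fake_login_probe() -> None:
    """Stand-in for a scripted user journey, such as a request to a login page."""
    time.sleep(0.01)


result = run_synthetic_check("login", fake_login_probe, max_latency_seconds=0.5)
print(result)
```

Treating a slow response as a failure, not only an exception, is a deliberate choice in this sketch: a page that technically loads but takes too long is still a bad experience for a real user.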
SRE improves the observability and reliability of your services through a combination of simulated tests, constant monitoring of system health and detailed post-incident reviews. Without SRE teams, on-call work becomes purely reactive, which tends to lead to negative customer experiences, reduced revenue and higher employee turnover.
Combine the power of SRE with DevOps to proactively build reliable services – leading to greater operational efficiency, business value and overall happiness for everyone involved.