Service reliability should be a core responsibility for every software development and IT team. Due to the need for resilience in a world of continuous integration and delivery (CI/CD), more teams are adopting DevOps principles to improve cross-functional collaboration and transparency. And, on top of DevOps adoption, teams are now proactively addressing reliability with site reliability engineers (SREs).
Site reliability engineers are basically software developers who apply their skills to IT infrastructure and application problems – helping build new features and services to address system reliability concerns. So, site reliability engineering managers are tasked with prioritizing projects, improving system observability and making sure SRE teams stay on task. SRE managers have one task above all else – ensure resilience across all systems and DevOps workflows.
Before we look at more specific responsibilities of SRE managers, let’s first look into the importance of the SRE discipline and how teams are using it:
According to Wikipedia, the discipline of SRE began in 2003, when Google hired an engineer named Ben Treynor to lead a team of seven software engineers in charge of running one of its production environments. In a Google interview, Treynor defined SRE this way: “Fundamentally, it’s what happens when you ask a software engineer to design an operations function.” Applying developer expertise to IT operations problems can lead to improved observability and more proactive solutions for incident detection, response and remediation.
As the adoption of DevOps and SRE continues to grow, team structures, processes and technology are changing too. The DevOps methodology continues to help teams improve feedback loops between software developers and IT operations while SRE is helping DevOps teams proactively build reliability into their services without disrupting the CI/CD pipeline. Site reliability engineering managers need to keep this breakdown in mind while structuring their teams, executing projects and identifying priorities.
So, let’s take a look at some of the core day-to-day responsibilities you can expect as a site reliability engineering manager:
The SRE manager is in charge of managing a team of people dedicated to proactively building reliability into the product. Because reliability in highly complex, integrated systems typically spans multiple programming languages, third-party services and integrations, as well as software and hardware, an SRE team needs to be multi-talented. Each individual on an SRE team should be highly skilled in one or two fields and have broad knowledge of many other IT operations and software development skills.
So, of course, SRE managers also need a breadth of knowledge and an ability to pull different disciplines together for one common goal – proactively building resilience into IT infrastructure and applications.
Along with the normal administrative work required from a people manager, SRE managers need to know how different disciplines can come together on an SRE team. Oftentimes, SRE teams will act somewhat independently from other engineering teams and need to work with some level of autonomy. But, it’s important that site reliability engineering managers are also well-connected with the broader IT, engineering and business teams – staying up-to-date on feature development and how it could affect the system’s overall reliability.
Service-level objectives (SLOs), service-level agreements (SLAs) and service-level indicators (SLIs) are essential to SRE teams. The site reliability engineering manager will define what it means for the system to be ‘available’ and set the availability SLO, the internal target the system is held to. Then, the SRE manager needs to provide an SLA to business teams and the rest of engineering to show how much availability they can promise to customers. From there, the team can start to track SLIs, the measurements that show whether the system is meeting the required percentage of availability.
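To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It assumes availability is measured as the fraction of successful requests in a window, which is only one possible definition of ‘available’. The function name `availability_report` and its parameters are hypothetical and chosen for illustration.

```python
from fractions import Fraction


def availability_report(total_requests: int, failed_requests: int, slo_target: str) -> dict:
    """Compare a measured availability SLI against an SLO target.

    Assumption: availability = successful requests / total requests.
    slo_target is a decimal string such as "0.999" so the math stays exact.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    slo = Fraction(slo_target)
    # SLI: what the system actually delivered in this window.
    sli = Fraction(total_requests - failed_requests, total_requests)
    # Number of failures the SLO tolerates in this window (the error budget).
    allowed_failures = (1 - slo) * total_requests

    return {
        "sli": sli,
        "slo": slo,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
        "meets_slo": sli >= slo,
    }


report = availability_report(total_requests=1_000_000, failed_requests=400, slo_target="0.999")
print(f"SLI: {float(report['sli']):.4%}, meets SLO: {report['meets_slo']}")
print(f"Error budget remaining: {report['budget_remaining']} failed requests")
```

A negative `budget_remaining` means the window has already missed the SLO, which is one signal a team might use to shift priority from feature work to reliability work.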
Site reliability engineering managers are also in charge of project planning and task prioritization. It’s important that SRE managers sit in on quarterly planning and sprint planning with the greater engineering and IT teams. Then, the site reliability engineering manager can assess the key objectives for the next few sprints and pass those priorities to the rest of the SRE team. This way, the SRE team can begin building features and functions that proactively monitor the health of new features, communicate observations to the rest of the team and add reliability to the overall architecture.
Now, this isn’t always the case, but SRE managers are often tasked with optimizing the overall on-call process. And, if they aren’t in charge of the broader team’s on-call incident response workflow, they’re at least managing the SRE team’s on-call rotation (as long as SRE is part of the larger on-call organization). But, because incident response is such a major part of maintaining uptime and running reliable services, these responsibilities should usually fall on the SRE manager. SRE teams have so much input and historical knowledge across the entire system that they’re likely the best-equipped team for setting on-call rotations, alert rules, communication methods and incident response plans.
What are the best ways for the team to communicate during development, deployment and incident management? Is it always through the same channel? Or, do the channels differ by time of day, team, or the issue that’s being discussed? Do you need to spin up a new conference call or a Slack channel for every issue that comes up? Then, how do you track historical communication and pull that information into post-incident reviews? A site reliability engineering manager has visibility into how teams across engineering and IT are working and can establish communication best practices throughout the entire software development lifecycle. Then, the SRE team will track the effectiveness of these practices and iterate when necessary.
More technically, the SRE manager will be tasked with improving the overall observability of the team’s applications and infrastructure. In Google’s SRE eBook, they laid out the four golden signals of SRE monitoring: latency, traffic, errors and saturation. While these signals are only the first step toward a highly observable system, implementing them is a great starting point for any SRE manager. Without observability, site reliability engineers will have a difficult time identifying areas for improvement, prioritizing future work and learning from the way their system behaves.
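As a rough illustration of the four golden signals, the Python sketch below derives them from a window of request samples. The `Request` record, the `golden_signals` function and the choice of CPU utilization as the saturation measure are assumptions made for this example. A real system would typically pull these values from its monitoring stack instead of computing them by hand.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request failed


def golden_signals(requests: list[Request], window_seconds: float, cpu_utilization: float) -> dict:
    """Summarize latency, traffic, errors and saturation for one time window.

    Assumption: saturation is approximated by CPU utilization (0.0 to 1.0);
    memory, I/O or queue depth may be a better fit for other services.
    """
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")

    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, computed with integer math: ceil(0.95 * n).
    rank = (95 * len(latencies) + 99) // 100
    failures = sum(1 for r in requests if not r.ok)

    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failures / len(requests),
        "saturation": cpu_utilization,
    }


samples = [Request(100.0, True)] * 18 + [Request(900.0, False)] * 2
print(golden_signals(samples, window_seconds=10, cpu_utilization=0.7))
```

Using a high percentile for latency instead of an average keeps slow outliers visible, which matters because a small share of slow or failed requests can still mean a poor experience for many users.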
Last but not least, site reliability engineering managers should take advantage of chaos engineering principles and proactively run tests through their applications and infrastructure. By learning about their technical systems through chaos engineering and using game days to practice the human element of incident response, SRE teams can increase system reliability at every turn.
Well, as we mentioned before, Google actually wrote the book on SRE. The SRE team at Google is monitoring more of its system’s metrics, improving cross-functional collaboration and automating manual tasks. By moving from manual sysadmin work to automated commands and software systems that solve operations problems, Google’s SRE team is helping engineering teams move faster while simultaneously improving overall reliability. From bettering the monitoring stack to creating SLOs, SLAs and SLIs – Google is improving visibility and collaboration across the organization to drive business value through more resilient systems.
Over the last few years, Airbnb has needed to scale rapidly and reliably. Seconds of website downtime can lead to tens of thousands of dollars in lost revenue – not to mention poor customer experiences. In a ZDNet interview with Cameron Tuckerman-Lee, a site reliability engineer at Airbnb, Tuckerman-Lee said the SRE team “makes sure that the entire site is reliable and available, and we do that by supporting the other teams that own their own applications.” A combination of DevOps philosophies and an integrated SRE team has led to a company that can make multiple reliable deployments per day – helping Airbnb become one of the fastest growing companies of the last few years.
In 2016, Uber released a blog post about the way they approach SRE. While the company has certainly changed over the years, their goals in SRE remain the same – “managing system complexity over time.” The team was constantly striving to show the reality of the way their systems and people operated and then create repeatable processes that ensure reliability without hindering speed or scalability. As you can also imagine, SRE managers at Uber were consistently concerned with observability. Like Airbnb, Uber went through a period of immense growth – creating the need for an SRE team that could allow for scalable, rapid deployments without hindering service reliability and customer experiences.
A post from a Dropbox SRE, Krishelle Hardson-Hurley, came out a few years ago and continues to get a lot of traction. She discusses the journey from front-end engineer to site reliability engineer on Dropbox’s monitoring team and what she likes about being an SRE. As Tammy Butow, the site reliability engineering manager at Dropbox, said, “SREs are software engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.” And, because Dropbox is a file storage system, availability and uptime are essential to its customers, making SRE one of the most important disciplines at the entire company.
As with many others on this list, LinkedIn released a post about their own SRE culture in 2017. In the article, they quote David Henke, Head of Engineering and Operations at LinkedIn, on promoting an SRE mindset of “attacking the problem, not the person.” The team is hyper-focused on building a blameless culture centered on continuous improvement. In the early years of LinkedIn, they spent a ton of time simply responding to incidents and “fighting fires.” So, as time went on, they found it of utmost importance to proactively identify issues, improve monitoring and alerting practices and reduce the number of production incidents that come up. Over time, LinkedIn’s SRE team improved the overall resilience of their systems through a constant dedication to efficiency, automation and collaboration.
At first glance, being a site reliability engineering manager might seem overwhelming. But, an SRE mindset should exist throughout all of engineering and IT – not only on a dedicated SRE team. In fact, many DevOps-minded organizations don’t have a siloed SRE team and are actively integrating SRE roles into all of engineering. For example, we created an SRE council and rotated people in and out on a quarterly basis, bringing together stakeholders from multiple engineering teams to improve cross-team collaboration and visibility.
SRE managers should have a passion for technical expertise, collaboration and organizational transparency. Then, the site reliability engineering manager’s main duty is to spread this passion amongst the team. As more people buy into the SRE process, it becomes ingrained in the organizational culture – making reliability a core principle for all business and engineering operations.
See the story of how we implemented our own culture of DevOps and gained organizational buy-in for SRE practices. Download the free eBook, Creating a Culture of Reliability, to learn from our own story and start bolstering SRE at your own organization.