Site reliability engineering (SRE) is a discipline that continues to gain traction in software development and IT. SRE was first implemented by Ben Treynor, VP of Engineering at Google, and popularized through Google’s SRE eBook. SRE sits at the crossroads of software development and IT operations – or in Ben Treynor’s words, SRE is “what happens when you ask a software engineer to design an operations function.”
Site reliability engineering is a way for developers to actively build services and functions to improve the resilience of people, processes and technical systems. SRE lives somewhat in the shadows – contributing greatly to the team’s overall productivity and the reliability of the team’s applications and infrastructure. If constantly improving the efficiency and resilience of the software delivery lifecycle appeals to you, then you should look at working in SRE.
So, we’ve put together this site reliability engineer interview guide so you know what to expect when heading into your next SRE interview.
A site reliability engineer is essentially a mix of a software developer and a traditional operations engineer. Like IT professionals, SREs are highly skilled at identifying weaknesses and blind spots in their infrastructure and systems. But, unlike traditional IT, SRE teams also have the autonomy and ability to write and deploy code that proactively fixes problems and avoids incidents.
SRE inherently feeds into a forward-thinking, efficient DevOps culture. By taking the time to identify reliability concerns and building a team dedicated to addressing them, you’ve already started to shift reliability and testing further left into the development lifecycle. Additionally, SRE helps feed IT concerns and information back into the development teams – leading to faster, more resilient software development.
SRE helps break the stereotype that developers don’t take accountability for the services they build. Along with DevOps methodologies, SRE helps bridge the gap between IT and developers. And, even if your team still believes in the “throw-it-over-the-wall” mentality between traditional IT and development, SRE teams can still retroactively add value to your systems. By running tests in production and continuously adding new functionality dedicated to resilience, SRE teams constantly find new ways to make people, processes and technology better.
The first question you need to ask yourself is, “Do I want to work as an SRE?” And, in order to answer that question, you need to know what you’re getting into. Even before you start interviewing for that next SRE role, you should know what common responsibilities fall in the SRE realm.
Site reliability engineers are tasked with writing and deploying code that will improve the resilience of applications and infrastructure. But, SREs are also in charge of improving visibility into system health – leading to deeper insights and helping teams better prepare for incident response and remediation. Not only are engineers made aware of issues more quickly but customer support and other business teams can take action to proactively address customer concerns.
Having an SRE team on hand is greatly beneficial to rapid remediation of support escalation incidents. Site reliability engineers have the most visibility into the entire system and they have software development skills – helping them quickly identify an incident’s root cause and actively solve the problem. So, a dedicated SRE team means fewer support escalations will occur as time goes on and it also means support cases won’t sit in an inactive queue for long periods of time.
Being part of an SRE team doesn’t always mean you’re on-call. But, oftentimes, SRE teams are tasked with being on-call and also managing the on-call experience. Because of an SRE’s insights into the entire stack, they’re typically the best for creating a holistic system for proactive on-call incident response. They can tweak monitoring thresholds and tools, update on-call schedules and make constant improvements to escalations and incident response.
Site reliability engineers should be the be-all and end-all for system observability and incident response. The SRE team is best equipped to see issues across the entire service, so it needs to maintain as much context and historical information as possible. By keeping accurate, detailed documentation and continuously improving the reliability of production systems, SREs can share “tribal” knowledge across software development and IT teams. This gives all teams a better understanding of how services interact with each other in production.
Skipping post-incident reviews, or not conducting them thoroughly, can lead to lower reliability and a lack of incident preparedness. By continuing to hold post-incident reviews for problems in production, you learn more about how your system works and can improve upon weaknesses. SRE teams are often tasked with maintaining a culture of continuous improvement by conducting blameless post-incident reviews that uncover problems in the system – helping the team identify ways they can build deeper reliability into their architecture and processes.
While every engineering and IT organization is built differently, there are a few common questions you can expect during an SRE interview. Below, we’ve curated a list of questions and answers to help you prepare when heading into an SRE interview.
The answer to this question will vary from team to team. But, generally, it’s an opportunity for you to highlight the importance of SRE and how you’ve used site reliability engineering in the past to bolster resilience and productivity. Some organizations will have dedicated DevOps teams, while others will simply follow DevOps methodologies. You’ll satisfy the interviewer as long as you’re thoughtful about the way you’ve used SRE in the past and how you see it contributing to overall reliability and efficiency in IT and software development in the future.
As in most other job interviews, it’s important to show why you’re excited about the role. SRE isn’t always viewed as the most glamorous role, and many developers will shy away from it. So, it’s important to speak to why you’re excited about building services that improve system reliability and lead to greater customer and employee happiness. Being part of an SRE team should excite you because you’ll be able to make a large impact that affects everyone from product managers to end users.
At first, this seems like a simple question – but it’s a loaded one. The interviewer wants to determine your ability to analyze your deployment pipeline and make intelligent decisions for changing it. SRE teams are crucial for identifying monitoring deficiencies, deployment bottlenecks and surfacing reliability concerns to the applicable parties. Being able to determine where your team can make the biggest improvements to resilience without drastically affecting employee productivity or process will show that you’re able to problem-solve at a high level.
This is an excellent technical question for determining how you’ve set up monitoring and alerting tools in the past and how you’ve helped define the “healthy” state of a system. If you want to join an SRE team, you’ll need to understand how you can leverage both internal and external outputs to determine overall system health. Then, you should be able to translate that information into insights and action for IT and engineering teams.
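If you’re asked to make this concrete, a small example can help. The following is a minimal Python sketch of one way to combine internal signals (error rate, latency) with an external signal (a synthetic probe run from outside the system) into a single health state. The names `HealthSignals` and `evaluate_health` and the threshold values are hypothetical, chosen only for illustration; real thresholds depend on your own service and its users.

```python
from dataclasses import dataclass


@dataclass
class HealthSignals:
    error_rate: float      # internal: fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # internal: 99th percentile request latency
    probe_success: bool    # external: did a synthetic check from outside succeed?


# Hypothetical thresholds, for illustration only.
MAX_ERROR_RATE = 0.01
MAX_P99_LATENCY_MS = 500.0


def evaluate_health(signals: HealthSignals) -> str:
    """Return 'healthy', 'degraded' or 'unhealthy' for one set of signals."""
    # A failed external probe, or an error rate far above the limit,
    # means users are very likely affected.
    if not signals.probe_success or signals.error_rate > MAX_ERROR_RATE * 5:
        return "unhealthy"
    # Internal limits exceeded, but the service is still reachable.
    if (signals.error_rate > MAX_ERROR_RATE
            or signals.p99_latency_ms > MAX_P99_LATENCY_MS):
        return "degraded"
    return "healthy"


if __name__ == "__main__":
    print(evaluate_health(HealthSignals(0.002, 180.0, True)))
```

Separating “degraded” from “unhealthy” is one possible design choice: it lets a team notify people of slow-burning problems without paging someone for them.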
This is a quick, obvious question. The interviewer wants to know if you’re familiar with the languages and technical systems you’ll need to use in order to do your job.
Because of SRE’s involvement in so many aspects of the engineering organization and business, it’s important that you can identify human bottlenecks in productivity. With this question, the interviewer is trying to determine how you would go about solving issues between cross-functional teams. Most of the time, it’s as simple as finding ways to improve the communication and visibility across different departments – helping people find the information they need when they need it.
Being a steward for on-call efficiency and quality of life will likely be a core responsibility for any site reliability engineer. So, for any SRE interview, it’s likely you’ll need to show how you would go about setting up a humane on-call experience. What can you do to improve the on-call experience? Make sure you address this question from the viewpoint that on-call isn’t simply about processes and tooling – but that people need to be a core focus when setting up your on-call rotations and alert rules.
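One people-focused detail you could talk through is how shifts are distributed. As a rough sketch, assuming weekly shifts and a simple round-robin order, the Python below builds a rotation in which nobody is scheduled for two weeks in a row. The function name `build_rotation` and the engineer names are hypothetical; on-call tools typically provide their own scheduling features, and this is not meant to represent any of them.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return a list of (shift_start_date, engineer) pairs, one per week."""
    if len(engineers) < 2:
        # With one person there is no way to avoid back-to-back shifts.
        raise ValueError("need at least two engineers for a humane rotation")
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        # Round-robin: cycle through the list so the load is spread evenly.
        schedule.append((shift_start, engineers[week % len(engineers)]))
    return schedule


if __name__ == "__main__":
    for shift_start, engineer in build_rotation(
        ["ana", "ben", "chi"], date(2024, 1, 1), 6
    ):
        print(shift_start.isoformat(), engineer)
```

A real rotation would also need to account for time off, time zones and shift swaps, which is where the focus on people matters most.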
Being an SRE can be one of the most fulfilling roles you’ll ever have on an engineering team. You should have the autonomy to make organizational changes and run experiments that lead to greater reliability in the system. And, many times, you’ll find yourself in a position where you can make the lives of customers and colleagues much better. Also, you’ll become educated in a number of IT and software development disciplines – improving your knowledge of the entire software delivery lifecycle and making you a better developer.
Check out the story of how we adopted SRE and DevOps principles to bolster service reliability without slowing development speed. Download the free eBook, Build the Resilient Future Faster: Creating a Culture of Reliability to get even more insights into the evolution of SRE at VictorOps.