What to Know: Becoming a Reliability Engineer

Site reliability engineering (SRE) is a relatively new discipline, and its roles have only recently been defined. The practice was first defined and implemented by Ben Treynor, VP of Engineering at Google. In an interview, Ben defines SRE as “what happens when you ask a software engineer to design an operations function.” I would take it a step further and say that SRE is also what happens when you ask an IT or operations professional to take on software engineering responsibilities. Becoming a site reliability engineer depends heavily on your ability to continuously improve, learn new skills, and try new things.

DevOps principles are inherent to collaborative SRE teams with a culture of shared responsibility and code ownership. Becoming a reliability engineer means that you have an understanding of both development and operations, and you can use that knowledge to build reliable, flexible services.

If you’re new to the world of SRE and interested in becoming a reliability engineer, this article covers what you should know and suggests ways to learn these skills.

What is a Reliability Engineer?

To learn about being an SRE from the horse’s mouth, we previously interviewed a few members of our own SRE Council and asked what SRE means to them. Generally, though, an SRE is someone who can write code, conduct thorough post-incident reviews, and maintain highly available, performant systems. To understand reliability engineers, we first need to examine the historical divide between IT operations, system administrators, and developers:

SysAdmins and IT Operations

  • System administrators and IT professionals were traditionally responsible for deploying and maintaining code written by developers. SysAdmins would set up monitoring and alerting tools, respond to incidents, and escalate issues when necessary. The responsibility of configuring complex systems and backups, aggregating software components, deploying them, and maintaining the availability of these services fell to SysAdmins.

Developers

  • The creation of new features and services, the actual writing of code, fell to developers. Developers would work with product managers to build functionality determined by the product roadmap, then throw the code over the fence to IT operations teams. Over time, people began to see that siloed development and operations teams led to a large number of issues in complex, integrated systems.

DevOps Teams

  • The issues created by siloed SysAdmins and developers led to the concept of DevOps. DevOps ideals support taking ownership of both the development and the upkeep of the code you write. Reliability engineers need to understand the entire software delivery lifecycle (SDLC) in order to build, deploy, and maintain highly available services. SRE within a culture of DevOps bridges the gap between developers and IT teams, exposing engineers to the entire system and ultimately helping you build more robust services.

Importance of SRE

Reliability engineers bolster the observability, flexibility, and reliability of your entire system. SREs add visibility and help create more robust services through monitoring and alerting, continuous learning, chaos engineering, game days, and collaborative post-incident reviews. With an SRE team constantly iterating on development and operations processes, products and services are built faster and more reliably.

To become a reliability engineer, you need to be capable of writing code, maintaining services, and responding when an incident occurs. Software becomes more reliable when engineers take ownership of the code they write and take on-call responsibilities. With more exposure to the system in production, engineers can resolve incidents more quickly, and teams can conduct collaborative post-incident reviews to correct problems and prepare for any other potential issues.

As a reliability engineer, you are the key to both customer satisfaction and software development speed. As systems become more integrated and complex, it’s important to have engineers dedicated to the reliability of the services you create.

Questions to Ask Yourself

When deciding whether you want to become a reliability engineer, there are a number of questions you should ask yourself. It’s important to work through these questions not only to determine whether you’d be good as an SRE, but also to determine whether you’d enjoy being an SRE:

  1. Do you like thinking about the scalability of projects you’re working on?

  2. Do you imagine failure possibilities and think about failover capabilities or incident response scenarios? Even more, does it excite you to run intentional chaos through your system to identify areas for improvement?

  3. Do you think about how code could potentially affect other integrated systems?

  4. Are you okay with exposure to both development and IT operations? Having this breadth of knowledge and being comfortable working in both environments is essential as an SRE.

  5. Are you comfortable developing systems and products that may never be seen by external customers?

  6. Can you handle the pressure of being on-call and responding to an incident? Many times, being a reliability engineer means being able to handle on-call responsibilities.

Answering yes to these questions typically means you’d be a good fit as a reliability engineer.

Learning Resources for SREs

Part of becoming a reliability engineer is a desire to continuously learn and iterate. By keeping up with current ideas, tools, and trends in the SRE community, you become better suited to create the most reliable services. Below are just a few sources you can monitor for top-notch SRE news:

SRE and DevOps

SRE and DevOps go hand in hand. Reliability naturally improves when your team adopts a DevOps culture of collaboration and accountability. Reliability engineers play an integral part in bridging the gap between developers and operations teams.

SRE helps you find ways to improve observability, optimize development processes, add visibility to monitoring and alerting data, and conduct thorough post-incident reviews to prepare for future incidents.

Learn more about the importance of instituting a culture of DevOps and SRE. You can check out the story of how (and why) we built a collaborative team focused on DevOps and SRE in our free eBook, “Build the Resilient Future Faster: Creating a Culture of Reliability.”
