August Roundup: What's Hot in SRE?

August Roundup: What's Hot in SRE? 2018 Blog Banner

Site Reliability Engineering (SRE) continues to gain momentum in DevOps teams. Building a foundation of collaboration and accountability leads to reliable continuous integration and delivery. Functional SRE doesn’t stand in the way of Agile development–it builds resiliency into the process. Don’t find yourself tunnel-visioned into speed when reliability is equally important to your customers.

An SRE team needs a hunger for continuous improvement, chaos experimentation, and process iteration in order to bake reliability into development. Staying up to date on the latest SRE news is part of that equation. So, we’ve rounded up some helpful resources and articles about SRE as part of our end-of-August roundup.

Observability for the Real World

SRE all starts with building observable infrastructure. From the article above, observability means, “A measure of how well internal states of a system can be inferred from knowledge of its external outputs.” Part of SRE is about taking observable external output metrics, determining the health of your internal systems, creating ways to better understand this data, then acting on your insights to make your system more robust.

This article shows practical ways to use observability to solve real IT and DevOps problems. Observing external system metrics over time can give you important insights into the true health of your software. Read the full post to learn the ins and outs of building observability into your service, no matter what your infrastructure looks like today.

The 18 ghosts in your infrastructure stack that can cause failure (and how to avoid them)

This is an older post, but it’s succinct and informative. The infographic clearly lays out 18 areas of your infrastructure that could create surprising errors or failures. Understanding these weaknesses is the first step in finding where to begin adding reliability and visibility. SRE efforts and effective monitoring can help identify these weak points and bake further resiliency into future development.

Identify your system’s weaknesses, monitor, alert, collaborate, and resolve incidents quickly with a deeper system understanding. Check out the infographic to start finding weaknesses in your infrastructure and see many areas of your system which may need a second look.

10+ Great Books For Aspiring DevOps & SRE Engineers

The title explains it all with this one. This is simply a great resource for anyone interested in finding more reading material about DevOps and/or SRE. You can never stop learning more about the ever-changing topics of DevOps and SRE. Click the link to check out this awesome list of books and become the best engineer you can be.

How to Hold an SRE Council Meeting

We may be biased, but this post gives great insight into the way we approach SRE. Through a few iterations, we developed an SRE council dedicated to our passion for building reliability into our service. SRE Team Lead, Jonathan Schwietert, took time to discuss the methods for holding weekly SRE council meetings and what your goals should be. Read on to learn more about a great SRE structure for smaller teams and how to continuously improve and add SRE value into everything you build.

What exactly is the role of an SRE at Google? How do I become one? Is there a road map of skills and experiences required for such roles?

This is a response to the question above from an SRE at Google. It offers great insight into how Google thinks about site reliability engineering and how someone interested in the topic can get into the field. At VictorOps, we always like to say there isn’t a one-size-fits-all approach to organizational SRE, but this gives you a great start when it comes to thinking about the subject.

Click this link to understand more about the mindset needed for an SRE team member and the skills that hiring managers are looking for.

SLOs & You: A Guide To Service Level Objectives

Service level objectives (SLOs) are an excellent way of measuring the efficacy of your SRE efforts. The article above serves as a helpful walkthrough for addressing SLOs, monitoring goals, and reaching a point of more proactive SRE. Monitoring SLAs and SLOs allows SRE teams to make more data-driven decisions and find better ways to add reliability into the system. Find more information on SLOs and how they should pertain to you and your SRE teams here.
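As a rough illustration of what measuring against an SLO can look like, here is a minimal Python sketch. It is not taken from the linked article: the function and field names are hypothetical, the request counts are invented, and it assumes a simple availability-style SLO in which each request counts as either good or bad.

```python
import math
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good events
    target: float            # SLO target, e.g. 0.999
    met: bool                # True when the measured value meets the target
    failures_allowed: int    # bad events the target tolerates at this volume
    failures_remaining: int  # negative means the allowance is overspent


def evaluate_slo(good_events: int, total_events: int, target: float) -> SloReport:
    """Compare a measured good/total ratio against an SLO target (hypothetical helper)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")

    sli = good_events / total_events
    # Round before flooring so float noise in (1 - target) cannot shift the count.
    failures_allowed = math.floor(round(total_events * (1 - target), 6))
    failures_seen = total_events - good_events
    return SloReport(
        sli=sli,
        target=target,
        met=sli >= target,
        failures_allowed=failures_allowed,
        failures_remaining=failures_allowed - failures_seen,
    )


if __name__ == "__main__":
    # Invented numbers: one million requests, 600 of them failed, 99.9% target.
    report = evaluate_slo(good_events=999_400, total_events=1_000_000, target=0.999)
    print(report)
```

Tracking how many failures remain before the target is missed is one way a team might decide whether to keep shipping features or to pause and focus on reliability work.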

DevOps and Site Reliability Engineering — Better Together

DevOps tightens feedback loops, builds cross-functional collaboration, gives engineers more system exposure, and ultimately provides more visibility into how everything is working. Adding SRE to a DevOps culture creates a strong mix of speed and resiliency. Not only can you deliver features quickly, but you can also deliver them reliably–without simply tossing upkeep of the entire production environment to the Ops team.

This story shows why DevOps and Site Reliability Engineering work so well together. It’s important to understand the synergy between the two disciplines and how they’re tied together so tightly. Read more to better understand the close relationship between DevOps and SRE.

A culture of collaboration and accountability creates reliable infrastructure, faster feature development, and happier customers and employees. The free SRE eBook shows you how DevOps and SRE work together to create a win-win situation for reliability and speed.
