At first glance, site reliability engineers (SREs) might look distant from customer needs. Because SREs often focus on DevOps and IT problems of nearly all sizes, it’s easy to assume they rarely think about the people who use the product. But, when you dive into the practice of well-done, modern site reliability engineering, you’ll find it’s one of the engineering disciplines most closely aligned with customers.
A large portion of SRE roles and responsibilities includes the collection, aggregation and visualization of key application and infrastructure metrics, logs and traces in order to build observable systems. Additionally, due to their exposure to numerous aspects of the system and their software development expertise, SRE managers often control the engineering and IT organization’s on-call schedules and alerting practices. So, it’s easy to see site reliability engineers as solely focused on internal engineering policies, systems and tools to improve the reliability and speed of CI/CD pipelines.
But, where do SREs go to figure out what metrics are important for their applications and services? And, how do they define the right service-level agreements, objectives and indicators (SLAs, SLOs and SLIs) for their business? Well, these SRE insights and practices all come from looking at customer needs, experiences and asks. So, site reliability engineers are really on the frontlines when it comes to understanding customer expectations and using observability insights to help engineering and IT teams create more resilient systems.
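To make the SLI and SLO vocabulary concrete, here is a minimal sketch of an availability SLI checked against an SLO target. The names `SloReport` and `availability_slo`, and the choice of "good requests divided by total requests" as the indicator, are assumptions made for illustration; your own indicator should come from what your customers actually experience.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # observed fraction of good requests
    target: float            # SLO target, e.g. 0.99
    budget_remaining: float  # fraction of the error budget left (negative if overspent)


def availability_slo(good: int, total: int, target: float) -> SloReport:
    """Compare an availability SLI against an SLO target (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1")
    sli = good / total
    allowed_bad = (1 - target) * total  # failures the SLO tolerates in this window
    actual_bad = total - good
    budget_remaining = 1 - actual_bad / allowed_bad
    return SloReport(sli=sli, target=target, budget_remaining=budget_remaining)


report = availability_slo(good=9950, total=10000, target=0.99)
print(f"SLI={report.sli:.4f}, budget left={report.budget_remaining:.0%}")
```

In this sketch, a shrinking error budget is the signal that reliability work should take priority over new features for that service.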
In our latest SRE webinar, you can listen to Splunk Cloud Platform SRE, Jonathan Schwietert, talk with me, Chris Riley, a DevOps Advocate with Splunk, all about customer-centric SRE. Let’s dive a little deeper and explore some highlights from this conversation to help you start building a customer-first SRE and observability practice.
In Resilience First, our comprehensive guide to SRE and the golden signals, we defined SRE as a functional way to apply software development solutions to IT operations problems. But, that’s really just the tip of the iceberg. SREs are also in charge of sharing observability and resilience insights and practices with everyone in the software engineering and IT organization. While teams often adopt a DevOps mindset to improve collaboration and transparency between cross-functional engineering and IT teams, SRE builds on that foundation. SRE doesn’t replace a DevOps culture; it enhances it.
In the webinar, SRE, Golden Signals and Happier Customers, Jonathan Schwietert touches on the idea that site reliability engineers are basically in charge of making reliability improvements based on data, not feelings. With the right mix of metrics, traces and logs, SREs can better understand their application and infrastructure health through deeper observability. With deeper observability, site reliability engineers can influence changes to DevOps practices and tools to enhance the way engineering and IT teams deliver reliable services to production. Again, all of this comes with the caveat that the prioritization of every project ties back to customer pain points and requests.
Also in our recent webinar, I ask the question, “Do you just do what Google does?” And, with this question, I’m really asking, “Do you simply follow in the exact footsteps of large, successful software engineering and IT teams when you approach SRE and monitoring practices?” The golden signals laid out in Google’s comprehensive SRE book, along with related approaches such as the RED method, are a great starting point for observability if you need guidance. But, the key to real observability and resilience is finding your own golden signals.
By understanding the way customers use your applications and services, you can better understand the types of white box and black box metrics that can lead to happier customers. You can take insights from highly successful engineering teams like Netflix, Google or Amazon to give you a place to start when building out SRE and observability from the ground up. But, for scalable SRE, it’s important to think critically about every component of your own team, application and infrastructure on its own terms.
Do you really have the same kind of resources as an engineering team at Netflix? Also, if you’re not a content streaming platform, then you’ll have fundamentally different key metrics depicting true reliability anyway.
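As a starting point before you tailor the signals to your own customers, the four golden signals (latency, traffic, errors and saturation) can be sketched from a window of request records. Everything named here is an assumption for illustration: the `Request` record, the `golden_signals` helper, the choice of p95 for latency, treating HTTP 5xx as errors, and using CPU utilization as the saturation measure.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize one observation window (hypothetical helper)."""
    if not requests:
        raise ValueError("need at least one request in the window")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile([r.duration_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": cpu_utilization,  # assumed to be measured elsewhere
    }


window = [Request(duration_ms=10.0 * i, status=200) for i in range(1, 10)]
window.append(Request(duration_ms=100.0, status=500))
print(golden_signals(window, window_seconds=5, cpu_utilization=0.62))
```

A streaming service and a checkout service would likely swap in different latency thresholds, error definitions and saturation measures, which is the point of finding your own signals.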
SREs with a customer-first mindset will always find the balance between people and technology when approaching observability, monitoring and alerting. You could have all the data in the world about how your applications and infrastructure work with each other and their overall health. But, if you can’t provide this data to engineers and IT analysts in an easy-to-consume way, then you’ve essentially added no value. Observability is about limiting your system’s unknown-unknowns, understanding more of your system’s known-unknowns and enabling engineering and IT teams to do more with this information.
The basic building blocks of an observable system start with metrics, traces and logs. Metrics can help you know if there’s a problem, traces allow you to find where that problem is and logs give your team the ability to diagnose the problem and fix it. But, incidents are bound to occur in any system, no matter how observable. And, observability without action is just empty data. So, on top of your observability KPIs, you need a way to quickly notify the proper developers and IT professionals of problems, with context, and give them an avenue for easy collaboration. This holistic approach to SRE helps you see the full path from customer insights to observability to action.
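The idea of notifying the right people with context can be sketched as a small rule that turns a metric breach into a routed alert carrying a trace ID and recent log lines. This is an illustration only: the `Alert` record, the `ROUTING` table, the team names and the `evaluate` function are all hypothetical and do not represent any particular alerting product’s API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    service: str
    team: str
    summary: str
    context: dict = field(default_factory=dict)


# Hypothetical mapping from service to the on-call team that owns it.
ROUTING = {"checkout": "payments-oncall", "search": "search-oncall"}


def evaluate(service, error_rate, threshold, trace_id, recent_logs) -> Optional[Alert]:
    """Return a routed alert with context if the metric breaches the threshold."""
    if error_rate <= threshold:
        return None  # metric is healthy, nobody gets paged
    return Alert(
        service=service,
        team=ROUTING.get(service, "default-oncall"),
        summary=f"{service} error rate {error_rate:.1%} exceeds {threshold:.1%}",
        # The trace shows where the problem is; the logs help diagnose it.
        context={"trace_id": trace_id, "recent_logs": recent_logs[-3:]},
    )


alert = evaluate("checkout", 0.05, 0.01, "trace-abc123",
                 ["cache miss", "retrying", "db timeout", "db timeout"])
if alert is not None:
    print(alert.team, "|", alert.summary)
```

The metric decides whether to alert, while the attached trace and logs give the responder a head start on where to look and what to fix.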
No developer wants to write code in an echo chamber. End-users drive software businesses and allow developers, sysadmins and SREs to come to work each and every day. Software engineers and IT practitioners alike are learning to see their systems through a customer’s eyes and think creatively to improve the observability and resilience of their applications and services. SREs are connecting the dots between customers, technical systems and the DevOps-minded teams who support them.
Check out the recording of SRE, Golden Signals and Happier Customers to learn how you can get the most value out of customer-first SRE and observability practices.