Site reliability engineering (SRE) isn’t a new concept or role, but it isn’t old either. With the growth of Agile, DevOps practices, and remote engineering teams, as well as changes to the traditional NOC and SOC models, SRE is helping meet the need for software engineers dedicated to IT infrastructure and application resilience. Site reliability engineers often have a deep understanding of core application architecture and infrastructure, which helps them connect developers and IT operations teams and encourage more collaborative DevOps workflows.
But there are many competing views on how to implement site reliability engineering, on whether SREs should be responsible for on-call rotations and alert management, and even on whether SRE should be broken out into its own group at all. So, we decided to create a guide, Resilience First, to help cover the most important aspects of SRE in an easy-to-read, shareable PDF.
Since VictorOps is an incident management and response tool focused on making on-call suck less for the people involved in real-time outages and downtime, we’ve learned a fair amount about managing production systems. Our customers range from people practicing the NOC model and managing on-premises data centers and services to SREs and DevOps-minded teams managing hybrid, multi-cloud infrastructure. But each one of these teams is realizing the benefits of reducing MTTA and MTTR over time to recover faster from outages and incidents, which leads to more resilient services and happier customers.
Throughout this journey, as well as our own journey into adopting SRE, we’ve found 5 core components for any effective SRE team. Let’s take a peek at these 5 components below, or you can read about them in more detail in Resilience First, our recently published site reliability engineering PDF.
The following five ideals of SRE can lead to better customer experiences through factual data and insights. Observability and practical metrics are the best way for site reliability engineers to facilitate service resilience and infrastructure uptime – giving customers what they expect.
Site reliability engineers will be in charge of defining and meeting service-level objectives, agreements and indicators (SLOs, SLAs and SLIs). Based on the maturity of the underlying applications and infrastructure, as well as the overall team structure and buy-in for reliability practices, SREs can assess reasonable metrics to quantify uptime and availability for customers. What level of availability can you reasonably expect to maintain consistently? And what level will make customers and potential customers happy, leading to more business?
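To make the availability question concrete, here is a minimal Python sketch of how an availability target translates into allowed downtime, along with one simple way to measure availability from request counts. The function names, the 30-day window and the example targets are assumptions chosen for illustration, not figures from the guide.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over a window.

    Assumes availability is measured as uptime over wall-clock time.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def availability_percent(total_requests: int, failed_requests: int) -> float:
    """Request-based availability SLI: percentage of requests that succeeded."""
    if total_requests == 0:
        return 100.0  # no traffic, so nothing failed
    return 100 * (total_requests - failed_requests) / total_requests


if __name__ == "__main__":
    # Hypothetical targets, shown only to compare their downtime allowances.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, moving from 99% to 99.9% shrinks the monthly downtime allowance from roughly seven hours to under 45 minutes, which is why the target should reflect what the team can realistically sustain.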
Of course, if site reliability engineers are responsible for service availability, they’re also responsible for performance. In a sense, performance is a different way of looking at availability. In the eyes of an engineering team, customers who experience a certain level of latency or another type of performance degradation may as well be experiencing pure downtime. If the service isn’t performant and available, it’s nearly unusable. SREs are in charge of bringing insights and action to these production systems in order to ensure developers and IT teams fix problems quickly, improve customer experiences, and make applications and infrastructure more resilient over time.
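One way to treat slow responses as a form of unavailability is a latency SLI: the share of requests served within a threshold. The sketch below is illustrative only; the 300 ms threshold, the sample values and the helper names are assumptions, and the percentile uses the simple nearest-rank method.

```python
import math


def latency_sli(latencies_ms, threshold_ms):
    """Fraction of requests served at or under the latency threshold."""
    if not latencies_ms:
        return 1.0  # no requests, so none were slow
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms)


def percentile(latencies_ms, pct):
    """Nearest-rank percentile of the observed latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds.
    samples = [120, 95, 310, 88, 1500, 140, 102, 99, 870, 110]
    print("share of requests under 300 ms:", latency_sli(samples, 300))
    print("p50 latency (ms):", percentile(samples, 50))
    print("p90 latency (ms):", percentile(samples, 90))
```

In this made-up sample the median looks healthy while the 90th percentile does not, which is the kind of degradation that averages tend to hide.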
To ensure performance and availability, SREs need to know what to monitor and alert on in their applications and infrastructure. Observable services make development and release teams far more efficient, which in turn drives more uptime and better performance for customer-facing services. SREs use white box and black box monitoring together, alongside dashboards and other visualizations, to ensure development, IT and security teams across the organization have a better feel for their application and infrastructure health.
SREs’ involvement with on-call management and incident response often differs between organizations. While site reliability engineers don’t always need to be on-call themselves, they should at least contribute to post-incident reviews and have visibility into the incident response process at a high level. A large part of system reliability lies in how efficiently DevOps and IT teams respond to incidents and outages in production. Site reliability engineers need to be accountable for the success of their incident response teams, which often means they need to be part of the on-call process.
Tying it all together is preparation. SREs need to ensure that developers and IT operations teams have the resources they need to understand their systems, know when something’s wrong and quickly respond to problems. Through collaborative post-incident review processes, useful metrics and dashboards and overall improvements to an organization’s CI/CD process, site reliability engineers have a lot of pull over DevOps and IT efficiency.
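Response efficiency is commonly tracked with the MTTA and MTTR metrics mentioned earlier. The following Python sketch shows one way to compute them from incident timestamps. The incident records and field names are hypothetical, and MTTR is assumed here to run from the moment an incident is triggered to the moment it is resolved; teams define it in different ways.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are illustrative only.
incidents = [
    {"triggered": datetime(2024, 1, 3, 2, 0),
     "acknowledged": datetime(2024, 1, 3, 2, 4),
     "resolved": datetime(2024, 1, 3, 2, 34)},
    {"triggered": datetime(2024, 1, 9, 14, 30),
     "acknowledged": datetime(2024, 1, 9, 14, 32),
     "resolved": datetime(2024, 1, 9, 15, 0)},
    {"triggered": datetime(2024, 1, 20, 9, 15),
     "acknowledged": datetime(2024, 1, 20, 9, 24),
     "resolved": datetime(2024, 1, 20, 10, 11)},
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


mtta = mean_minutes(incidents, "triggered", "acknowledged")  # mean time to acknowledge
mttr = mean_minutes(incidents, "triggered", "resolved")      # mean time to resolve

if __name__ == "__main__":
    print(f"MTTA: {mtta:.1f} minutes")
    print(f"MTTR: {mttr:.1f} minutes")
```

Tracking both numbers over time, rather than per incident, gives post-incident reviews a trend to discuss instead of a single data point.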
Again, to read more details about the thought processes behind structuring an SRE team and facilitating highly observable applications and infrastructure, check out our new PDF, Resilience First.
You’ve likely heard about or read many of these other useful SRE resources, whitepapers, guides, podcasts, PDFs, etc. But, here’s a list of other great reading/viewing material for anyone interested in site reliability engineering:
Interested in seeing how SREs are using VictorOps to improve incident response and alert management to make customers happier while making on-call suck less? Sign up for a 14-day, free trial or reach out to our team for a personalized demo.