For a while, the world of Agile software development and continuous integration and delivery (CI/CD) valued speed over reliability. Before that, IT professionals and developers had spent years working through long release cycles – six months to a year – to create, test and release software to customers. As cloud-based systems and Agile practices became more common, developers realized they could provide new services to customers faster than ever before. So, for the time being, speed was the shiny new object for software engineers – even if it came with some pitfalls for IT operations.
While developers were moving faster than ever, IT teams were tasked with deploying this code reliably at the same speed. However, a large number of operations teams weren’t equipped to deal with this velocity and found large backlogs of work building up. At a faster pace than ever before, IT operations were being asked to take code they’d never seen, test it, configure it for their environments and ensure a seamless release with little to no context. Then, they’d deploy the code and find problems in production anyway, forcing them to spend more time on incident remediation and upkeep.
Large bottlenecks in the release management lifecycle began to show up and uncontrolled speed was leading to instability in production. So, high-performing IT operations and engineering teams started looking for solutions to this problem. DevOps practices first popped up as a way to tighten the relationship between developers and IT teams without slowing the velocity of delivery. But, to even further bolster resilience, Google created the first-ever site reliability engineering (SRE) team – a group of developers who can apply engineering expertise to operations problems.
Waterfall development led to deployment bottlenecks and silos between development and operations. The need for rapid development and continuous delivery forced IT teams to compromise the overall reliability of the services they maintained – causing more incidents and poor customer experiences. Then, DevOps collaboration and transparency brought developers and operations closer together and helped teams streamline release management processes.
The shift-left mindset became more prominent as IT operations were allowed greater input during product planning, testing began earlier in the development process, and developers took more responsibility for the uptime of services in production. Site reliability engineering became a natural addition to the process – allowing software engineers to dedicate their time and development capabilities to operations processes and reliability concerns. Instead of reacting to bugs and incidents in production, SRE teams were formed to proactively identify and remediate incidents. By focusing on observability through improved metrics, logs, traces and dashboards, SREs are able to surface problems faster and give greater alert context to the real-time responders.
As you would expect, dedicating software engineers to focus specifically on SRE initiatives and system resilience leads to a greater percentage of successful releases and a shorter mean time to acknowledge and resolve (MTTA/MTTR) when incidents do come up. But, why should the business care? What’s the benefit of adopting SRE for your company? Next, we’ll go over some ways to make the business case for SRE and some ways to track and show the value of site reliability engineering over time.
In 2019, almost every company is a technology company. Even if you’re a local restaurant, you’re likely taking advantage of technology for call routing, reservation bookings, menu updates, etc. If your website goes down, you potentially lose a reservation to one of your nearby competitors. Not only is the restaurant losing a booking but they’re likely paying someone to maintain their website and look into the issue – costing them even more money. While this is a small example, it shows the value of ensuring constant uptime and the opportunity cost involved with incident management.
When developers are fixing numerous issues in production, they’re not building new features and services. SRE principles allow engineers to focus on work that drives value within their specific disciplines. Developers can focus on writing new code and pushing out new products, SRE teams can focus on observability and monitoring and alerting, and operations teams can focus on testing, configuration and production upkeep. SRE can act as a highly strategic function within the business, allowing you to focus on the value behind projects and not just projects themselves.
Site reliability engineers can create KPIs and track service health all the way through to costs of downtime or lost productivity. Tying SRE metrics back to the KPIs of the business will show product, sales, marketing and customer support the value brought in due to system reliability. More stringent SLAs and SLOs, backed by well-chosen SLIs, could potentially help the company close more deals and provide a competitive advantage. But, more specifically, let’s look at some of the key benefits that teams can expect from SRE.
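To make the link between SLIs, SLOs and the cost of downtime a little more concrete, here is a minimal Python sketch. It assumes a simple request-based availability SLI (good requests divided by total requests) and a revenue-per-minute figure for downtime cost. The names `evaluate_slo`, `SloReport` and `downtime_cost`, along with all of the numbers, are hypothetical and only for illustration; real teams will define these metrics in whatever way fits their services.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float             # measured fraction of successful requests
    error_budget: int      # failed requests the SLO tolerates in this window
    budget_remaining: int  # budget left after the observed failures
    slo_met: bool


def evaluate_slo(total_requests: int, failed_requests: int, slo_target: float) -> SloReport:
    """Compare a measured availability SLI against an SLO target (assumed definitions)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = (total_requests - failed_requests) / total_requests
    # The error budget is the number of requests allowed to fail under the SLO.
    error_budget = round(total_requests * (1 - slo_target))
    return SloReport(
        sli=sli,
        error_budget=error_budget,
        budget_remaining=error_budget - failed_requests,
        slo_met=sli >= slo_target,
    )


def downtime_cost(minutes_down: float, revenue_per_minute: float) -> float:
    """Rough cost of downtime: lost revenue only, ignoring engineer time."""
    return minutes_down * revenue_per_minute


# Illustrative numbers only: one million requests, 400 failures, a 99.9% SLO.
report = evaluate_slo(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(report)
print(downtime_cost(minutes_down=30, revenue_per_minute=50.0))
```

A report like this gives sales and support a number to point to (how much error budget is left) rather than a general claim that the service is reliable.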
SRE teams spend their time working across many different areas of an organization’s systems. Of all the groups within the organization, site reliability engineers have the greatest understanding of how everything in the system is connected. So, they know the best way to track metrics, logs and traces across disparate services and depict a holistic picture of system health. And, if an incident occurs, the observability is already there so on-call responders can find the context they need when they need it.
SRE inherently encourages a culture of DevOps. Site reliability engineers fit perfectly in that gap between developers and sysadmins, helping find ways to improve automation and communication that benefit both teams. In many organizations, the release process can feel like Dev vs. Ops, but in reality, the entire engineering team is equally responsible for facilitating a reliable, speedy CI/CD pipeline. SRE can expose areas for improvement in the release pipeline while also creating rules around the culture of on-call availability and incident response that encourage everyone to be more accountable.
No matter how large or small your organization is, you need a system for responding to application and IT infrastructure alerts. With larger organizations, this was traditionally done through a centralized command center – the network operations center (NOC). NOCs were responsible for triaging all incidents and alerts coming into the system and figuring out how to route those alerts to the right person. Well, SRE is using automation, machine learning and a deep understanding of a system’s operations to move toward a modern NOC where alerts go straight to the person responsible for fixing the related problem.
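As a rough illustration of alerts going straight to the responsible person instead of through a central NOC, the sketch below routes an alert by looking up the owner of the affected service. This is a plain rule-based lookup, not the machine learning mentioned above, and every name in it (`Alert`, `SERVICE_OWNERS`, `route_alert`, the service and rotation names) is a hypothetical stand-in rather than any particular product’s API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str  # e.g. "critical" or "warning"
    message: str


# Hypothetical ownership map: which on-call rotation owns each service.
SERVICE_OWNERS = {
    "checkout-api": "payments-oncall",
    "search": "search-oncall",
}

# Catches critical alerts for services with no known owner.
FALLBACK_ROTATION = "ops-triage"


def route_alert(alert: Alert) -> Optional[str]:
    """Return the rotation to page, or None if the alert should not page anyone."""
    if alert.severity != "critical":
        # Non-critical alerts are left for dashboards or tickets rather than paging.
        return None
    return SERVICE_OWNERS.get(alert.service, FALLBACK_ROTATION)


print(route_alert(Alert("checkout-api", "critical", "5xx rate above threshold")))
print(route_alert(Alert("legacy-batch", "critical", "job failed")))
print(route_alert(Alert("search", "warning", "latency slightly elevated")))
```

The fallback rotation plays the role the NOC used to play, but only for alerts that have no clear owner.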
SREs will also know the best way to build an on-call process and how to optimize the system for alerts. Every tech stack and organizational structure will differ, so it’s important to overall incident management efficiency that site reliability engineers take a step back and think objectively about the best approach to on-call schedules and alert rules. If SREs aren’t on-call, they can also help objectively determine the best way to route alerts through systems – whether they’re giving that responsibility to developers or operations teams. But, in many cases, it can make sense for site reliability engineers to also be on-call.
Site reliability engineers have the most visibility into what’s wrong with production environments and how these reliability concerns affect customers. So, while it’s SRE’s job to create visibility into service health to improve incident response, it’s also important to point out flaws that need to be fixed and prioritized in a team’s product roadmap. Reliability is a feature and SREs are responsible for continuously delivering customer value through greater reliability in the software delivery and incident management lifecycles.
The level of value provided by SRE initiatives will vary from team to team. But, the value of reducing customer churn rates and closing more deals due to improved service resilience can’t be overstated. SREs need to always remain customer-focused first, business-focused second, and engineering-focused third. If making adjustments to an incident response process will improve customer experiences more than optimizing monitoring dashboards, then upgrading the incident response process should take precedence.
Once your SRE team starts delivering value to the engineering organization and the overall business, you need to figure out how to visualize the success of SRE and track it over time. Ensure that any major SRE initiatives have clear delivery and incident management KPIs that can showcase the exact value of a project – whether that value is measured in money, time or some other applicable unit. It’s great to spout off the positive effects of SRE but it’s even better to show this value through real data.
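One way to turn incident history into trackable KPIs is to compute MTTA and MTTR from incident timestamps. The Python sketch below assumes each incident records when it was triggered, acknowledged and resolved, and it measures MTTR from trigger to resolution; some teams measure from acknowledgement instead. The `Incident` class, the function names and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass
class Incident:
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mtta(incidents: Sequence[Incident]) -> timedelta:
    """Mean time from trigger to acknowledgement (raises on an empty sequence)."""
    seconds = mean((i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean time from trigger to resolution (an assumed definition of MTTR)."""
    seconds = mean((i.resolved_at - i.triggered_at).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Made-up sample data: two incidents with 4 and 6 minute acknowledgement times.
incidents = [
    Incident(
        triggered_at=datetime(2019, 6, 1, 10, 0),
        acknowledged_at=datetime(2019, 6, 1, 10, 4),
        resolved_at=datetime(2019, 6, 1, 10, 40),
    ),
    Incident(
        triggered_at=datetime(2019, 6, 3, 22, 15),
        acknowledged_at=datetime(2019, 6, 3, 22, 21),
        resolved_at=datetime(2019, 6, 3, 23, 15),
    ),
]

print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
```

Computing these values per month or per quarter, before and after an SRE initiative, gives a simple trend line to show alongside cost or customer metrics.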
SREs don’t rest on their laurels and wait for projects to come to them. Site reliability engineering proactively identifies areas for improvement and gives people the autonomy to implement solutions. Developers and operations teams don’t need to argue about who’s accountable for what and when it’s the right time to focus on reliability vs. speed – SRE can help you prioritize this problem. Resilience shouldn’t be looked at as a task that takes time away from development. It should be looked at as a feature within the broader scope of software development. You don’t need to choose between reliability and speed. With SRE, you can have both.
Learn more about the day-to-day implementation of SRE and how teams are strategically using site reliability engineering to create a competitive advantage. Read our guide, Resilience First, to see how we’re using SRE and the four golden signals of monitoring to create a development process focused on both speed and reliability.