Effective site reliability engineering (SRE) relies on a deep understanding of a service’s underlying infrastructure and architecture. Improving visibility into application and infrastructure health is only the first step toward proactively building reliable systems, and the four golden signals of SRE are the best place to begin monitoring that health. Once you’ve established these base-level monitoring methods, you can continue to improve system visibility from there.
With improved visibility alongside efficient methods of collaboration, SRE teams can quickly see the health of their system and take action to remediate incidents, improving the overall effectiveness of monitoring and alerting efforts. The SRE golden signals help teams identify potential weaknesses in reliability so they can start proactively addressing infrastructure concerns. So, let’s discuss the relationship between monitoring techniques and SRE teams, and look at how the four golden signals factor into the process.
In part three of our DevOps Dictionary, we looked around the internet to help define the discipline of SRE. According to SRE’s Wikipedia entry, “Ben Treynor, founder of Google’s Site Reliability Team, [says] SRE is ‘what happens when a software engineer is tasked with what used to be called operations.’” SRE applies the skills and practices of software engineering to IT operations problems to help teams build solutions to reliability concerns. So naturally, SRE teams need to monitor their services in order to identify areas where reliability can be improved.
That’s where monitoring fits in for SRE teams. Although monitoring is only one small part of creating highly observable systems, it’s an important top-level element for understanding the health of your applications and infrastructure. The four golden signals of monitoring and SRE help create a baseline layer of visibility into the reliability of everything you build. Once you feel comfortable with the level of visibility into the health of the golden signals, you can leverage this additional system understanding to go even deeper with your monitoring tools.
Now that you see the importance of monitoring the golden signals of SRE, let’s dive into the actual metrics that make up SRE’s golden signals.
When you embark on the journey of refining your monitoring efforts, understanding where to start can be tricky. The four golden signals of SRE and monitoring were first defined in the Google SRE book and are now commonly used within many teams. The golden signals are a great place to start as they can help you establish the core metrics you should always be tracking.
Here are the four golden signals, along with why monitoring each one is essential to the reliability of any system.
Latency: How long does it take to service a request? Define a benchmark for the latency of successful requests and track it separately from the latency of failed requests. Tracking the latency of errors also shows how quickly you can identify an incident and move into incident response.
Traffic: Traffic is fairly self-explanatory. How much stress is your system taking on from the number of users or transactions running through your service? Depending on the functionality of your service, measuring traffic can look quite different from company to company. A web service might track HTTP requests per second, while a storage system might track transactions per second. By monitoring real-user interactions and traffic, you can better understand the way end users experience your service and gain visibility into how your systems hold up under stress.
Errors: Of course, every team needs to monitor for errors. Whether those errors are identified by manually defined logic or are explicit errors such as a failed HTTP request, SRE teams need to monitor for them. Many SRE teams use incident management software to alert on critical errors, take action to identify why an error is happening, and work toward incident remediation.
Saturation: Every team needs to monitor the utilization of their system. It’s important for SRE teams to define a saturation metric and the level at which the service is effectively maxed out. Most services start to degrade before utilization hits 100%, so understanding how your own system behaves is important to defining a saturation benchmark that makes sense.
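To make the four signals concrete, here is a minimal sketch of how they could be computed from a window of request records. Everything in it is an assumption for illustration: the `RequestRecord` type, the choice to count only HTTP 5xx responses as errors, the 95th percentile as the latency benchmark, and in-flight requests divided by capacity as the saturation metric. Your own service may call for very different definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical record of one served request."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def percentile(values, pct):
    """Nearest-rank percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(records, window_seconds, in_flight, capacity):
    """Summarize one monitoring window as the four golden signals.

    Assumptions: 5xx responses are errors, and saturation is the
    share of request capacity currently in use.
    """
    ok = [r.duration_ms for r in records if r.status < 500]
    failed = [r.duration_ms for r in records if r.status >= 500]
    total = len(records)
    return {
        # Latency: successful and failed requests are tracked separately.
        "latency_ok_p95_ms": percentile(ok, 95),
        "latency_failed_p95_ms": percentile(failed, 95),
        # Traffic: requests per second over the window.
        "traffic_rps": total / window_seconds,
        # Errors: fraction of requests that failed.
        "error_rate": len(failed) / total if total else 0.0,
        # Saturation: how close the service is to its assumed capacity.
        "saturation": in_flight / capacity,
    }
```

In practice these values would usually come from a metrics system rather than raw records, but the definitions behind each number are the part worth agreeing on as a team.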
By setting up monitoring and alerting rules for the four golden signals, you can ensure coverage for most major incidents in your system. But you’ll need to dive a little deeper to start creating a proactive system for monitoring and SRE.
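An alerting rule for the golden signals can be as simple as comparing each measured value to a threshold. The sketch below assumes the signals arrive as a dictionary of numbers; the signal names and the threshold values are hypothetical placeholders, not recommendations. Note that the saturation threshold sits below 1.0, since most services degrade before utilization reaches 100%.

```python
# Hypothetical thresholds; tune these to your own service.
EXAMPLE_THRESHOLDS = {
    "latency_ok_p95_ms": 300,  # alert if p95 latency exceeds 300 ms
    "error_rate": 0.01,        # alert if more than 1% of requests fail
    "saturation": 0.8,         # alert well before the service is maxed out
}


def evaluate_alerts(signals, thresholds):
    """Return the names of signals whose value exceeds its threshold.

    Signals that are missing or None (for example, no requests in the
    window) are skipped rather than treated as breaches.
    """
    breached = []
    for name, limit in thresholds.items():
        value = signals.get(name)
        if value is not None and value > limit:
            breached.append(name)
    return breached
```

A real setup would typically also require a breach to persist for some time before paging anyone, which helps avoid alerting on brief spikes.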
While monitoring the golden signals is a great start to understanding incidents in your service, SRE teams of the future are proactively learning more about their system through numerous additional techniques. By running organized tests in both staging and production, SRE teams can actively learn about their systems and use the information to build reliability into their services.
Chaos Engineering: Chaos engineering is a discipline used by teams to experiment on their systems to proactively detect failure points or potential weaknesses. By actively injecting chaos into your service, you can see exactly how the system responds to different circumstances.
Game Days: While chaos engineering is geared toward understanding your system, game days can be used to understand your people. Game days are used to test the resiliency of your team when it comes to incident response and remediation. You can use the learnings from game days to develop more efficient processes or determine the need for new tools that make people more efficient.
Synthetic Monitoring: Synthetic monitoring allows teams to create artificial users and simulate user behavior through a service. You can define specific artificial behavior flows in order to learn more about how your system responds under pressure. Synthetic monitoring is an excellent method for granularly testing and determining the reliability of specific services within your greater system.
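As one illustration of the techniques above, a synthetic check can be sketched as a scripted sequence of user steps that are timed and recorded. In this sketch, each step is a plain Python callable that raises an exception on failure; the function name, the step names, and the result format are all hypothetical. In a real check, each callable would issue requests against your service.

```python
import time


def run_synthetic_flow(steps):
    """Run (name, action) steps in order, timing each one.

    Stops at the first failing step, on the assumption that later
    steps in a user flow depend on earlier ones succeeding.
    """
    results = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()
            ok = True
        except Exception:
            ok = False
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append({"step": name, "ok": ok, "elapsed_ms": elapsed_ms})
        if not ok:
            break
    return results
```

Running a flow like this on a schedule gives you latency and error data for a specific user journey even when real traffic is low.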
SRE’s golden signals need to be monitored by any team looking to visibly measure the health of a system. However, knowing the health and general reliability of a system is far different from taking action to improve that reliability. In today’s ecosystem of highly distributed systems and rapid deployment, SRE teams have their work cut out for them. The golden signals of monitoring and SRE give you a healthy starting point from which you can constantly improve and become more proactive with SRE.
By centralizing monitoring and alerting, SRE teams can improve team-wide visibility into system health. Sign up for a 14-day free trial of VictorOps to see for yourself how SRE teams use an incident management solution to centralize and collaborate around important monitoring and alerting data.