Breaking Your System for Proactive Site Reliability Engineering (SRE)


In a previous post, we wrote about identifying potential issues, collecting applicable metrics, and defining smart alerts and escalation structures to create more robust systems. Companies sometimes approach incident management reactively, but a lot of the work can be done proactively.

Let’s get into the nitty-gritty with some great tools and practices that help make your SRE efforts proactive.

Events

Once you’ve started collecting and organizing actionable metrics, you can use them to gain a better understanding of the system. A deep knowledge of your system helps you strategically define events and their corresponding triggers and thresholds. Events should be created when integral systems or applications start, or fail to start, when prompted.

Your system should be able to tell whether something is working based on the data (or absence of data). Apply system monitoring tools in actionable, intelligent ways that provide more observability into events occurring internally. These events will show your site reliability engineers exactly what’s happening (or likely, not happening) within your system. The more quickly and clearly your SREs receive event information, the more efficiently they’ll be able to handle the event.
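As a rough illustration of treating both data and the absence of data as signals, here is a minimal Python sketch. The `Event` type, the `detect_event` function, and the 60-second silence window are all hypothetical names and values chosen for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    service: str
    kind: str    # "started", "failed_to_start", or "no_data" (illustrative labels)
    detail: str


def detect_event(service: str,
                 last_heartbeat: Optional[float],
                 now: float,
                 started_ok: Optional[bool] = None,
                 max_silence_s: float = 60.0) -> Optional[Event]:
    """Turn raw signals into an event, or return None if nothing noteworthy happened."""
    # An explicit start result always produces an event.
    if started_ok is True:
        return Event(service, "started", "service came up when prompted")
    if started_ok is False:
        return Event(service, "failed_to_start", "service did not come up when prompted")
    # Absence of data is itself a signal: no heartbeat at all, or one that is too old.
    if last_heartbeat is None or now - last_heartbeat > max_silence_s:
        return Event(service, "no_data", f"no heartbeat within {max_silence_s:.0f}s")
    return None
```

The timestamps are passed in as plain numbers so the check is easy to test; in a real system they would likely come from whatever your monitoring tool records.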

Thresholds

Once you’re properly monitoring the events in your system, you need to know what the next action is. This action can come in the form of an alert, a non-urgent notification, a script that should run, or sometimes doing nothing at all. Thresholds need to be applied based on the severity of the event and how it can best be resolved. Much of the time, this will require little to no human intervention, but it’s imperative to know when an issue is large enough to escalate to an on-call team.
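One way to picture the mapping from severity to next action is a small lookup function. This is only a sketch: the 0 to 3 severity scale, the `Action` names, and the `has_remediation` flag are assumptions made for the example, and your own thresholds will depend on your system.

```python
from enum import Enum


class Action(Enum):
    NOTHING = "do nothing"
    NOTIFY = "non-urgent notification"
    RUN_SCRIPT = "run remediation script"
    PAGE_ON_CALL = "alert the on-call team"


def choose_action(severity: int, has_remediation: bool) -> Action:
    """Pick the next action for an event on a hypothetical 0 (noise) to 3 (critical) scale."""
    if severity <= 0:
        return Action.NOTHING
    if severity == 1:
        return Action.NOTIFY
    # Mid-severity events are handled automatically when a script exists for them.
    if severity == 2 and has_remediation:
        return Action.RUN_SCRIPT
    # Everything else is large enough to need a human.
    return Action.PAGE_ON_CALL
```

Keeping the decision in one place makes it easier to review which events reach a human and which are resolved without intervention.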

Alerts

Typically, your alerts will be related to capacity or latency. If you’re approaching capacity on servers, or certain online/offline applications are not responding quickly, it’s likely something should be escalated. It’s best practice to make sure you’re alerting on anything that will be user-visible or can cause major issues or outages elsewhere in the system. If minor low-level components aren’t running quickly, but the overall system or application continues to function effectively, it’s likely not something that needs alerting. Assessing which alerts are actionable is essential for any DevOps team to properly handle incidents.
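The rule of thumb above can be sketched as a single check. The function name, the 90% capacity limit, and the 500 ms latency limit below are placeholder assumptions for illustration only; pick limits that match your own service.

```python
def should_alert(capacity_used: float,
                 p99_latency_ms: float,
                 user_visible: bool,
                 can_cascade: bool,
                 capacity_limit: float = 0.90,
                 latency_limit_ms: float = 500.0) -> bool:
    """Alert only when a limit is breached AND the impact matters."""
    # Capacity is a fraction between 0 and 1; latency is in milliseconds.
    breached = capacity_used >= capacity_limit or p99_latency_ms >= latency_limit_ms
    # A slow low-level component that users never notice, and that cannot
    # take down anything else, is not worth waking someone up for.
    return breached and (user_visible or can_cascade)
```

For example, `should_alert(0.95, 120.0, user_visible=False, can_cascade=False)` returns `False`: the component is near capacity, but nothing downstream or user-facing is affected.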

Anomaly Detection and Chaos Engineering

Ever hear the phrase, “If it ain’t broke, don’t fix it”? Well, that doesn’t apply to SRE. The phrase for SREs should be, “If it ain’t broke, let’s break it, fix it, then break it again, then fix it again.” At VictorOps, we implemented chaos engineering in the form of Game Days. These Game Days serve as a strategic way to continue testing our systems to make them stronger and more reliable.

Actively monitoring your system inside and outside of your chaos engineering experiments gives your whole team a better understanding of how your system functions. In addition to monitoring your chaos, you’ll need to contain it as much as possible to prevent any downtime or customer impact. Performing chaos events in a staging or testing environment reduces your risk of incurring an actual incident while you continue to bolster your system. While scary, these Game Days will greatly help your team prepare for real-deal system failure situations.
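To make the idea of contained, monitored chaos concrete, here is a minimal sketch of a Game Day runner. It is not how VictorOps runs its Game Days; `run_game_day`, the allowed environment names, and the three callbacks are hypothetical and stand in for whatever failure injection and health checks you already have.

```python
from typing import Callable

# Assumption for this sketch: experiments are only allowed in these environments.
ALLOWED_ENVIRONMENTS = {"staging", "testing"}


def run_game_day(environment: str,
                 inject_failure: Callable[[], None],
                 rollback: Callable[[], None],
                 is_healthy: Callable[[], bool]) -> dict:
    """Run one contained experiment and report what monitoring observed."""
    # Containment: refuse to break anything outside the allowed environments.
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"refusing to run chaos experiment in {environment!r}")
    # Don't pile an experiment on top of an existing problem.
    if not is_healthy():
        return {"ran": False, "reason": "system unhealthy before experiment"}
    try:
        inject_failure()
        healthy_during = is_healthy()
    finally:
        # Always undo the failure, even if the health check itself blows up.
        rollback()
    return {"ran": True, "healthy_during": healthy_during, "recovered": is_healthy()}
```

The returned dictionary is the part your team reviews afterward: whether the system stayed healthy during the failure, and whether it recovered once the failure was removed.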

You Can’t Fear What You Already Know

Your incident management stack can be complicated. The infrastructure necessary to effectively monitor, acknowledge, and resolve incidents varies widely from company to company. Chaos engineering will also look slightly different between organizations. But this basic incident logic and preparation, mixed with a proactive approach to SRE, can be applied to any system. It boils down to understanding your potential pain points, measuring and testing the efficacy of your application or system as a whole, actively monitoring and alerting, and running chaos engineering experiments to make your system more reliable.

Sign up for VictorOps with a free 14-day trial to see for yourself how we make incident management easier and help you build the future faster!
