At its core, SRE is an engineer’s approach to improving the reliability of operational systems via a path that includes, unsurprisingly, even more engineering. Some of the work happens inside the system being made more reliable, and some happens outside, in the form of tools that help discover the changes that need to be made.
Two concepts are central to SRE: first, systems should provide metrics that self-indicate when failure or a lack of correctness is occurring; second, testing tools should constantly induce a controlled, survivable level of chaos designed to validate system resilience and discover gaps in it. These ideas work together hand in glove when driving an automated path to success.
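To make the pairing concrete, here is a minimal sketch in Python. The `Service` class, its `health()` metric, and the fault-injection flag are all hypothetical names invented for illustration; a real chaos tool would inject failures from outside the process rather than via a parameter.

```python
import random

class Service:
    """Toy service that self-indicates correctness via a health metric."""
    def __init__(self):
        self.processed = 0
        self.errors = 0

    def handle(self, payload, inject_fault=False):
        # A real chaos tool would induce this failure externally
        # (killed pods, dropped packets); the flag stands in for that.
        if inject_fault:
            self.errors += 1
            return None
        self.processed += 1
        return payload.upper()

    def health(self):
        """Self-indicating metric: fraction of requests handled correctly."""
        total = self.processed + self.errors
        return 1.0 if total == 0 else self.processed / total

# Controlled, survivable chaos: fail roughly 10% of requests and confirm
# the built-in metric actually reflects the induced failure rate.
random.seed(42)
svc = Service()
for i in range(1000):
    svc.handle(f"req-{i}", inject_fault=random.random() < 0.10)

assert 0.85 < svc.health() < 0.95
```

The point of the assertion is the hand-in-glove relationship: the chaos harness knows how much failure it induced, so it can validate that the self-indicating metric noticed.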
As we’ve pursued our own SRE evolution, two specific areas of weakness have become clear. The first is a metrics gap: a system’s limited ability to self-indicate overall correctness relative to input from multiple external sources. The second is a hidden bit of wishful thinking: that live production events will exercise enough cases to constitute sufficient coverage.
Built-in metrics covering the breadth of a system, producing a set of solid green output indicators, certainly add value to QA efforts and to subsequent confidence in production behavior. However, many distributed systems include concepts that are difficult to measure for correctness at the component level, such as certain common architectural elements.
They introduce complications by redefining the correct end state as a summary of expectations from multiple external viewpoints. While it’s often easy for a human observer to step back and judge overall correctness, it may be impossible for any individual component to do so using only the information available to it.
A metric indicating total correctness needs to consider the intentions of all external sources. In a production environment, those intentions most likely cross boundaries, leaving the systems being developed and entering the customer’s systems, or even a user’s mind. Neither of those is practical to instrument, but it is possible to build a series of integrated simulators, representing a variety of users and systems, that provide an automated view of the larger picture of external expectations. A simulator of this nature can only approximate the unpredictability and richness of actual external systems and users, but the benefits are considerable.
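A sketch of the idea, with all names (`UserSimulator`, `System`, `act`, `submit`) hypothetical: each simulator drives input into the system *and* records what it expects the system to end up holding, so a total-correctness judgment can be made from outside any single component.

```python
class UserSimulator:
    """Hypothetical simulator: drives input and records its own expectation
    of the resulting system state, standing in for a real user's intent."""
    def __init__(self, name):
        self.name = name
        self.expected = {}

    def act(self, system):
        value = f"order-from-{self.name}"
        self.expected[self.name] = value   # what this "user" expects to see
        system.submit(self.name, value)    # the input it actually sends

class System:
    """Stand-in for the system under test."""
    def __init__(self):
        self.state = {}
    def submit(self, key, value):
        self.state[key] = value

sims = [UserSimulator(f"user{i}") for i in range(3)]
system = System()
for sim in sims:
    sim.act(system)

# Total-correctness metric: the union of every simulator's expectations
# must match the observed system state.
expected = {k: v for sim in sims for k, v in sim.expected.items()}
correct = expected == system.state
```

No single component inside `System` could compute `correct` on its own; the judgment requires the external expectations the simulators carry.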
There are many different flavors of automated testing: unit, integration, end-to-end, stress testing, and chaos testing. Every one of them is important for building reliable systems. Open-source tools already exist to automate and support these activities, and you should apply them rather than spend effort rewriting them.
However, there is another category that is not often discussed, because the only good choices are bespoke. This category is for tests that spot-check key happy paths and corner cases in a way that overcomes the limitations of internal, component-level health metrics and generates system-level success metrics from external observations. For a test to honestly validate the results delivered to a user, it has to be an end-to-end test operating outside of every layer that adds complexity, including load balancers and any applicable GSLBs.
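A minimal sketch of such a spot check, under stated assumptions: `fetch` stands in for a query made through the system's public entry point (the same path a user takes, outside any load balancer), and `expectations` is the state a simulator recorded when it drove the input. All names here are invented for illustration.

```python
def spot_check(fetch, expectations):
    """Compare externally observed results against recorded expectations
    and return a system-level success ratio suitable for a metrics gauge."""
    passed = sum(1 for key, want in expectations.items()
                 if fetch(key) == want)
    return passed / len(expectations)

# Stand-ins for demonstration; a real check would hit the external URL
# that sits in front of the load balancer / GSLB.
expectations = {"user1": "shipped", "user2": "pending"}
observed = {"user1": "shipped", "user2": "failed"}

score = spot_check(observed.get, expectations)  # 0.5
```

The returned ratio is the system-level success metric the internal, component-level health checks cannot produce: it is computed entirely from the outside.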
This type of test is rarely suggested because it depends on having a suite of integrated simulators that already hold complete knowledge of the externally expected system state, since they are the ones actively provoking that state. Once those simulators are in place, however, this type of test is just a small step further.
While SRE at heart lives and breathes in production, that doesn’t preclude it from adding value earlier in the development lifecycle. The tools and techniques that improve reliability in production have equal value when applied to earlier phases. In a production setting, a constant trickle of edge-case coverage is enough to drive health-check metrics.
In a staging environment, the ability to drive load of arbitrary intensity while verifying quality provides a crucial, automatic way to vet the production readiness of both new code and new environments. This also supports QA, since many subtle bugs can only be reproduced by simulating the severe loads seen in production.
The remainder of this series explores the design and benefits of bespoke testing tools that target a system in a way only possible through understanding the underlying state machine and user expectations. The basic goal of these tools is to simulate an arbitrary number of simultaneous users driving realistic, interactive sessions, alongside simulated external systems producing event streams, all while measuring the correctness of the results.
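The simultaneous-sessions idea can be sketched with standard-library threads. This is a toy under assumptions: `session` stands in for a realistic interactive session, the `Queue` stands in for the system's event intake, and the final comparison is the correctness measurement. All names are hypothetical.

```python
import threading
import queue

def session(user_id, inbox, results):
    """One simulated interactive session: emit events into the system
    and record what this user expects the system to have received."""
    expected = [f"{user_id}:{n}" for n in range(5)]
    for event in expected:
        inbox.put(event)          # stand-in for calling the real system
    results[user_id] = expected   # expectation kept for later validation

def run(num_users=10):
    inbox, results = queue.Queue(), {}
    threads = [threading.Thread(target=session, args=(f"u{i}", inbox, results))
               for i in range(num_users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Drain what the "system" received and measure correctness against
    # the union of every session's recorded expectations.
    received = set()
    while not inbox.empty():
        received.add(inbox.get())
    expected = {e for exp in results.values() for e in exp}
    return received == expected

ok = run()
```

Scaling `num_users` up is how the same harness becomes a load driver in staging: intensity is arbitrary, but the correctness check never changes.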
In summary, the testing tools being proposed sit outside of a target application, and are driven by a test configuration to operate the external interfaces in concert.
The design and implementation of these testing tools are guided by some very specific criteria, intended to limit the burden of writing more code and to increase the effectiveness of the end result.
Stay tuned for part two to learn more details about using simulators and validators for end-to-end testing and SRE.
VictorOps incident management can help organize your stress tests by integrating with your monitoring, alerting, and chat tools. Sign up for your own 14-day free trial to start leveraging the power of observability and collaboration!