Simulators and Validators for SRE (Part One)


Common Gaps in SRE

At its core, SRE is an engineer’s approach to improving operational system reliability via a path that includes, unsurprisingly, even more engineering. Some of the work happens inside the system being made more reliable, and some happens outside, in the form of tools that assist with discovering the changes that need to be made.

Two concepts central to SRE are: 1) systems should provide metrics to self-indicate when failure or lack of correctness is occurring, and 2) testing tools should be used to constantly induce a controlled, survivable level of chaos designed to validate and discover gaps in system resilience. These ideas work together hand-in-glove when driving an automated path to success.

As we’ve pursued our own SRE evolution, two specific areas of weakness have also become clear. The first is a metrics gap within a system’s ability to self-indicate overall correctness in relation to input from multiple external sources. The second weakness is a hidden bit of wishful thinking that live production events will exercise enough cases to constitute sufficient coverage.

Correctness of the Whole vs the Parts

Built-in metrics covering the breadth of a system, producing a set of solid green output indicators, certainly adds value to QA efforts and subsequent certainty about production behavior. However, many distributed systems include concepts that are difficult to measure for correctness at the component level. For example, consider these architectural elements:

  • A product with long-lived interactive user sessions, manipulating data through a complex state machine
  • Integrations with multiple external systems that manipulate data through a complex state machine
  • Real-time collaboration between multiple users

These elements introduce complications by redefining the correct end state as a summary of expectations from multiple external viewpoints. While it’s often easy for a human observer to step back and judge overall correctness, it may be impossible for any individual component to do so using strictly the information available to it.

A metric indicating total correctness needs to consider the intentions of all external sources. In a production environment, that most likely crosses boundaries, leaving the systems being developed and entering the customer’s systems, or even a user’s mind. While neither of those is practical to instrument, it’s possible to build a series of integrated simulators, representing a variety of users and systems, that provides an automated view of the larger picture of external expectations. A simulator of this nature can only approximate the unpredictability and richness of actual external systems and users, but the benefits are considerable.
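As a minimal Python sketch of this idea (all class and function names here are hypothetical, and the keyed store stands in for a real system): each simulated source records the intent behind every action it performs, and a validator judges total correctness by comparing the union of those intentions against the observed end state.

```python
class TargetSystem:
    """Stand-in for the system under test: a simple keyed store."""
    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value


class SimulatedSource:
    """A simulated external actor (user or system) that remembers the
    intent behind each action it performs against the target."""
    def __init__(self, name, target):
        self.name = name
        self.target = target
        self.intended = {}  # key -> value this source meant to end with

    def write(self, key, value):
        full_key = f"{self.name}/{key}"  # namespaced so sources stay disjoint
        self.target.apply(full_key, value)
        self.intended[full_key] = value


def total_correctness(target, sources):
    """True only if the union of all sources' intentions matches the
    target's observed end state -- a system-level correctness metric no
    individual component could compute on its own."""
    expected = {}
    for source in sources:
        expected.update(source.intended)
    return expected == target.state
```

The essential trick is that the simulators are both the provokers of state and the oracle for it, so no component inside the system needs visibility it cannot have.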

Automated External Validation

There are many different flavors of automated testing: unit, integration, end-to-end, stress testing, and chaos testing. Every one of them is important for building reliable systems. Open-source tools already exist to automate and support these activities, and you should apply them rather than spend effort rewriting them.

However, there is another category that is not often discussed, because the only good choices are bespoke. This category covers tests that spot-check key happy paths and corner cases in a way that overcomes the limitations of internal, component-level health metrics and generates system-level success metrics from external observations. For a test to honestly validate results delivered to a user, it has to be an end-to-end test operating outside of every layer that adds complexity, including load balancers and any applicable GSLBs.
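A small sketch of such a spot-check runner in Python, assuming `fetch` is whatever client reaches the system’s public interface from outside every load balancer (the function and result shape are illustrative, not a real API):

```python
import time


def run_spot_checks(fetch, checks):
    """Run named happy-path and corner-case checks purely through the
    external interface. Each check receives `fetch` and returns True on
    success; exceptions count as failures instead of aborting the run."""
    results = {}
    for name, check in checks.items():
        start = time.monotonic()
        try:
            passed = bool(check(fetch))
        except Exception:
            passed = False
        results[name] = {"ok": passed, "seconds": time.monotonic() - start}
    return results
```

Each per-check result doubles as a system-level success metric, measured from the outside, that can be shipped straight to the monitoring pipeline.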

This type of test is not often suggested because it depends on having a suite of integrated simulators that already provide complete knowledge of the externally expected system state, since they are the ones actively provoking that state. Once those simulators are in place, however, this type of test is just a small step further.
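One way to get that complete knowledge is to make the otherwise nondeterministic external streams deterministic from the simulator’s point of view, for example by seeding them. A minimal sketch, with an invented event shape for illustration:

```python
import random


def external_events(seed, count):
    """Simulated external event stream. Seeding the RNG makes the stream
    fully reproducible, so the validator knows exactly what was sent."""
    rng = random.Random(seed)
    for _ in range(count):
        yield {"account": rng.randrange(3), "delta": rng.randrange(1, 10)}


def expected_totals(seed, count):
    """Replay the identical stream to compute the expected end state,
    independently of whatever the system under test produced."""
    totals = {}
    for event in external_events(seed, count):
        totals[event["account"]] = totals.get(event["account"], 0) + event["delta"]
    return totals
```

Because the simulator can replay the exact stream it sent, validation reduces to an equality check between the replayed expectation and the state observed through external interfaces.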

Pushing SRE Value Upstream

While SRE at heart lives and breathes in production, that doesn’t preclude it from adding value earlier in the development lifecycle. The tools and techniques that improve reliability have equal value when applied to earlier phases. In a production setting, a constant trickle of edge-case coverage is enough to implement health-check metrics.

In a staging environment, the ability to drive load to arbitrary intensity, while ensuring quality, provides a crucial ability to automatically vet production readiness in both new code and new environments. In support of QA, many subtle bugs can only be reproduced by simulating the severe loads seen in production.

SRE is a Behavior, not a Role

An SRE-Focused Simulator/Validator

The remainder of this series explores the design and benefits of bespoke testing tools that target a system in a way only possible through understanding the underlying state machine and user expectations. The basic goal of these tools is to simulate an arbitrary number of simultaneous users driving realistic, interactive sessions alongside simulated external systems producing event streams. All of this work is to be done while measuring the subsequent correctness of the results.


In summary, the testing tools being proposed sit outside of a target application, and are driven by a test configuration to operate the external interfaces in concert.
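For concreteness, the test configuration driving both simulators might look something like this minimal sketch (the field names and values are illustrative, not a real schema):

```python
from dataclasses import dataclass


@dataclass
class SimulatorConfig:
    """Knobs shared by the user simulator and the external-event simulator."""
    concurrency: int     # simultaneous simulated actors
    rate_per_sec: float  # actions each actor attempts per second
    duration_sec: int    # how long this phase runs


@dataclass
class TestConfig:
    """Drives the user and external-event simulators in concert
    against a single target's external interfaces."""
    target_base_url: str
    users: SimulatorConfig
    external_events: SimulatorConfig


config = TestConfig(
    target_base_url="https://staging.example.com",
    users=SimulatorConfig(concurrency=50, rate_per_sec=2.0, duration_sec=600),
    external_events=SimulatorConfig(concurrency=5, rate_per_sec=20.0, duration_sec=600),
)
```

Keeping both simulators behind one configuration object is what lets a single test run coordinate user sessions and external event streams against the same target.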


Success Factors

The design and implementation of this testing tool is guided by some very specific criteria, intended to limit the burden of generating more code and increase the effectiveness of the end result.

Simple and Isolated
  • Keep it simple. Every line of code, even if it lives inside a test, adds more risk and more maintenance costs.
  • Try not to use any production code in the testing tool. Refactoring production code should never affect the testing tool unless it changes an external-facing contract. The cost of duplication is worth the safety and isolation.
  • Ideally, the tool touches only external-facing APIs. This provides the slowest possible rate of change in the contract between the test and the system being tested. If it seems like the only way to validate a change is to query an internal database, consider adding an internal API, or look for a type of API-driven user interaction that reveals the results. The greatest benefit will be delivered if this tool can safely operate outside of firewalls.
Complete Tests
  • Have a deep understanding of the correctness of produced results and ensure the test is able to describe the deviations in a way that assists troubleshooting.
    • Some of these results are inside the underlying systems being tested
      • Black box metrics testing the viability of each system
      • System health metrics (disk/cpu/threads, etc.)
    • Some results are exclusive to the testing tool
      • Results delivered to the user in-session interpreted for key points of correctness
  • Operate the product just like a user
    • Generate load from lots of users at a baseline intensity with random spikes
    • Login flow / long-lived user sessions
    • Long-lived intense lists of actions
    • Intentionally “extreme” actions that are either highly repetitive or operate on large quantities of data.
  • Simulate external event sources. This tool represents a rare opportunity to coordinate simulated streams from external sources along with user expectations. By having full knowledge of the sum of all external and internal events, the test can apply a more detailed validation that is normally not possible due to the nondeterministic aspect of external event streams.
  • Works with metrics systems
    • Examine metrics coming out of the product and compare performance to a known baseline. Perhaps there’s 100% correctness, but you have double the threads or a memory growth curve that is unsurvivable. General aberration detection within the metric system should handle extreme cases, but this tool will be aware of the exact quantity of simulated users, in conjunction with metric deviations, and can apply a formula to predict scalability concerns.
    • Produce metrics describing product performance as measured from an outside tool. This allows an effective client-observed metric accounting for the total external experience across server restarts, etc. If simulated user actions were performed without driving an actual browser process, the metrics will represent something close to maximum backend performance without rendering delays, but measured in a way that is nearly identical to what a web or mobile client would see.
  • Adjustable workflow
    • Variety of user session behavior patterns
    • Variety of external system event patterns
  • Adjustable intensity
    • Rate, duration, and concurrency of user simulator
    • Rate, duration, and concurrency of external event simulator
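The “baseline intensity with random spikes” pattern from the list above can be sketched as a per-tick rate schedule (the function and its parameters are illustrative):

```python
import random


def intensity_schedule(baseline, spike_factor, spike_prob, ticks, seed=0):
    """Target request rate for each tick: a steady baseline, with each
    tick independently spiking to baseline * spike_factor with
    probability spike_prob. Seeding keeps runs reproducible so results
    can be compared against a known performance baseline."""
    rng = random.Random(seed)
    return [
        baseline * (spike_factor if rng.random() < spike_prob else 1)
        for _ in range(ticks)
    ]
```

A user simulator can consume one entry per tick as its target action rate, giving the adjustable rate, duration (tick count), and spike behavior described above.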

Watch for Part Two

Stay tuned for part two to learn more details about using simulators and validators for end-to-end testing and SRE.

VictorOps incident management can help organize your stress tests by integrating with your monitoring, alerting, and chat tools. Sign up for your own 14-day free trial to start leveraging the power of observability and collaboration!
