In software development and IT, a mindset of continuous iteration and improvement leads to faster delivery of reliable applications and infrastructure. But part of continuous improvement is continuous testing, and continuous testing naturally leads to improved monitoring and alerting and, eventually, to teams implementing chaos engineering practices and principles. Active experimentation and testing of systems in both staging and production can lead to deep insights that will help you build more resilient services in the future.
Chaos engineering is being adopted by top DevOps and IT teams at companies such as Netflix and Amazon. To build software with the scalability and flexibility of platforms like Netflix or Amazon, you need to quickly identify problems and take action to remediate issues. Chaos engineering principles can help you proactively find weaknesses in your architecture and remediate incidents before they happen.
Before we dive into the essential chaos engineering principles, we need to define what chaos engineering really is and how the practice first started.
What is chaos engineering?
According to Gremlin, “Chaos engineering is a disciplined approach to identifying failures before they become outages.” By setting up monitoring tools and actively running chaos through your systems in production, you can see in real-time exactly how your service responds to pressure. Chaos engineering practices will look different from team to team – but they’re always a method for the intentional injection of chaos into your systems. Chaos engineering allows teams to truly learn from the failure and performance of their applications and infrastructure.
The history of chaos engineering
Chaos engineering first became relevant as large internet companies began to implement more complex, cloud-based architecture and distributed systems. The scale of these projects became so large and complicated that teams needed to find new ways to test for failure in their systems. So, teams began to practice chaos engineering.
Most people attribute the beginning of large-scale chaos engineering principles to Netflix’s Chaos Monkey and subsequently, Simian Army. According to Gremlin’s history of chaos engineering, Netflix made the move to AWS cloud-based infrastructure in 2010. Because of this, they needed to inject chaos into their systems to ensure positive streaming experiences in case of any downtime from Amazon servers. Hence, the creation of Chaos Monkey which eventually gave way to the complete suite of failure testing tools included in their Simian Army.
According to Netflix, Chaos Monkey is, “a tool that randomly disables production instances to make sure we can survive this common type of failure without any customer impact.” And, after the success of Chaos Monkey, the team built out the Simian Army – initially made up of Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey and Chaos Gorilla.
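At its heart, Chaos Monkey’s behavior – randomly picking a production instance to disable – is simple. Below is a minimal, hypothetical sketch of that idea; a real implementation would talk to the cloud provider’s API and respect schedules, opt-out lists and business hours:

```python
import random

def pick_victim(instances, excluded=()):
    """Randomly select one instance to terminate, skipping any on the exclusion list."""
    candidates = [i for i in instances if i not in excluded]
    if not candidates:
        return None
    return random.choice(candidates)

# Hypothetical fleet; in practice these IDs would come from your cloud provider's API.
fleet = ["i-api-1", "i-api-2", "i-api-3", "i-worker-1"]
victim = pick_victim(fleet, excluded=["i-worker-1"])
print(f"Terminating {victim} to verify the service survives losing an instance")
```

The interesting part isn’t the random selection – it’s that running this continuously forces every service to tolerate instance loss by design.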
Below, drawing from the same Netflix article, we list out exactly how Netflix applied chaos engineering principles and used each of these “Monkeys” to actively test the resilience of their infrastructure:
As you might expect, Latency Monkey is used to create artificial delays in Netflix’s RESTful client-server communication layer. The engineering team can use Latency Monkey to simulate service degradation or even complete downtime in order to ensure that upstream services respond appropriately. By creating large delays in client-server communication, the team can test the fault tolerance and resiliency of dependencies for new features and services without making dependencies unavailable to the rest of the system.
Conformity Monkey will automatically find instances in Netflix’s systems that don’t adhere to their engineering team’s best practices. This helps service owners identify issues before they happen – allowing the person or team to remediate the problem and re-launch the application or service properly.
Doctor Monkey leverages the team’s health checks and other monitoring metrics (e.g. storage, CPU load, ETL, etc.) to quickly detect unhealthy instances. Then, Doctor Monkey will automatically remove the instance from service and eventually terminate the instance – but only after the team has had time to conduct a post-incident review and perform root-cause analysis.
Quite simply, Janitor Monkey finds “trash” in Netflix’s cloud environment. The tool finds unused resources and deletes them.
Security Monkey ensures that the team can quickly identify potential security violations or vulnerabilities. The Security Monkey informs the team if any SSL or DRM certificates are coming up for renewal and can ensure that all security measures or certifications are valid and up-to-date. The tool will automatically detect if any AWS security groups are misconfigured and will automatically inform the team and remove instances that are out of compliance.
The 10-18 Monkey will automatically detect any localization/internationalization issues in Netflix’s instances. It will detect configuration or run-time problems for customers across different languages and geographic regions. This way, the Netflix engineering team can ensure service uptime and performance for all of their customers across the globe.
Chaos Gorilla takes Chaos Monkey up a level: it was built to simulate the outage of an entire AWS availability zone. The Netflix team needed to ensure their service would efficiently re-balance to functional AWS availability zones with little to no manual intervention or customer impact.
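Of these, Latency Monkey’s core idea – injecting artificial delay into calls to a dependency – is the easiest to illustrate. Here is a minimal, hypothetical Python sketch (not Netflix’s actual tooling), where the delay range and the `fetch_recommendations` function are placeholders:

```python
import functools
import random
import time

def inject_latency(delay_range=(0.1, 0.5), probability=1.0):
    """Decorator that sleeps for a random interval before the wrapped call,
    simulating a slow downstream dependency."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(random.uniform(*delay_range))
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(delay_range=(0.01, 0.05))
def fetch_recommendations(user_id):
    # Stand-in for a real client-server call in the communication layer.
    return {"user": user_id, "items": ["a", "b"]}
```

Dialing `probability` and `delay_range` up slowly lets you watch how upstream timeouts, retries and fallbacks behave long before a real dependency degrades.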
Core principles of chaos engineering
Over the years, teams have conducted their own chaos experiments and implemented their own systems for failure injection. By watching how Netflix’s chaos testing has evolved over time and how other companies have successfully applied chaos engineering practices, we’ve developed a clearer understanding of the core principles of chaos engineering.
So, we decided to list out the core principles and philosophies that persist between every team taking advantage of chaos engineering to build confidence in the reliability of their services:
Define your system’s “normal”
Chaos engineering is like the scientific method. Without defining a control group and an experimental group, you’ll have nothing to measure against. So, it’s important to first define the “normal” state of your applications and services.
Teams should define the key metrics they need to track and then monitor and measure the output of their system in order to determine what’s indicative of normal behavior. By understanding the metrics indicating when your service is healthy and performant, you can define the metric thresholds that determine when your system is suffering. Every team looking to implement chaos engineering principles needs to understand what their service looks like when it’s functioning properly.
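The idea of a measurable “normal” can be made concrete. A minimal sketch, assuming hypothetical metric names and healthy bounds – a real team would derive these from their own monitoring history:

```python
# Hypothetical steady-state definition: metric name -> (lower, upper) healthy bounds.
STEADY_STATE = {
    "p99_latency_ms": (0, 250),
    "error_rate": (0.0, 0.01),
    "requests_per_sec": (500, 10000),
}

def violations(metrics, steady_state=STEADY_STATE):
    """Return the metrics currently outside their healthy bounds.

    A missing metric counts as a violation too: if you can't measure it,
    you can't claim the system is behaving normally."""
    out = {}
    for name, (lo, hi) in steady_state.items():
        value = metrics.get(name)
        if value is None or not (lo <= value <= hi):
            out[name] = value
    return out
```

During an experiment, an empty result means the steady-state hypothesis held; a non-empty one tells you exactly which signal the injected chaos pushed out of bounds.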
Realistically disrupt your system’s “normal”
Have brainstorm sessions and determine realistic, likely ways that your system could fail. Then, think about how you can disrupt your system’s “normal” – whether it’s through synthetic spikes in traffic, intentional killing off of servers, or any other chaos you can think of. It’s important that your experiments and chaos testing tools reflect scenarios which are likely to happen in reality. Then, when you learn how these failures affect the overall system, you can create real change to your processes and technology – leading to more resilient services.
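One common disruption – a synthetic traffic spike – can be sketched as a simple concurrent burst. The `request_fn`, burst size and worker count here are placeholders, not a real load-testing tool:

```python
import concurrent.futures
import time

def traffic_spike(request_fn, burst_size=50, workers=10):
    """Fire a burst of concurrent requests to simulate a synthetic traffic spike.

    Returns the individual results plus the wall-clock time the burst took,
    so you can compare the spike against your steady-state metrics."""
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: request_fn(), range(burst_size)))
    return results, time.monotonic() - start
```

In practice you’d point `request_fn` at a real endpoint and watch your dashboards for the moment autoscaling, rate limits or queues start to buckle.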
Minimize the blast radius
Any time you’re testing unknowns in your applications and infrastructure, there’s a real risk of negative customer impact. It’s the responsibility of the chaos organizer to minimize the blast radius of the tests and ensure the team is prepared for incident response – just in case. As long as the blast radius is contained, these outages and failures can lead to informative insights without drastically harming customers – helping your team build more robust software in the future.
If you’re first diving into chaos testing, it’s great to get started in staging. But, in the end, you’ll want to run chaos experiments in production. You can only truly see how failures and outages will affect your system and customers by applying the principles of chaos to production environments. Because of the risks associated with chaos engineering in production, minimizing the blast radius of your experiments becomes even more important.
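Containing the blast radius often starts with simply bounding how many hosts an experiment may touch. A minimal sketch, with a hypothetical fraction and hard cap that a real team would tune to their own risk tolerance:

```python
import math
import random

def blast_radius(hosts, fraction=0.05, cap=3):
    """Pick a small, bounded subset of hosts to target.

    The fraction limits relative exposure; the hard cap ensures a bad
    experiment can never take out more than a handful of hosts, even
    in a very large fleet."""
    n = min(cap, max(1, math.floor(len(hosts) * fraction)))
    return random.sample(hosts, n)
```

Starting with a cap of one or two hosts in staging, then slowly widening it in production as confidence grows, is the usual progression.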
Automate and continuously run chaos experiments

What’s so great about Netflix’s Simian Army is that the tools are constantly running random chaos throughout their architecture. Continuous chaos helps the team automatically identify issues and allows the team to spend more time building out new features and services. By automating chaos tests to the same level as your CI/CD pipeline, you’re continuously improving both current systems and future ones. Deeper system knowledge leads to a team that’s able to develop new services with fewer issues. And, when incidents with new releases inevitably pop up, you can detect issues faster and remediate incidents with little to no customer impact.
Run your experiments with confidence

When it comes to chaos engineering, confidence is key. Do your homework before running the chaos experiment to minimize the blast radius and keep engineers on-hand in case of an emergency. But, at the end of the day, you’ll simply need to unleash the chaos and see what happens. Be confident in your approach to chaos testing and make sure you take detailed notes. Then, you can learn from your mistakes, improve the way you approach chaos engineering and deliver highly performant systems faster.
Benefits of chaos testing
Chaos testing directly addresses unknowns in highly complex systems. While you can usually test the velocity and flexibility of your products through numerous DevOps tools, chaos experiments are the only way to truly see how your system functions during an outage – helping you build deeper subject matter expertise and confidence in your engineering org. Chaos engineering allows teams to scale quickly without sacrificing the underlying reliability of their services.
How VictorOps took on chaos engineering practices
A little over a year ago, VictorOps began to formalize our process for chaos engineering and began conducting our first experiments. When you first dive into chaos engineering, there is a ton of risk involved. Not only do you not know how the system will respond to your chaos experiments, but you really aren’t sure if you’re tracking everything you need to track. It’s hard to prepare for something that is completely unknown to you.
So, for the team’s first chaos experiments, we decided to run them in staging and to remove any experiments that could affect customer environments (e.g. sending alert notifications to customers from the staging environment). While the ideal state is to run chaos through production systems, running your first few chaos experiments in staging is usually a good idea. This will help you tweak monitoring tools, reconfigure your experiments, create better incident response plans and learn more about containing the blast radius.
During the first experiments, the team understood that visibility and collaboration were imperative to success. So, the team decided to document everything in a dedicated Slack channel and a dedicated Grafana dashboard for system monitoring. The team also wanted to ensure nothing affected the production environment. So, two-factor authentication was added to production servers and the team defined criteria for backing out of experiments and criteria for resetting state.
All in all, the first chaos day went fairly smoothly. There were a number of hiccups and failed alerts, but there was no customer impact and we learned a lot about how our system handled failure. Without risk, there can be no reward.
Before taking on chaos engineering, it’s important to get your people, processes and technology to a somewhat stable state. You should have a proactive system for incident detection and response, and you shouldn’t be spending all of your time reacting to incidents in production.
You first need to stabilize your system from real chaos in order to start running intentional chaos. Then, when you’re ready to take on chaos engineering, you’ll have a holistic system for incident response and remediation – helping you visualize the performance of your systems and quickly collaborate when an incident comes up.
VictorOps is a holistic incident response and remediation tool – helping teams collaborate in real-time and get alerts to the right people at the right time. Sign up for a 14-day free trial or request a personalized demo to improve incident management and make the most of your chaos tests.