We recently held our first Chaos Day at VictorOps. Although we dove into the day with our eyes wide open, we still came across several unexpected behaviors. You’ll never be able to prevent every anomaly on a Chaos Day, and you shouldn’t try, because chaos is the point. But there are certain things to consider as your team gears up for its first Chaos Day. It’s important to outline which tests you’ll run, which areas you’d like to break, and what you’d like to gain from the experience.
Before you start your Chaos Day, each team needs to define the goals of their tests so everyone involved in the experiments (and everyone within your organization) understands why tests are being run.
Have each team define the:
Defining these items will help each team scope out exactly which experiments to run. This will also help each team scope the workload for their day. You don’t want teams committing to six experiments if the Chaos Day window will only allow for two or three. By exposing what you’re testing and what you’re expecting to see, you’ll be able to roughly determine the amount of time each test will take.
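As a rough illustration of that scoping step, here is a minimal sketch that checks whether a set of planned experiments fits in the Chaos Day window. The `Experiment` fields, the `fits_window` helper, and the sample numbers are all hypothetical; they simply mirror the ideas above (what you’re testing, what you expect to see, and how long you think it will take).

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """One planned chaos experiment (field names are illustrative assumptions)."""
    team: str
    target: str             # what is being tested or broken
    expected: str           # what the team expects to see
    estimated_minutes: int  # rough time budget, including recovery


def fits_window(experiments, window_minutes):
    """Walk the experiments in priority order and keep each one that still
    fits in the remaining time; everything else is deferred."""
    selected, deferred = [], []
    used = 0
    for exp in experiments:
        if used + exp.estimated_minutes <= window_minutes:
            selected.append(exp)
            used += exp.estimated_minutes
        else:
            deferred.append(exp)
    return selected, deferred


# Hypothetical plan: three 90-minute experiments in a 4-hour window.
plan = [
    Experiment("platform", "kill one cache node", "traffic fails over", 90),
    Experiment("data", "fill a disk on a replica", "alert fires, no data loss", 90),
    Experiment("web", "add latency to an upstream API", "timeouts degrade gracefully", 90),
]
selected, deferred = fits_window(plan, window_minutes=240)
print("run today:", [e.target for e in selected])
print("save for next time:", [e.target for e in deferred])
```

Ordering the list by priority before calling the helper means the experiments you care about most are the ones that make the cut.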
To have a successful Chaos Day, you need not only to plan the tests in advance but also to have certain tools and processes in place before you get started. During our first Chaos Day, we took advantage of the following:
If your team uses Slack (or some other ChatOps tool), create a Chaos Day channel for note taking so everyone can see what’s happening in real time. Each test should have a dedicated scribe whose sole purpose is to post information into that channel. This lets people outside a specific test follow along, and it gives you a record to review when you look back on the experiment.
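If your scribes want their notes to be easy to line up later, a consistent timestamped format helps. The sketch below only builds the text of a note; it does not call Slack or any other chat API, and the `format_note` name, the bracketed layout, and the sample values are assumptions made for illustration.

```python
from datetime import datetime, timezone


def format_note(test_name, text, when=None):
    """Build one timestamped line a scribe could paste into the channel.

    Times are rendered in UTC so notes from different teams line up.
    Pass a timezone-aware datetime; if `when` is omitted, the current
    time is used.
    """
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    return f"[{stamp}] [{test_name}] {text}"


# Hypothetical note from a scribe during a database failover test.
print(format_note("db-failover", "primary stopped, waiting for replica promotion"))
```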
Decide before your Chaos Day which monitoring tools you’ll use to see the status of your system, and figure out how you’ll use the dashboards within each tool to remediate chaos faster. Take screenshots of dashboards for future reference, and note the time when events happen so you can later set the correct parameters in the monitoring tool, such as the time range, and quickly find any issues.
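One way to use those noted event times afterward is to turn them into a time range for your dashboards. This is a minimal sketch, not tied to any particular monitoring tool: the `dashboard_window` helper and the five-minute padding are assumptions, and you would plug the resulting start and end into whatever time picker your tool provides.

```python
from datetime import datetime, timedelta, timezone


def dashboard_window(event_times, padding_minutes=5):
    """Return (start, end) covering every noted event, padded on both sides
    so you can see the system just before and just after the experiment."""
    if not event_times:
        raise ValueError("need at least one event time")
    pad = timedelta(minutes=padding_minutes)
    return min(event_times) - pad, max(event_times) + pad


# Hypothetical event times taken from a scribe's notes (UTC).
events = [
    datetime(2019, 3, 1, 15, 0, tzinfo=timezone.utc),   # fault injected
    datetime(2019, 3, 1, 15, 20, tzinfo=timezone.utc),  # service recovered
]
start, end = dashboard_window(events)
print(start.isoformat(), "to", end.isoformat())
```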
Every experiment deserves a retro. To make the most of your retro, be sure a team member is taking notes on how each test behaves while it’s happening. Be sure to ask:
Having these notes will help your team learn from each experiment and plan future Chaos Days. There’s no point in repeating tests if you get the same results. Be smart with your experimentation and document everything so you’re making progress every time you test.
When holding a broader evaluation of your Chaos Day, consider what happened during each experiment—both predicted and unexpected. And don’t forget to examine learnings from each test—how you can fix the system or process, as well as how you can improve the experiment in the future.
Because you identified and used the right tools during your Chaos Day, you’re situated nicely to evaluate what happened and make positive change for the future. The main things you’ll want to consider when you evaluate your Chaos Day include:
Chaos Days reveal weaknesses within your system and your processes. Don’t simply acknowledge them and move on to the next test. Take action items from your day. Perhaps you identified missing runbooks, or maybe you saw a tool one team was using that would help another team. Learning from your Chaos Day is the goal. Learn where your system is weak, find ways to improve it, and learn where your team’s weaknesses lie so you can make positive change.