Jonathan Schwietert | April 13, 2018 | DevOps, Monitoring & Alerting, Post-Incident Review, SRE, Chaos Engineering
We decided to embark on a journey to make our systems more reliable by creating intentional chaos. Our team developed the SRE Council, made up of engineers from different areas of the company, who would be tasked with creating chaos and improving the reliability of our services.
To read the complete story, you can download our new SRE eBook, “Build the Resilient Future Faster: Creating a Culture of Reliability”.
It was time. The SRE Council’s first big initiative was to cover our product with Black Box Alerts to achieve a proactive stance that would defend our customers’ experience. Once the plan was built, our engineering teams executed it, meticulously preparing our monitoring tools, metrics, dashboards, alerts, and thresholds. The Council’s second initiative was to integrate Chaos Engineering into the culture of VictorOps. We used a chaos event as an avenue to prove out our first initiative: product-wide Black Box Alerts.
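VictorOps hasn’t published its alerting logic, so purely as an illustration, here is a minimal sketch of how a black-box probe result could be classified against a latency threshold. The `ProbeResult` type, the 500 ms default, and the state names are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    ok: bool            # did the probe reach the endpoint and get a valid response?
    latency_ms: float   # end-to-end response time observed by the probe

def evaluate_probe(result: ProbeResult, latency_threshold_ms: float = 500.0) -> str:
    """Classify a single black-box probe result into an alert state."""
    if not result.ok:
        return "critical"   # endpoint unreachable or returned an error
    if result.latency_ms > latency_threshold_ms:
        return "warning"    # up, but slower than the agreed threshold
    return "ok"
```

The point of a black-box check is that it only sees what a customer would see: reachability and speed, not internal metrics.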
Excitement was in the air as the entire company looked forward to performing and observing this foray into Chaos Engineering. The company had not only been informed of the event but educated on chaos, provided with reasonable expectations, offered Q&A time with the SRE council, and provided with communication methods for the event. As expected, our first chaos event brought valuable learnings about our system and alerts, as well as how to better perform chaos experimentation.
Let’s take a look at how it unfolded and what we’d like to change.
We realized the inherent risk that chaos events embody could breed fear and hesitation, so we countered both with over-communication and education. We briefed the entire company five times and held a one-on-one session with the leadership team. You’ll see our timeline below for more details on how we pulled this off. In the end, this proved effective and set the stage for the big day to be well supported and understood.
Every step of the journey, the VictorOps leadership team provided thoughtful feedback, suggestions, encouragement, and best of all, excitement! Why? They wanted to put not just our engineering team but the whole company in a proactive stance, one that prepared us to move value into customers’ hands faster while ensuring reliability wasn’t sacrificed. This top-down support of SRE initiatives empowered and accelerated every move we made.
If you have leadership buy-in, don’t assume this means your entire organization is on board. You must still aim to achieve buy-in from the affected teams and individuals, or you’ll be risking support not only for your chaos initiative but potentially for future efforts of your team.
No buy-in from your leadership team? You have two options: achieve it or go grassroots. Having worked in both large-scale corporate environments and a late-stage startup, I’ve seen that both approaches have merit and carry their own unique challenges. It’s always a respectable move to give your leadership team an opportunity to back and support positive initiatives that should benefit them and the company. If that support isn’t achieved, though, grassroots efforts have the benefit of ironing out the kinks before pitching a well-formed and proven idea to leadership for expansion of the practice.
Limiting the blast radius for a first-time chaos event is certainly a recommended practice, and it paid off in spades as our first chaos day came and went with zero production incidents. Is this guaranteed? No, it isn’t, so don’t promise it to anyone. Be intentional about reducing the blast radius, then be honest about the remaining risks. We’re taking an iterative approach to chaos: starting small and safe, and moving toward a system that runs under constant chaos in production. The main step that minimized our blast radius was performing these experiments in our staging environment. Next, we removed high-risk experiments that could affect our production system or our customers, like sending notifications to them from our staging environment.
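As an illustration of the blast-radius rules above (staging only, no high-risk experiments), the planning step can be sketched as a simple filter. The `Experiment` fields and function name are invented for this example, not VictorOps tooling:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    target_env: str   # e.g. "staging" or "production"
    high_risk: bool   # e.g. could page or email real customers

def plan_chaos_day(experiments, allowed_env="staging"):
    """Keep only experiments that stay inside the agreed blast radius."""
    return [e for e in experiments
            if e.target_env == allowed_env and not e.high_risk]
```

Encoding the rules as data makes the exclusions explicit and reviewable, rather than relying on everyone remembering which experiments were deemed too risky.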
With that accomplished, we consulted our infrastructure and leadership teams to get their perspectives and suggestions for staying safe. This provided us with excellent advice for disaster prevention, leading us to the following items:
However, prevention is a goal, not a guarantee, so we were very transparent that disaster was still possible. We made sure to not only focus on disaster prevention but also to ensure we could handle a disaster if one occurred. We arranged a dedicated chaos incident commander for the day and communicated that to the entire engineering team as well as our support team, the ones responsible for customer communication during a production outage. With these protections in place, we’ve now built a reusable framework for performing chaos experiments as we move forward.
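A reusable experiment framework of the kind described above might look roughly like this sketch: a steady-state check before and after fault injection, a rollback that always runs, and a hook for notifying the incident commander. Every name here is an assumption for illustration, not a VictorOps internal:

```python
def run_experiment(name, inject, check_steady_state, rollback, notify_commander):
    """Run one chaos experiment with explicit safety rails.

    inject:             callable that introduces the fault
    check_steady_state: callable returning True when the system looks healthy
    rollback:           callable that undoes the fault
    notify_commander:   callable taking a message for the incident commander
    """
    if not check_steady_state():
        notify_commander(f"{name}: skipped, steady state not met before injection")
        return "skipped"
    healthy = False
    try:
        inject()
        healthy = check_steady_state()
    finally:
        rollback()  # always undo the fault, even if injection raised
    if not healthy:
        notify_commander(f"{name}: steady state lost during experiment")
    return "passed" if healthy else "failed"
```

The pre-check and the `finally` rollback are the two rails that matter most: never inject into a system that is already unhealthy, and never leave an injected fault behind.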
There are certainly multiple ways to cause the failure of any given service or feature. Although we aimed for authenticity in how we caused failure, in some cases it took performing the experiment to realize it was about as authentic as a reality TV show, and thus provided little value. You can’t get everything right on the first try, and sometimes it’s simply best to learn via experience. Because of this, we’ve re-evaluated those experiments and realized we not only needed a new approach, we also needed improved tooling. For example, reproducing a DB failure for a service by dropping its network or storage is one thing; causing specific errors on a WebSocket connection is entirely different.
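To illustrate that tooling gap: injecting a specific error on a WebSocket-style connection calls for a targeted fault injector rather than a blunt network drop. This toy wrapper (the class, the fake connection interface, and the choice of close code 1011, RFC 6455’s “internal error”, are purely illustrative) fails a connection in a controlled way after a set number of successful reads:

```python
class FaultInjectingSocket:
    """Wrap a WebSocket-like connection and inject a specific failure on demand."""

    def __init__(self, conn, fail_after=3, close_code=1011):
        self.conn = conn
        self.fail_after = fail_after  # succeed this many reads, then fail
        self.close_code = close_code  # 1011 = "internal error" in RFC 6455
        self._reads = 0

    def recv(self):
        self._reads += 1
        if self._reads > self.fail_after:
            # Simulate the server closing the socket with a specific code,
            # rather than the whole network disappearing.
            raise ConnectionError(f"injected close, code={self.close_code}")
        return self.conn.recv()
```

The difference from dropping the network is precision: the consumer sees the exact error path you want to exercise, not a generic timeout.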
Our last notable learning shines a spotlight on what may seem obvious at first and becomes even more noticeable on the day of—the importance of how people are organized during your experimentation. As you’ll see below, we initially laid out five roles. You’ll probably notice that the scribe had a lot of responsibilities—and you’d be right. You may also notice that audience participation was completely nebulous, with no structure or real responsibility. Needless to say, the scribe role is being split up, and audience participation is being transformed into a more methodical live QA role with defined responsibilities and methods for recording findings.
Of course, we also learned A LOT about our systems and our Black Box Alerting. The first chaos we experienced that day was a problem that surfaced in our staging environment before any experimentation had even started! Once it was resolved, the day progressed as expected, albeit slightly behind schedule. Teams performing experiments saw a lot of success, as well as some extremely important lessons:
If you’re still interested in the full, detailed story—download our eBook today!
In summation, here are some key facts and lessons from the day:
Our communication timeline:
- 4 months prior
- 2 months prior
- 8 days prior
- 3 days prior
- 1 day prior

On the day itself: Chaos Incident Commander
Don’t forget to sign up for your free trial today! Combine these SRE lessons with the power of VictorOps to help make your system even more reliable.