VictorOps' First Leap into Chaos Engineering


We decided to embark on a journey to make our systems more reliable by creating intentional chaos. Our team formed the SRE Council, made up of engineers from different areas of the company, which would be tasked with creating chaos and improving the reliability of our services.

To read the complete story, you can download our new SRE eBook, “Build the Resilient Future Faster: Creating a Culture of Reliability”.

It was time. The SRE Council’s first big initiative was to cover our product with Black Box Alerts to achieve a proactive stance that would defend our customers’ experience. With the plan built, our engineering teams executed it, meticulously preparing our monitoring tools, metrics, dashboards, alerts, and thresholds. The Council’s second initiative was to integrate Chaos Engineering into the culture of VictorOps. We utilized a chaos event as an avenue to prove out our first initiative—product-wide Black Box Alerts.

Excitement was in the air as the entire company looked forward to performing and observing this foray into Chaos Engineering. The company had not only been informed of the event but educated on chaos, provided with reasonable expectations, offered Q&A time with the SRE council, and provided with communication methods for the event. As expected, our first chaos event brought valuable learnings about our system and alerts, as well as how to better perform chaos experimentation.

Let’s take a look at how it unfolded and what we’d like to change.

We Fear What We Do Not Understand

We realized the inherent risk chaos events embody could breed fear and hesitation. So, to dispel both, we approached our first chaos event with over-communication and education. We communicated with the entire company five times and held a one-on-one session with the leadership team. You’ll see our timeline below for more details on how we pulled this off. In the end, this proved effective and set the stage for the big day to be well supported and understood.

Leadership Buy-In, or Lack Thereof

Every step of the journey, the VictorOps leadership team provided thoughtful feedback, suggestions, encouragement, and best of all, excitement! Why? It was their desire to position our engineering team—no, our company—in a proactive stance that prepared us to move value into customers’ hands faster while ensuring reliability isn’t sacrificed. This top-down support of SRE initiatives empowered and accelerated every move we made.

If you have leadership buy-in, don’t assume this means your entire organization is on board. You must still aim to achieve buy-in from the affected teams and individuals, or you’ll be risking support not only for your chaos initiative but potentially for future efforts of your team.

No buy-in from your leadership team? You have two options: achieve it or go grassroots. Having worked in both large-scale corporate environments and a late-stage startup, we’ve found that both approaches to getting buy-in have merit and carry their own unique challenges. It’s always a respectable move to give your leadership team an opportunity to back and support positive initiatives that should benefit them and the company. However, if buy-in isn’t achieved, grassroots efforts have the benefit of ironing out the kinks before pitching a well-formed and proven idea to leadership for expansion of the practice.

Blast Radius, Disaster Prevention, and Disaster Preparedness

Limiting the blast radius for a first-time chaos event is certainly a recommended practice, and it paid off in spades as our first chaos day came and went with zero production incidents. Is this guaranteed? No, it isn’t, so don’t promise it to anyone. Be intentional about reducing the blast radius and then honest about the remaining risks. We’re taking an iterative approach to chaos: starting small and safe, and moving toward a system that runs under constant chaos in production. One of the main decisions that minimized our blast radius was performing these experiments in our staging environment. Next, we removed high-risk experiments that could affect either our production system or our customers—like sending notifications to them from our staging environment.
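To make that staging-only constraint and the high-risk exclusions harder to bypass by accident, a simple pre-flight guard can be scripted. The sketch below is illustrative only; the experiment names, environment check, and exclusion list are assumptions, not our actual tooling.

```python
# Hypothetical pre-flight guard: refuse to run an experiment unless it
# targets staging and isn't on the high-risk exclusion list.
HIGH_RISK_EXPERIMENTS = {
    "customer-notification-outage",  # could page real customers from staging
}

def preflight_check(experiment: str, target_env: str) -> None:
    if target_env != "staging":
        raise RuntimeError(
            f"Refusing to run '{experiment}' against '{target_env}'; "
            "only staging is in scope for this chaos day"
        )
    if experiment in HIGH_RISK_EXPERIMENTS:
        raise RuntimeError(
            f"'{experiment}' is excluded: it could affect production or customers"
        )

preflight_check("db-connection-loss", target_env="staging")  # passes silently
```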

With that accomplished, we consulted our infrastructure and leadership teams to get their perspectives and suggestions for staying safe. This provided us with excellent advice for disaster prevention, leading us to the following items:

  • Provide company visibility into what is happening via a dedicated Slack channel and Grafana for system monitoring
  • Assure an alternative staging environment is available to validate any urgent production changes
  • Add 2-factor authentication to production servers as an “are you sure?” checkpoint
  • Define criteria for backing out of experiments
  • Define criteria for resetting state/data

However, prevention is a goal—not a guarantee. So, we were transparent that disaster was still possible, and we made sure to focus not only on preventing a disaster but also on handling one if it occurred. We arranged a dedicated chaos incident commander for the day and communicated that to the entire engineering team as well as our support team—the ones responsible for customer communication during a production outage. With these protections in place, we’ve now built out a sort of framework for performing chaos experiments that we can reuse as we move forward.
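As a rough illustration of what such a reusable framework might look like, here is a minimal sketch of an experiment template that bakes in a hypothesis, abort criteria, and a back-out plan that always runs. Every name in it is an assumption for illustration, not the tooling we actually use.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ChaosExperiment:
    """Illustrative experiment template: hypothesis, failure injection,
    abort criteria, and a back-out plan that always executes."""
    name: str
    hypothesis: str
    inject_failure: Callable[[], None]   # e.g. stop a service in staging
    revert: Callable[[], None]           # the back-out plan
    abort_criteria: List[Callable[[], bool]] = field(default_factory=list)

    def run(self) -> None:
        print(f"[{self.name}] hypothesis: {self.hypothesis}")
        try:
            self.inject_failure()
            if any(check() for check in self.abort_criteria):
                print(f"[{self.name}] abort criteria met; backing out early")
        finally:
            self.revert()  # run the back-out plan no matter what happened
```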

Authentic Reality TV?

There are certainly multiple ways to cause the failure of any given service or feature. Although we aimed for authenticity in how we caused failure, in some cases it took performing the experiment to realize that it was about as authentic as a reality TV show—and thus provided little value. You can’t get everything right on the first try, and sometimes it’s simply best to learn via experience. Because of this, we’ve re-evaluated those experiments and realized that we not only needed a new approach, we also needed improved tooling. For example, reproducing a DB failure for a service by dropping its network or storage is one thing, but causing specific errors on a WebSocket connection is entirely different.
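To make the contrast concrete: a blunt network-level cut-off is easy to script, but forcing a specific, realistic error on an individual WebSocket frame needs a hook inside the application. The wrapper below is purely hypothetical; `send` stands in for whatever real send function such tooling would wrap.

```python
import random

def flaky_websocket_send(send, payload, error_rate: float = 0.25):
    """Hypothetical fault-injection wrapper: fail a fraction of frames with a
    specific error, instead of the whole connection silently disappearing
    (which is roughly what a network-level cut-off looks like)."""
    if random.random() < error_rate:
        raise ConnectionResetError("injected: simulated mid-stream WebSocket failure")
    return send(payload)
```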

Order vs Chaos

Our last notable learning shines a spotlight on something that may seem obvious at first and becomes even more noticeable on the day of the event—the importance of how people are organized during your experimentation. As you’ll see below, we initially outlined five roles. You’ll probably notice that the scribe had a lot of responsibilities—and you’d be right. You may also notice that audience participation was completely nebulous and provided no structure or real responsibility. Needless to say, the scribe role is being split up, and audience participation is being transformed into a more methodical, live QA role with defined responsibilities and methods for recording the findings.


Other Learnings

Of course, we also learned A LOT about our systems and our Black Box Alerting. The first chaos we experienced that day came from a problem that surfaced in our staging environment before any experimentation had even started! Once resolved, the day progressed as expected, albeit slightly behind schedule. Teams performing experiments saw a lot of success and learned some extremely important lessons:

  • Alerts didn’t fire - Icinga alerts were misconfigured because they were misunderstood
  • Alerts partially fired - New Relic “incident preference” led to unintended aggregation of alerts
  • Alerts were less valuable than expected - they were unclear and caused confusion
  • Unexpected office DNS caching threw experiments off
  • Thresholds were impossible to reach due to orthogonal keep-alive logic
  • Thresholds were difficult to reach in the staging environment or within the experiment time-box
  • Alerts were missing runbooks
  • Identified new, valuable alerts
  • Identified flawed detection techniques
  • Identified improved ways to reproduce failure
  • Realized the need to test alerts that are rarely triggered or require extraordinary conditions
  • Identified the need for improved tooling
  • Some experiments caused multiple alerts to fire—unintentionally
  • Experiments can and should be scripted (see the sketch after this list)
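Building on the “experiments can and should be scripted” lesson, the sketch below shows one way a scripted experiment could also capture the scribe’s two timing metrics automatically. The three callables (`inject`, `monitoring_has_alerted`, `victorops_has_notified`) are placeholders for real tooling and are assumptions, not part of our actual setup.

```python
import time

def run_scripted_experiment(inject, monitoring_has_alerted, victorops_has_notified,
                            timeout_s: int = 600, poll_s: int = 5) -> dict:
    """Hypothetical runner: trigger a failure, then record 'time to know'
    (monitoring identified the problem) and 'time to detection' (VictorOps
    notified us). All three callables stand in for real tooling."""
    start = time.monotonic()
    inject()
    time_to_know = time_to_detection = None
    while time.monotonic() - start < timeout_s:
        if time_to_know is None and monitoring_has_alerted():
            time_to_know = time.monotonic() - start
        if victorops_has_notified():
            time_to_detection = time.monotonic() - start
            break
        time.sleep(poll_s)
    return {"time_to_know_s": time_to_know, "time_to_detection_s": time_to_detection}
```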

If you’re still interested in the full, detailed story—download our eBook today!

Wrap-up

In summation, here are some key facts and lessons from the day:

  • Experimented with 40 alerts…11 of which failed to fire or couldn’t be reproduced
  • 14 experiments could not be performed due to holes in testing plans
  • 2 alerts fixed during experimentation, 3 more within 48 hours
  • 6 mini-retrospectives were performed, identifying action items to address findings

The VictorOps Chaos Day Communication Schedule:

4 months prior

  • Mention: Chaos is coming…
    • What: Foreshadowing the SRE Council’s long-term goals in our Product Increment presentation.
    • Audience: Entire company

2 months prior

  • Mention: Chaos is coming…
    • What: A promise of the SRE Council’s near-term goals in our Product Increment presentation.
    • Audience: Entire company
  • Presentation: Chaos 101
    • What: Education on the principles of Chaos Engineering
    • Audience: Entire company

8 days prior

  • Email: VictorOps’ first Chaos Event!
    • What: Informational email with a recap of chaos engineering and experimentation, links to resources, clearly stated expectations, and the upcoming schedule
    • Audience: Entire company

3 days prior

  • Presentation: SRE & Chaos Engineering
    • What: Describing the motivations behind Chaos Engineering
    • Audience: Leadership team

1 day prior

  • Q&A Session: Chaos Day Q&A
    • What: Opportunity for anyone to ask questions & present concerns
    • Audience: Anyone

Day of

  • Slack: Central point of communication for the day
    • What: Running tab of ongoing experiments, including all activity and post-experiment mini-retros
    • Audience: Anyone

Proposed Roles for Chaos Experiments:

Scribe

  • Assure the hypothesis and risk assessment have been created
  • Record how the experiment unfolds
  • Collect data (graphs, alerts, times, etc.) while the experiment is performed
    • Time to know (from the moment we trigger → monitoring has identified the problem)
    • Time to detection (trigger → VictorOps has notified us)
  • Note whether or not the Black Box Alerts were triggered
  • Gather information from the mini-retro after the test

Driver

  • Perform experiment
  • Provide a full history of actions performed (command line, Jenkins jobs, Puppet modifications, etc.)
  • Verify alert was triggered

Incident Commander

  • Assure back-out plan is defined
  • Keep an eye on the back-out plan during tests
  • In an incident, handle any communication with the Chaos Incident Commander

Chaos Incident Commander

  • Communication point between Ops Support and the Incident Commander for the team under test
  • Update internal Statuspage & Slack channel

Audience

  • Use the system while the test is performed

Don’t forget to sign up for your free trial today! Combine these SRE lessons with the power of VictorOps to help make your system even more reliable.
