Systems break, customers get upset, and it costs you money. So, it’s important to know how you’ll communicate during an outage, both externally and internally. In the world of CI/CD and cloud-based services, teams need to be prepared for outages.
DevOps and IT teams should be laser-focused on preparing for failure. Not only should failure preparation be built into your tools, but it also needs to be built into team structures and processes. When crisis strikes, are you prepared to communicate the outage to customers and collaborate internally to quickly remediate the problem?
Being prepared is more than simply creating a toolchain to help you communicate about an outage. In fact, according to our State of On-Call Report, 73% of an incident’s lifespan, on average, is spent in the incident response phase. So, rather than focusing on the tools for communicating during an outage, you should be focused on the process and your people.
Think about your DevOps or IT teams like a military unit. These teams should always be supplied with the tools and training they need to be successful–and be prepared for the worst. By gaining a deep understanding of the worst possible situation, you can back into the tools and processes that help people the most.
When your team is better prepared, and has a deeper system knowledge, they can more easily remediate incidents and make on-call suck less–even if your tools and processes aren’t perfect. Focus on people and the way they’ll perceive an outage first, then refine your tooling and process to make sure everyone gets the information they need when they need it.
The customer comes first. Even if you experience an outage, end users will always appreciate candor and transparency. If you don’t do a good job of outwardly communicating your awareness of an incident and efforts toward a fix, you’ll create a lot of confusion. Doing your best to limit confusion can reduce the noise and help limit the negative side effects of an outage.
Your customer support team is likely being inundated with tickets for one issue because your end users have no idea if you’re even aware of the problem. Simply establishing channels or a status page for communicating an outage can greatly appease customers and ease tension. And in your messaging, be honest about the situation. Don’t overpromise and underdeliver on a solution to the outage. If you tell end users that your system will be up and running in thirty minutes, it had better take no more than thirty minutes. If you run into other issues and can’t recover state in the given timeframe, you’ll have doubly angered your customers.
Communicating internally during an outage is nearly as important as the external communication. What should the status page say? As a conservative estimate, how long does the team think it will take to come up with a fix for the outage? What happens if your normal real-time communication methods (e.g. Slack, SMS, email) aren’t working?
While some of these questions may seem overly cautious (I mean, when has SMS ever gone out?), they need to be addressed. First, establish what a “normal” outage communication process will look like. Then, think about ways that your outage tools may experience failure, along with potential backup processes or ways to overcome any possible roadblocks.
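One way to picture a backup process is as an ordered list of channels that is tried until one works. The snippet below is a minimal sketch of that idea, not a real integration: the channel names and the `slack_down`, `sms_down`, and `phone_bridge` functions are hypothetical stand-ins for whatever your team actually uses.

```python
from typing import Callable, List, Tuple

# A sender takes a message and returns True on success.
# It may also raise if the channel is unreachable.
Sender = Callable[[str], bool]


def notify_with_fallback(message: str, channels: List[Tuple[str, Sender]]) -> str:
    """Try each channel in order; return the name of the first that succeeds."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # Treat any exception as a channel failure and try the next one.
            continue
    raise RuntimeError("all communication channels failed")


# Hypothetical stand-in senders, for illustration only.
def slack_down(message: str) -> bool:
    raise ConnectionError("Slack unreachable")


def sms_down(message: str) -> bool:
    return False


def phone_bridge(message: str) -> bool:
    return True


used = notify_with_fallback(
    "API outage: investigating",
    [("slack", slack_down), ("sms", sms_down), ("phone_bridge", phone_bridge)],
)
print(used)  # phone_bridge
```

The ordering of the list is the design decision worth discussing with the team ahead of time: it records, before anything breaks, which channel everyone should move to when the preferred one is down.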
Also, maybe most importantly, ensure that on-call responders have what they need right at their fingertips. Does your on-call team have the incident context and communication tools they need to immediately start working toward resolving an outage? The best way to figure out the tools and processes that work best for your people is to simply ask them. Figure out what could happen during an outage and build your processes and tooling around that.
The more you can centralize communication with alert information and escalation capabilities, the easier it will be for on-call engineers to quickly respond to an incident. Then, when integrated with external outage communication processes, the team can quickly update customers about any issues at the same time they’re working on them internally. A prepared on-call team will more quickly inform end users of outages, escalate issues to the necessary person or team, and work toward resolving the outage.
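To make the idea of escalation concrete, here is a small sketch of a time-based escalation policy. It is an illustration under assumed names and timings (the `EscalationStep` type, the responder labels, and the 10- and 30-minute thresholds are all hypothetical), not a description of how any particular on-call product works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationStep:
    after_minutes: int  # minutes an alert may stay unacknowledged before this step pages
    target: str         # hypothetical responder or team name


def targets_to_page(policy: List[EscalationStep], minutes_unacked: int) -> List[str]:
    """Return everyone who should have been paged by now, in escalation order."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [step.target for step in ordered if step.after_minutes <= minutes_unacked]


policy = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(10, "secondary on-call"),
    EscalationStep(30, "engineering manager"),
]

print(targets_to_page(policy, 12))  # ['primary on-call', 'secondary on-call']
```

Writing the policy down as data, rather than leaving it as tribal knowledge, makes it easy to review with the on-call team and to revisit after an incident.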
Becoming the most prepared team means you’ve learned from past mistakes. Too many outages can cripple your business and ruin team morale. Every time you experience an outage, your team should conduct a post-incident review. The post-incident review should cover everything from weaknesses in your tech stack to communication practices that contributed to the outage.
An outage rarely has a single root cause. Adding another threshold to your monitoring tools to watch for ETL lag or disk usage spikes can only do so much. If you’re only addressing the technical challenges in your system, you’re treating the symptom, not the core problem. Post-incident reviews should expose ways for your team to improve communication practices, both internally and externally, to help you create robust software and more resilient processes.
Communicate with end users where they live, and don’t make them hunt for outage information. Fully transparent communication practices build trust with end users and help your own team resolve incidents faster. Think about ways to communicate an outage, as well as ways to collaborate when your communication tools themselves are down. Resilience begins with understanding the ways your technology, process, or people can fail, then strengthening your system from there.
Centralize monitoring, alerting, and communication in one integrated on-call incident management solution. Try a VictorOps 14-day free trial and see how our holistic incident management platform makes on-call suck less and gives outage transparency to end users.