Outages in IT and DevOps are inevitable. Even the most well-defined processes aligned with the best infrastructure resources are bound to face outages. Some outages are planned for scheduled maintenance and are easier to handle and communicate. Others may occur abruptly and put your business and operations at risk – costing you vast amounts of time and money.
In a previous analysis of the costs of downtime, we found a Rand Group article stating that 98% of organizations said a single hour of downtime costs over $100,000. What's more, 33% of enterprises reported that one hour of downtime costs between $1 million and $5 million.
Outage recovery starts with contacting multiple people, assigning multiple tasks and allocating available resources to the recovery teams. This process can take hours if the team isn’t prepared. Any amount of delay puts you at risk of losing customers to the competition. Processes that depend on outage-affected resources may also become obsolete. With a predefined and automated communication plan, you can save substantially on recovery time and cost.
We put this guide together to help you improve processes, maintain brand reputation and ensure positive customer experiences with a foolproof IT outage communication plan.
Communicating with the internal DevOps, IT teams and stakeholders about an outage forms the base of your communication plan. Here are some best practices to follow for facilitating the outage information flow to your internal teams:
You can use alerting tools to pre-configure notifications and send them to the team that can identify, analyze and act on an outage in order of priority. You can also use these tools to provide multi-channel targeting via email, SMS or telephone, thereby reducing your inbound call volume and associated costs during an outage. Tools like Slack or Microsoft Teams are widely used by enterprises for instant notifications, quick hyperlinks to data, and other updates during an outage.
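As a rough illustration of priority-based, multi-channel routing, here is a minimal Python sketch. The `Alert` class, the priority names and the channel lists are all hypothetical and don't reflect any particular alerting tool's API; a real tool would expose this through its own configuration.

```python
from dataclasses import dataclass

# Hypothetical mapping from alert priority to notification channels.
CHANNELS_BY_PRIORITY = {
    "critical": ["phone", "sms", "chat"],
    "warning": ["chat", "email"],
    "info": ["email"],
}

# Lower number means more urgent; unknown priorities sort last.
PRIORITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


@dataclass
class Alert:
    service: str
    priority: str
    summary: str


def route_alerts(alerts):
    """Return (alert, channels) pairs, most urgent alerts first."""
    ranked = sorted(
        alerts,
        key=lambda a: PRIORITY_ORDER.get(a.priority, len(PRIORITY_ORDER)),
    )
    # Fall back to email for any priority not in the mapping.
    return [(a, CHANNELS_BY_PRIORITY.get(a.priority, ["email"])) for a in ranked]
```

Calling `route_alerts` with a mix of alerts returns the critical ones first, each paired with the channels it should go out on.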
Mobile access to alert data and communication tools enables rapid incident response. Smartphones are easy to reach during an emergency when you can’t immediately get to your laptop. An incident management mobile app can help DevOps and IT teams quickly act and collaborate around alert data and identify the cause of an outage. You can use chat and alert routing functionality in the mobile app to loop in other people and teams so they can quickly jump on an issue.
Automated escalations are a wise way to manage on-call rotations and incoming alerts. Automated escalations let you notify the next person in an on-call rotation in case someone is unavailable, ensuring consistent on-call coverage. This avoids unnecessary service interruptions and haphazard decision-making when immediate action is required during an unplanned outage.
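To make the escalation idea concrete, here is a small Python sketch. It assumes a simple ordered rotation and a fixed wait between pages; the function names, the five-minute default and the `would_acknowledge` set are illustrative assumptions, not how any specific on-call product works.

```python
from typing import Optional, Sequence, Set, Tuple


def escalation_schedule(rotation: Sequence[str], wait_minutes: int = 5):
    """Return (minutes_after_alert, name) pairs in escalation order."""
    return [(i * wait_minutes, name) for i, name in enumerate(rotation)]


def page_until_acknowledged(
    rotation: Sequence[str],
    would_acknowledge: Set[str],
    wait_minutes: int = 5,
) -> Tuple[Optional[str], Optional[int]]:
    """Page each person in turn until someone acknowledges.

    `would_acknowledge` stands in for the people who are actually
    reachable; a real system would wait for an acknowledgement event.
    Returns (name, minutes_elapsed), or (None, None) if nobody responds.
    """
    for delay, name in escalation_schedule(rotation, wait_minutes):
        if name in would_acknowledge:
            return name, delay
    return None, None
```

If the first person in the rotation is unavailable, the alert moves to the next person after the wait period instead of sitting unanswered.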
Give your stakeholders accurate information about the outage. What type of outage is it? How long has the outage persisted? When you give the right data in real time to business teams and other stakeholders, you help others across the organization take action to address the outage from all angles. You can also quickly let your colleagues know that your team is busy working the outage and that other tasks may be de-prioritized or reassigned.
You might not have the time to convey the same updates over and over to the stakeholders. Release company-wide updates or use something like StatusPage to continuously inform both internal and external stakeholders during an outage. You can also use many monitoring tools to create custom dashboards displaying information like the number of open incidents, their severity, and contact information for on-call engineers.
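The kind of dashboard summary described above could be computed along these lines. This is only a sketch: the incident dictionaries and their `status`, `severity` and `on_call` keys are made-up field names, not the schema of any monitoring tool.

```python
from collections import Counter


def summarize_incidents(incidents):
    """Summarize open incidents for a dashboard or company-wide update."""
    open_incidents = [i for i in incidents if i["status"] == "open"]
    return {
        "open_count": len(open_incidents),
        # Number of open incidents at each severity level.
        "by_severity": dict(Counter(i["severity"] for i in open_incidents)),
        # De-duplicated, sorted list of on-call contacts for open incidents.
        "on_call": sorted({i["on_call"] for i in open_incidents}),
    }
```

Publishing this summary in one place means stakeholders can check the current state themselves rather than asking the responding team.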
Create a list of engineers who have a role in incident resolution. Inform them about their roles and make sure they’re available during the outage. To create central control on communications, appoint an incident commander who can ensure an uninterrupted flow of information during an outage.
Being proactive in communicating an outage to your customers helps you gain their trust and mitigate any impact to brand reputation. Here’s how an IT outage communication plan can help you communicate with your customers:
Let your customers know you’re working on the issue. You can convey the information through StatusPage, another maintenance page on your website, email updates or social media updates – or a combination of everything. Also, let them know about the affected versions and dependencies, any workarounds, estimated time of recovery and details of any hotfixes/patches. Accurate, real-time communication is always helpful for reassuring your customers during an outage.
Provide regular updates and detailed information to your customers so they can prioritize their upcoming tasks. Communicate in a single voice and format – this makes for clearer messaging and builds customer trust, even when you’re handling an outage.
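One way to keep customer updates in a single format is to render every message from the same template. The Python sketch below assumes a hypothetical `StatusUpdate` structure; the field names and default wording are illustrative, so adapt them to whatever your status page or email tool expects.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StatusUpdate:
    status: str                      # e.g. "investigating", "resolved"
    summary: str
    affected: List[str] = field(default_factory=list)
    workaround: Optional[str] = None
    estimated_recovery: Optional[str] = None


def render(update: StatusUpdate) -> str:
    """Render an update in one consistent, customer-facing format."""
    lines = [f"[{update.status.upper()}] {update.summary}"]
    if update.affected:
        lines.append("Affected: " + ", ".join(update.affected))
    # Always include these lines so customers know where to look.
    lines.append("Workaround: " + (update.workaround or "none yet"))
    lines.append(
        "Estimated recovery: " + (update.estimated_recovery or "under investigation")
    )
    return "\n".join(lines)
```

Because every update passes through `render`, the status page, email and social posts all carry the same fields in the same order.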
The customer support team must always be ready to address urgent customer queries whenever an IT outage occurs. Your support team can actively update the status page and other communication channels in real time, reaching out to customers with pertinent information during and after the outage.
An IT outage communication plan should serve as a step-by-step guide to efficient outage resolution. Any delay in resolving an outage can cost you revenue and put you at risk of losing customers to the competition. A well-defined alerting and monitoring strategy, alongside a real-time communication plan, will help you avoid unnecessary workflows and ensure outages are resolved faster, resulting in less customer impact.
So, crafting a robust IT outage communication plan becomes as important for organizations as real-time outage response. Creating a surefire communication plan before the next big incident will save you from customer complaints, lost revenue and negative brand reputation.
See how a collaborative tool with a centralized view into IT alerting and on-call scheduling can drive rapid incident response and remediation. Sign up for a 14-day free trial of VictorOps or request a demo to make the most of your outage communication plan and make on-call suck less for your DevOps or IT team.