
The Understated Costs of Downtime


I’m not completely sure everyone knows the real costs of downtime, and it’s a helluva number… In fact, one study showed that Fortune 1000 companies pay between $1.25 billion and $2.5 billion in unplanned application downtime costs every year. That’s not a one-time hit; it recurs annually.

As the SaaS landscape of large, interconnected systems grows, you won’t completely avoid downtime—so you need to understand the costs.

Why Are Downtime Costs Overlooked?

Gartner came out with an article asking, “What is your primary criteria when evaluating network solutions?” Surprisingly, only 20% of respondents said “availability.” This raises the question: why do so many companies overlook the costs of downtime?

A deep dive into the real costs of downtime shows why availability and site reliability engineering are highly important components of your business.

Financial Costs of Downtime

In a previous post about root cause analysis, I mentioned the 2016 State of DevOps Report from Puppet and DORA, where they published this equation for calculating the cost of downtime:

Cost of Downtime = Deployment Frequency x Change Failure Rate x MTTR x Hourly Cost of Outage

Reducing the mean time to resolve (MTTR) an incident directly reduces the cost of downtime, because MTTR is a multiplier in the equation above. When your teams shorten the first three phases of the incident lifecycle (Detection, Response, Remediation), you’ll cut incident resolution time and, with it, the cost of an outage.
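To make the equation concrete, here’s a minimal Python sketch. The function name and the inputs (200 deployments per year, a 10% change failure rate, a 2-hour MTTR) are hypothetical values picked for illustration; only the $100,000 hourly figure echoes the number cited in this post. Treat it as a back-of-the-envelope calculation, not a benchmark.

```python
def downtime_cost(deploys_per_year, change_failure_rate, mttr_hours, hourly_outage_cost):
    """Cost of Downtime = Deployment Frequency x Change Failure Rate x MTTR x Hourly Cost of Outage."""
    return deploys_per_year * change_failure_rate * mttr_hours * hourly_outage_cost


# Hypothetical inputs, chosen only to illustrate the equation.
baseline = downtime_cost(
    deploys_per_year=200,        # assumed deployment frequency
    change_failure_rate=0.10,    # assumed: 10% of deployments cause an incident
    mttr_hours=2.0,              # assumed mean time to resolve, in hours
    hourly_outage_cost=100_000,  # the "over $100,000 per hour" figure cited in this post
)

# Same team, same failure rate, but MTTR cut in half.
improved = downtime_cost(200, 0.10, 1.0, 100_000)

print(f"Baseline annual cost:   ${baseline:,.0f}")
print(f"With half the MTTR:     ${improved:,.0f}")
print(f"Annual savings:         ${baseline - improved:,.0f}")
```

Because every term is multiplied together, halving MTTR halves the total cost, even if deployment frequency and change failure rate stay exactly the same.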

The Rand Group published an article last January stating that 98% of organizations said a single hour of downtime costs over $100,000. Even worse, 33% of those enterprises reported that one hour of downtime costs between $1 million and $5 million.

Downtime costs continue to rise as the use of software applications becomes more prominent across industries. The same Rand Group study mentioned above goes on to say the average cost of one hour of downtime has risen by 25 to 30 percent since 2008.

If costs of $100,000 per hour don’t scare you, maybe a British Airways incident that stranded thousands of customers will. According to the Forbes article about the incident, the total cost of that single incident was $102.19 million. Wow, that sucks…

And those numbers don’t include intangible costs such as damage to brand reputation, poor customer experience, or the engineering time and resources needed to conduct post-incident reviews.


Behavioral Costs of Downtime

In addition to the monetary burden of downtime, you need to look at how service interruptions affect the people involved, both customers and employees. ZDNet recently put together a great piece on the personal costs of downtime. The article, citing a study from the Washington Post, showed that 6.2 hours are lost every day due to interruptions. That’s 31 hours per week.

If we break down these stats, that’s 238 minutes/day lost, on average, to the interruptions themselves, 84 minutes/day to restart time, and an extra 50 minutes/day to stress and fatigue. Even further, a Carnegie Mellon University study showed that cognitive function can decrease by 20 percent after an interruption.
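If you want to check that the breakdown matches the 6.2 hours/day and 31 hours/week figures, here’s a quick Python sketch of the arithmetic. The variable names are mine, and the five-day work week is an assumption (it’s what the 31-hour figure implies).

```python
# Minutes lost per day, as quoted in the breakdown above.
minutes_lost_per_day = {
    "interruptions": 238,
    "restart time": 84,
    "stress and fatigue": 50,
}

WORK_DAYS_PER_WEEK = 5  # assumption: a standard five-day work week

total_minutes = sum(minutes_lost_per_day.values())
hours_per_day = total_minutes / 60
hours_per_week = hours_per_day * WORK_DAYS_PER_WEEK

print(f"Total minutes lost per day: {total_minutes}")
print(f"Hours lost per day:  {hours_per_day:.1f}")
print(f"Hours lost per week: {hours_per_week:.1f}")
```

The three components add up to 372 minutes, which is where the 6.2 hours per day comes from.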

Service interruptions cost people time, wages, and productivity. It’s a vicious cycle: the more downtime you experience, the more fatigued and lethargic your employees become, and the more it costs your business.

What Can You Do?

Hefty downtime costs, disgruntled employees, and unsatisfied customers pop up if you don’t focus on availability. Restructure your teams, processes, and tools to build more reliable systems without sacrificing agility.

Create DevOps teams that are as focused on SRE as they are on continuous deployment. Ask yourself, “Who owns our availability?” Fostering a culture of code ownership and accountability from beginning to end, for every person on the team, mitigates downtime and improves overall system knowledge.

You could even try to break things! Learn about your system and make it more robust by adding chaos engineering to your SRE process. As engineers receive more exposure to the system, they become better equipped to remediate incidents quickly.

We understand the costs of downtime. So, we built a collaborative, end-to-end incident management tool to help you maintain consistent service availability. Sign up for a 14-day free trial to see how VictorOps helps you build more reliable systems and applications.

Let us help you make on-call suck less.
