It’s common knowledge that many businesses lose money when they have an outage. But how much money exactly? And how do you make the case for better understanding the value of uptime?
With today’s consumers demanding more and expecting continuous delivery, it can be hard to guarantee uptime. However, once you arrive at hard numbers for what your business earns per week, per day, and per hour, you can weigh the cost of upgrades and operational expenses against the cost of downtime, while also gaining insight into the value of your IT organization.
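The math itself is simple. As a back-of-the-envelope sketch (the revenue figure below is hypothetical, and it assumes revenue accrues evenly around the clock, which is a simplification for most businesses):

```python
def downtime_cost(annual_revenue, outage_minutes):
    """Estimate revenue lost during an outage, assuming revenue
    accrues evenly across every minute of the year."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    revenue_per_minute = annual_revenue / minutes_per_year
    return revenue_per_minute * outage_minutes

# Hypothetical example: a business earning $50M/year, down for 45 minutes.
cost = downtime_cost(50_000_000, 45)
print(f"Estimated loss: ${cost:,.2f}")  # roughly $4,280
```

Even a crude estimate like this gives you a number to put next to the price tag of redundancy, monitoring, and on-call tooling.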
There are a few infamous incidents that show just how costly outages can be for a business that depends on its website to make money. In 1999, eBay had an outage that lasted for 22 hours and cost the company between 3 and 5 million dollars. (And that was 15 years ago!)
Additionally, Amazon recently lost approximately 4.7 million dollars during a 45-minute outage, and when Google is down for just five minutes, it loses close to $500,000. Those are big numbers for small(ish) outages.
So what if your company isn’t the size of Google? Are you immune to the effects of downtime? The answer, unfortunately, is no. According to Aberdeen Group research, the cost per hour of downtime increased – for businesses of all sizes – by 38% between 2010 and 2012.
Waiting until the s#*t hits the fan to think about whether you’re effectively monitoring your network is the wrong approach. By that time, the answer is already no, and you’re going to be fighting to get back up, not thinking about how your site went down.
Network monitoring is a necessary evil and one of the most difficult tasks an IT department has to tackle. The good news is that there are simple ways to improve how you’re monitoring your network. Processor.com posted an excellent article with some helpful tips on making this complex job a little easier.
The first tip is to use what you have. This may seem obvious, but many companies don’t fully utilize the capabilities of their network monitoring software…and with good reason. Network monitoring software like Nagios (one of our most popular integrations here at VictorOps) has lots of functionality and can be hard to configure unless you’ve taken a deep dive into the tool. If you haven’t, you may be missing out on some value-add functionality.
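As an illustration of the kind of configuration that often goes untouched, Nagios lets you tune how aggressively a service is checked and when an alert actually fires. A minimal sketch of a service definition (the host, template, and contact group names here are made up):

```
# Hypothetical Nagios service definition -- names are illustrative.
define service {
    use                  generic-service   ; inherit defaults from a template
    host_name            web-01
    service_description  HTTP
    check_command        check_http
    check_interval       5                 ; check every 5 minutes when healthy
    retry_interval       1                 ; re-check every minute after a failure
    max_check_attempts   3                 ; alert only after 3 consecutive failures
    contact_groups       on-call-admins
}
```

Settings like `retry_interval` and `max_check_attempts` are exactly the sort of “value-add functionality” that separates noisy monitoring from useful monitoring, and they only help if someone has taken the time to tune them.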
In line with the first tip, Processor.com suggests that IT teams take the time to learn. This requires making training a priority and figuring out if you’re getting the relevant information from the tool. Teams need to be asking the right questions to determine if current systems are working.
Jim Rapoza, Aberdeen Group senior research analyst, says many businesses don’t understand when and how problems occur on their networks because they don’t really know what’s normal – many of them have never sat down to determine a network baseline.
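Establishing a baseline doesn’t have to be elaborate. One simple approach, sketched below with made-up sample data: record a metric such as response time during normal operation, compute its mean and standard deviation, and flag readings that fall far outside that range.

```python
import statistics

def baseline(samples):
    """Compute a baseline (mean, standard deviation) from
    historical measurements, e.g. response times in milliseconds."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_anomalous(value, mean, stdev, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations
    from the baseline mean."""
    return abs(value - mean) > threshold * stdev

# Hypothetical response times (ms) collected during normal operation.
history = [102, 98, 110, 95, 105, 99, 101, 107, 96, 103]
mean, stdev = baseline(history)
print(is_anomalous(250, mean, stdev))  # a 250 ms spike -> True
print(is_anomalous(104, mean, stdev))  # within normal range -> False
```

Real traffic has daily and weekly cycles, so a production baseline would be computed per time window, but the principle is the same: you can’t recognize abnormal until you’ve measured normal.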
The final recommendation for improving your network monitoring is to log it. The only way to start seeing patterns is to collect as much data as possible and begin to use it in a way that allows for teachable moments.
“Aberdeen Group is starting to see big data approaches to network analytics in which tools and processes take all information from entities on the network creating logs, activity, and performance data and use it to create unified intelligence to get a big picture of network performance and activity.”
Statistics collection and graphs are an effective means to start visualizing what’s happening with your infrastructure. Here at VictorOps, those kinds of analytics are not considered nice extras, but things that need to be in place from day one.
DevOps and IT teams can also speed things up. By integrating an alerting and collaboration platform like VictorOps with your network monitoring solution, alerts, real-time problem-solving, and a record of what broke and how it was fixed all live in one place. Moving away from tribal knowledge and toward well-documented institutional knowledge is a no-brainer these days, but many IT teams haven’t made time for it yet.
To sum up: network monitoring doesn’t need to be intimidating, and it shouldn’t be ignored. Determine your baseline performance, make sure you’re getting the right alerts at the right time, and collect as much relevant data as possible. When your server or website goes down, you’ll have the data, alerts, and information you need to collaborate on a solution.
A crisis can blunt your problem-solving abilities momentarily, but with the right information and tools, you can be back up in no time and working on a playbook for avoiding future outages. After all: your infrastructure and the health of your team depend on it.