Monitoring: Trending and Visualization is Not an Optional Extra

Mike Meredith - January 11, 2013

Everyone understands the importance of having system and application monitoring in place right away. In the SaaS world, every minute of downtime means lost revenue and angry users. Trending and visualization, on the other hand, can sometimes be seen as an optional extra. Statistics collection and graphs showing things like CPU load or web hits per second displayed over time are frequently regarded as “nice to have”, not “need to have”.

In the rush to get a new platform out the door, many teams decide that trending and visualization can wait. I’ve worked in shops where the pace of getting a new product to market is frantic for months on end. “We’ll get trending up once things calm down a little” is a common refrain. Here’s why you should have trending and visualization running from day one:

Things never “calm down a little”, especially if your life revolves around responding to threshold alerts. If you don’t know that an important system or platform metric is heading toward an alert state until the alert fires, the daily interruptions of responding to those alerts will slow your project work to a crawl. If you can’t get your project work done, how will you ever make time to set up a trending and visualization system?

Launch time is a time of critical growth. If it’s a brand new platform, you don’t know how the market is going to respond, or how your customers will wind up using your platform. You and your team are still learning about your software stack, and probably haven’t identified half of your major pressure points yet. With trending and visualization in place you’ll be able to see from day one how your platform responds to load. If additional infrastructure or major architectural changes are needed to support growth, you’ll have enough warning to stay ahead of a load-based catastrophe.

Capacity planning is a moving target. Clearing away one pressure point exposes the next. You’ve scaled up the load balancers, and now your web servers can’t keep up. You’ve scaled up the web servers, and now your databases can’t keep up. It’s a continuous cycle. The ability to see not just what’s happening right now, but how all the parts of your platform responded to events in the past, is critical to identifying the next pressure point and relieving it before it becomes a problem.
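To make the idea of seeing a pressure point coming more concrete, here is a minimal sketch of the arithmetic a trend line gives you: fit a straight line to historical samples and estimate how long until it reaches a limit. The function names, the sample data, and the 90% limit are all hypothetical, and real metrics are rarely this linear, so treat it as an illustration rather than a forecasting method.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (time, value); time is in days in this sketch


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through the samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def days_until_threshold(samples: Sequence[Sample], threshold: float) -> Optional[float]:
    """Estimate days from the last sample until the fitted trend reaches threshold.

    Returns 0.0 if the trend is already at or past the threshold, and None if
    the trend is flat or moving away from it.
    """
    slope, intercept = fit_line(samples)
    last_t = samples[-1][0]
    current = slope * last_t + intercept  # fitted value at the last sample time
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


# Hypothetical data: disk usage (%) sampled once a day for a week.
disk_usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0),
              (4, 58.0), (5, 60.0), (6, 62.0)]
remaining = days_until_threshold(disk_usage, threshold=90.0)
print(f"Estimated days until 90% full: {remaining}")
```

With this made-up data the usage grows two points per day, so the estimate is two weeks of warning instead of a surprise alert. A graph of the same samples tells you the same thing at a glance, which is the point of having the history in the first place.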

Okay, you’ve decided you want trending in place. Where to start? If your platform is Linux-based, chances are your distribution of choice has several options right in the core repository. Cacti, Ganglia, Munin, and Graphite are all good tools in this space, and worthy of your consideration. There are also add-ons to Nagios or Icinga that can do the job. They each have a different focus and different strengths. I’ve worked with simple web stacks where Cacti did everything I needed, and with more complex clustered environments where Ganglia shone. You’ll need to do your homework and decide which tool best matches your software stack, network architecture, and scale.
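As a rough illustration of how little it can take to start feeding one of these tools, here is a sketch that sends a single data point to Graphite. It assumes a Carbon daemon listening for the plaintext protocol (one "path value timestamp" line per data point) on its default port 2003. The metric path and the host name in the comment are hypothetical, and the other tools listed above collect data in their own ways.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[float] = None) -> str:
    """Build one line of Graphite's plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line: str, host: str, port: int = 2003, timeout: float = 5.0) -> None:
    """Open a TCP connection to Carbon, send one metric line, and close."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical metric path; the timestamp is fixed so the output is repeatable.
line = format_metric("web01.http.hits_per_second", 42.5, timestamp=1357862400)
print(line, end="")

# To actually send it, point this at your own Carbon host (hypothetical name):
# send_metric(line, host="graphite.example.com")
```

Run from cron or from inside your application, something this small is enough to start building history while you evaluate the heavier options.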

It’s worth the effort to get something running as soon as you can. You’ll get value out of it right away. In the long term, it can make you the capacity planning hero who averts a growth crisis.