
Aggregate Monitoring for Full System Visibility

Dan Holloran June 29, 2018


Aggregate monitoring of your infrastructure is the cornerstone of understanding how your systems behave. Of course, you need to monitor the individual functions of your systems, but compiling data points over time from all areas of your systems, centralizing them, and visualizing them together gives you the greatest visibility.

You can’t see the upstream and downstream impacts of an event by simply measuring one aspect of your system. Today, due to the nature of interconnected systems and continuous delivery, monitoring solutions need to be constantly re-addressed and enhanced.

Aggregate monitoring refers to the compilation of data from multiple monitoring sources and time periods to create a comprehensive view of your system’s operations. In this article, we’ll explore some of the best ways to approach aggregate monitoring, time series data, and the visualization of that information.

The “Why” Behind Aggregate Monitoring

Aggregate monitoring won’t necessarily show you exactly what’s wrong in your system, but it can provide effective indicators that something could be wrong. The idea of aggregate monitoring is to alert engineers when something isn’t “working.” If your service is still actively doing what it’s supposed to, there’s really no need to wake someone up with an alert.

Of course, measuring low-level metrics such as CPU and memory usage has its place, but aggregate monitoring gives you a functional way to know when your service is “working.” From there, you can see when it’s not “working” and dig into lower-level metrics to start remediating an incident.
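As a rough illustration of alerting on whether the service is “working” rather than on any single low-level metric, the sketch below combines request and error counts from several sources into one success rate. The `RegionSample` type, the region names, and the 99% threshold are all assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class RegionSample:
    region: str    # hypothetical source name
    requests: int  # requests seen in the sampling window
    errors: int    # failed requests in the same window


def is_working(samples, min_success_rate=0.99):
    """Return True when the combined success rate meets the threshold."""
    total = sum(s.requests for s in samples)
    if total == 0:
        # Assumption: no traffic at all is treated as "not working".
        return False
    failed = sum(s.errors for s in samples)
    return (total - failed) / total >= min_success_rate


healthy = [
    RegionSample("us-east", 1000, 2),
    RegionSample("us-west", 980, 5),
    RegionSample("eu", 120, 1),
]
degraded = [
    RegionSample("us-east", 1000, 150),
    RegionSample("us-west", 980, 5),
    RegionSample("eu", 120, 1),
]

print(is_working(healthy))   # combined success rate is above 99%
print(is_working(degraded))  # one region drags the aggregate below 99%
```

Only the second case would page someone; from there an engineer could drill into per-region or per-host metrics to find the cause.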

Importance of Time Series Data

Take the following graphs for example:

[Figure: Combined cups of coffee graphs]

If you simply look at the first graph, you can see that this person’s energy level goes up as they have more coffee. But, when you start to look at time series data, you can see that this person drinks more cups of coffee at certain times of the day. So, based on the combination of these datasets, you can optimize the system to maintain this person’s energy level by providing cups of coffee at the proper times of day.

This is a very simple example, but it shows the importance of time series data and the granularity it provides. Aggregating your time series data with your high-level metrics can give you more visibility into how your system functions from second to second and minute to minute.


What Data Do You Need to Know?

Prior to the rise of a DevOps approach to continuous deployment, monitoring solutions typically offered only high-level data for events in production. Because infrastructure varies greatly from team to team, it’s hard to give a one-size-fits-all answer to the question of what data you need. Every team needs to go through an exercise of identifying which metrics are crucial at each stage of development.

After you’ve identified the information you need to collect, assess everything from tooling to data collection to data storage and presentation. Find likely points in your system where failures, latency issues, or other potential incidents could occur. Then, set up monitors to collect data at each potential failure point. You’ll typically measure these metrics over time, then aggregate and organize the data from each of these points into visible dashboards, graphs, tables, charts, etc.

By collecting data over time, engineers get deeper visibility into their system and can better identify when something is not functioning normally. Time series databases such as Prometheus or InfluxData’s InfluxDB can help you record data over time and use it to granularly monitor and understand the performance of your application or service.
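To make the collect-then-aggregate idea concrete, here is a minimal sketch that groups samples from several monitors into fixed time buckets and summarizes each bucket. It uses plain Python rather than a real time series database; the sample tuple layout, the source names, and the 60-second bucket size are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_bucket(samples, bucket_seconds=60):
    """Group (timestamp_seconds, source, value) samples into time buckets.

    Returns a dict keyed by bucket start time, in ascending order, with
    the mean, max, and sample count across all sources in that bucket.
    """
    buckets = defaultdict(list)
    for ts, _source, value in samples:
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: {"mean": mean(vals), "max": max(vals), "count": len(vals)}
        for start, vals in sorted(buckets.items())
    }


# Hypothetical latency samples (milliseconds) from two monitors.
samples = [
    (0, "api", 120.0),
    (15, "db", 30.0),
    (45, "api", 180.0),
    (61, "api", 900.0),
    (75, "db", 300.0),
]

result = aggregate_by_bucket(samples)
for start, summary in result.items():
    print(start, summary)
```

In practice a time series database would do this bucketing for you at query time; the point of the sketch is only to show what "aggregating over time" means for the raw samples.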

Visualizing the Data

Measuring aggregate metrics such as mean time to acknowledge (MTTA) and mean time to resolve (MTTR) will give better insights into how well you’re responding to system issues. By simply looking at each incident individually, you wouldn’t be able to identify trends over time. Aggregate monitoring and time series data can track changes over time to show you whether system performance is improving, stagnant, or declining.
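As a small worked example, the sketch below computes MTTA and MTTR from a handful of incident records and then breaks MTTA down by ISO week to expose a trend. The incident data and field names are hypothetical, and MTTR is assumed here to be measured from the moment the incident triggered to the moment it was resolved; your team may define it differently.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        [(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents]
    )


def weekly_mtta(incidents):
    """MTTA in minutes, grouped by (ISO year, ISO week) of the trigger time."""
    by_week = defaultdict(list)
    for i in incidents:
        iso = i["triggered"].isocalendar()
        by_week[(iso[0], iso[1])].append(i)
    return {
        week: mean_minutes(group, "triggered", "acknowledged")
        for week, group in sorted(by_week.items())
    }


# Hypothetical incident records.
incidents = [
    {"triggered": datetime(2018, 6, 1, 10, 0),
     "acknowledged": datetime(2018, 6, 1, 10, 4),
     "resolved": datetime(2018, 6, 1, 10, 34)},
    {"triggered": datetime(2018, 6, 2, 2, 0),
     "acknowledged": datetime(2018, 6, 2, 2, 10),
     "resolved": datetime(2018, 6, 2, 3, 0)},
    {"triggered": datetime(2018, 6, 9, 14, 0),
     "acknowledged": datetime(2018, 6, 9, 14, 1),
     "resolved": datetime(2018, 6, 9, 14, 11)},
]

mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")
trend = weekly_mtta(incidents)

print(f"MTTA: {mtta} min, MTTR: {mttr} min")
print(trend)
```

Looking at any one of these incidents alone says little; the weekly breakdown is what shows whether acknowledgment times are moving in the right direction.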

The next step is to present the data to your team in a digestible way. Well-built dashboards with pertinent charts, graphs, and tables can make information more visible for your team. Tools such as Grafana, Datadog, Splunk, or Graphite can be used to cleanly showcase your metrics and turn raw time series data into something actionable.

Aggregate Monitoring and Incident Resolution

Through real-time aggregate monitoring, your team can better track a system’s health, reducing the time to acknowledge and time to resolve incidents. Aggregate monitoring metrics can be applied to see the performance of multiple variables, including technology, processes, and people operations. Start using aggregate monitoring and time series data to improve visibility and make incident management easier and more effective.

Centralize and collaborate around incident data and alerts to quickly remediate issues. See our recent webinar with InfluxData and Grafana to learn more about building a full, end-to-end incident management toolchain.
