We just released the initial version of our reporting feature. VictorOps currently offers 3 different reports: Incident Metrics, MTTA/MTTR, and Incident Trends. All of the data relates to your organization's incidents, acknowledgments, and recoveries.
We start with Incident Metrics (IM) as the most straightforward report, presenting daily totals of incidents, acknowledgments, and resolves (IARs). Next, MTTA/MTTR (MM) looks at the daily average time to acknowledgment and resolution. Last but not least, Incident Trends (IT) graphs a running average of IARs.
Let’s take a 30,000-foot view of these reports:
IM shows short-term trends for IARs, giving a quick glance at the load on your team(s).
IT is a 15-day average of IM, giving us a smoother graph that shows long-term trends.
MM, on the other hand, is the daily average of the time between the start and acknowledgment of an incident, and between the start and resolution of an incident.
MM shows trends (both long and short) for average response and resolution times for incidents.
From this view, let’s dig a bit deeper and look at how we calculate the values in the second and third reports. For IT, we sum each set of totals for the day plus those totals for the 14 days prior, then divide by 15 (the number of days in the moving average). MM is a bit more complex: for each incident that occurred that day, we take the difference between the start time and the acknowledge or resolve time, sum each set together, and divide by the total number of incidents in each set.
The formulas for each…
$$\mathrm{IT} = \frac{1}{15}\sum_{x=-14}^{0} t(x)$$

where t(x) is the total number of incidents on day x. Negative values indicate past days.
$$\mathrm{MTTA} = \frac{1}{d}\sum_{x=1}^{d} \left(A(x) - S(x)\right)$$

where A(x) is the ack time of incident x, S(x) is the start time of incident x, and d is the total number of incidents in a day.
$$\mathrm{MTTR} = \frac{1}{d}\sum_{x=1}^{d} \left(R(x) - S(x)\right)$$

where R(x) is the resolve time of incident x, S(x) is the start time of incident x, and d is the total number of incidents in a day.
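To make these calculations concrete, here is a minimal Python sketch of the three computations described above. The function names and data layout are our own for illustration; they are not part of the VictorOps platform or API.

```python
from datetime import datetime

def incident_trend(daily_totals, window=15):
    """Moving average of daily totals: today's total plus the
    prior days' totals, divided by the window size (15 days).
    daily_totals[-1] is today; earlier entries are prior days."""
    recent = daily_totals[-window:]
    return sum(recent) / len(recent)

def mtta(incidents):
    """Mean time (seconds) from incident start to acknowledgment.
    Each incident is a (start, ack, resolve) tuple of datetimes."""
    deltas = [(ack - start).total_seconds() for start, ack, _ in incidents]
    return sum(deltas) / len(deltas)

def mttr(incidents):
    """Mean time (seconds) from incident start to resolution."""
    deltas = [(res - start).total_seconds() for start, _, res in incidents]
    return sum(deltas) / len(deltas)

# Example: two incidents in one day, acked after 5 and 1 minutes,
# resolved after 30 and 11 minutes respectively.
day = [
    (datetime(2021, 1, 1, 9, 0), datetime(2021, 1, 1, 9, 5), datetime(2021, 1, 1, 9, 30)),
    (datetime(2021, 1, 1, 12, 0), datetime(2021, 1, 1, 12, 1), datetime(2021, 1, 1, 12, 11)),
]
print(mtta(day))  # 180.0 seconds (mean of 300 and 60)
print(mttr(day))  # 1230.0 seconds (mean of 1800 and 660)
```

Note that, as the post explains, the real report uses the time until the *last* ack when an incident is rerouted across teams; this sketch simply assumes one ack per incident.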
Ok, so now we understand what the reports are showing, but what do they mean? How can we use them?
Let’s start with the first report. It is a chart showing a bare-bones view of how many times a day your teams are getting alerts, how often they are acknowledging those alerts, and how frequently your teams are resolving alerts. This set of data is useful in showing the load on your teams, and how they handle that load.
If your system sends very few alerts, but they take a while to fix, your system could be healthy but throwing the occasional random error (no system is perfect), or you could have unknown issues that are invisible until they crop up. By itself, this report won’t tell you which is the case - that judgment requires knowledge of your individual circumstances.
Onwards to IT! Much the same interpretation applies to this report; however, the time scale differs. While you can see the load on your engineers in the short term with IM, IT shows how that load is moving and affecting your employees over much larger time frames. How has the last quarter looked? Look at IT, not IM. Just as with IM, IT isn’t the panacea of system health metrics (far from it, in reality), but used with domain knowledge, it can shed light on the fragility of your system.
MTTA and MTTR also aren’t the remedy to system fragility. What these two metrics attempt to illustrate is how quickly your engineers respond to system failures. When aggregated, they can be used to estimate how quickly your systems can be brought back up. Due to how we currently calculate the values, MTTA and MTTR can be misleading: if an incident is rerouted through a number of teams, the total time until the last ack comes in is used.
When an incident is resolved, we use the total time until either someone hits the resolve button or the system sends a recovery message to our platform. This means resolves could appear to take longer than they actually did, skewing the metric up; or the system could be auto-resolving, skewing the metric down. Overall, these metrics require domain knowledge of your systems to judge whether everything is healthy and functioning properly, but they help immensely with that interpretation.
One final point of contention is the logarithmic axis within each report. We chose a log-scale y axis due to the variation in each organization’s incident/alert volume. On a log scale, the distance between each tick mark is equal, but the value goes up exponentially: the screen space between 1 and 10 is the same as between 10 and 100, or 100 and 1000. To help ease into the log scale, we have hover tooltips that show the exact values, making the reports easier to read.
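The equal-spacing property is easy to check: on a log axis, screen distance is proportional to the difference of logarithms, so each factor-of-10 jump spans the same distance. A quick sketch:

```python
import math

# On a log10 axis, position is proportional to log10(value),
# so each decade (1->10, 10->100, 100->1000) spans the same distance.
span_1_10 = math.log10(10) - math.log10(1)
span_10_100 = math.log10(100) - math.log10(10)
span_100_1000 = math.log10(1000) - math.log10(100)
print(span_1_10, span_10_100, span_100_1000)  # each spans one decade
```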
We decided on these reports as our initial offering because they illustrate a selection of understandable metrics that are meaningful when used in conjunction with each other and with other metrics within your systems. We also felt that these were the most relevant to what our customers needed and wanted from an initial pass at reporting.
Going forward, we are looking into more granular reports, such as team- or individual-level MTTA/MTTR and IARs (both base totals and overall trends). Other areas getting attention are system metrics, such as the noisiest server (most incidents sent) and MTTA/MTTR for those noisy servers, in order to better understand how efficiently systems are running and where excess noise in an organization’s systems is coming from.