
Using Splunk Dashboards for an Analytical Approach to Incident Response

Dylan Klausing February 19, 2020


“Data-driven.” It’s a somewhat vague term you hear across all industries and disciplines. Oftentimes, teams find themselves with either too much data and no way to analyze it, or not enough data to make accurate decisions. In IT operations and DevOps, data is generated by the people who help build and maintain your systems as well as the systems themselves. Data is all around us in software development and IT Ops, but the question remains, how do we harness it?

Data is driving faster deployments, more reliable CI/CD pipelines, improved on-call quality of life and, of course, better incident response. Observability tools like Splunk and SignalFx give teams greater insight into system health and help them detect anomalies and incidents faster and more accurately than before. Tools like VictorOps then allow developers, technical support engineers and IT professionals to collaborate in real time, increase cross-functional transparency and improve the way they remediate incidents.

Because teams often think of Splunk for log monitoring and general observability, alongside a suite of other APM, network monitoring, container monitoring and system monitoring tools, Splunk’s data visualization and analytics piece can easily get lost in the shuffle. So, we wanted to walk through some benefits of Splunk + VictorOps, not just with real-time monitoring and logging, but also for post-incident reviews and continuous improvement to on-call incident management.

Using Splunk + VictorOps bidirectionally

Many IT operations teams and DevOps-minded engineering organizations use Splunk as their log management tool. Splunk allows you to ingest machine data from all kinds of sources, then search and filter it to find where real issues lie within your logs – leading to faster incident resolution and faster product enhancements. However, if you’re only using Splunk for monitoring, you’re missing the other side of the coin. After incident resolution, you can feed incident response data back into Splunk, helping you track key incident management KPIs, reduce mean time to acknowledge (MTTA) and mean time to resolve (MTTR) over time, and drive more efficient, humane on-call experiences.
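As a rough illustration of what "feeding incident response data back into Splunk" can look like, the sketch below wraps a resolved incident in the JSON envelope used by Splunk's HTTP Event Collector (HEC). This is not the VictorOps integration itself. The host, token, sourcetype name and incident fields are all hypothetical placeholders, and the request is built but not sent.

```python
import json
import urllib.request
from datetime import datetime

# Hypothetical values: replace with your own Splunk host and HEC token.
HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"


def to_hec_event(incident: dict) -> dict:
    """Wrap an incident record (assumed schema) in the HEC event envelope."""
    resolved = datetime.fromisoformat(incident["resolved_at"])
    return {
        "time": resolved.timestamp(),      # event time as epoch seconds
        "sourcetype": "oncall:incident",   # assumed sourcetype name
        "source": "incident-export",       # assumed source label
        "event": incident,                 # the incident itself, as JSON
    }


def build_request(incident: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for one incident."""
    body = json.dumps(to_hec_event(incident)).encode("utf-8")
    return urllib.request.Request(
        HEC_URL,
        data=body,
        headers={
            "Authorization": f"Splunk {HEC_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example incident with made-up field names and values.
incident = {
    "incident_id": "INC-1042",
    "service": "checkout-api",
    "acked_by": "alice",
    "triggered_at": "2020-02-19T10:00:00+00:00",
    "resolved_at": "2020-02-19T10:42:00+00:00",
}
request = build_request(incident)
# urllib.request.urlopen(request)  # uncomment to actually send the event
```

Once events like this are indexed, they can be searched and charted in Splunk like any other machine data.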

True observability includes the people on your team and the tools you use to build, deploy and maintain the applications and services you work so hard to keep track of. Treating Splunk as a simple log management tool doesn’t take advantage of the full capabilities of the platform. Bidirectional data flow from Splunk into VictorOps and then back into Splunk allows you to monitor the entire incident management lifecycle and find areas for improvement. You can start tracking metrics such as overall downtime and time spent on unplanned work, and you can even look into specific users to ensure an equitable on-call rotation for everyone involved.
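Metrics such as MTTA, MTTR and overall downtime are usually derived from three timestamps per incident: when it was triggered, acknowledged and resolved. The sketch below shows that arithmetic on a small, made-up data set. The field names are assumptions, and measuring MTTR from trigger to resolution is one common convention rather than the only one.

```python
from datetime import datetime
from statistics import mean

# Made-up incidents; field names are illustrative, not a real export schema.
incidents = [
    {"triggered": "2020-02-01T10:00:00", "acked": "2020-02-01T10:04:00", "resolved": "2020-02-01T10:40:00"},
    {"triggered": "2020-02-03T02:15:00", "acked": "2020-02-03T02:27:00", "resolved": "2020-02-03T03:35:00"},
    {"triggered": "2020-02-07T16:30:00", "acked": "2020-02-07T16:32:00", "resolved": "2020-02-07T17:00:00"},
]


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


ack_times = [minutes_between(i["triggered"], i["acked"]) for i in incidents]
resolve_times = [minutes_between(i["triggered"], i["resolved"]) for i in incidents]

mtta = mean(ack_times)              # mean time to acknowledge, in minutes
mttr = mean(resolve_times)          # mean time to resolve, in minutes
total_downtime = sum(resolve_times) # minutes spent in an incident state

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min, downtime: {total_downtime:.0f} min")
```

Tracking these numbers per week or per team, rather than as a single global average, tends to make trends easier to see on a dashboard.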

Splunk Dashboard Image With VictorOps On-Call Data

Using dashboards and visualizations for better incident response

While Splunk does an excellent job of collecting the data and aggregating it, the information is useless unless it’s served to engineers in an actionable fashion. Splunk’s data visualization and dashboard functionality will help on-call managers and incident commanders better understand what’s going on with their teams. Based on this data, they can implement new processes, techniques or tools to drive more efficient incident response and hopefully lead to greater transparency and collaboration between teams.

Dashboards and visualizations in Splunk don’t have to be limited to high-level incident response metrics. Teams can also aggregate the data ingested into VictorOps from additional monitoring tools and send it into Splunk, helping create comprehensive observability dashboards and define more accurate SLIs, SLOs and SLAs – especially useful for SRE teams. The result is a one-stop shop for system monitoring metrics as well as detailed incident response metrics, helping you paint the full picture of your service’s uptime and the effectiveness of your on-call teams.
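To make the SLI/SLO idea concrete, here is a minimal sketch of an availability SLI and the matching error budget, computed from incident downtime. The 30-day window, the 99.9% target and the downtime figure are assumed values chosen for illustration only.

```python
# Assumed inputs: a 30-day window, a 99.9% availability SLO,
# and 30 minutes of incident downtime recorded in that window.
WINDOW_MINUTES = 30 * 24 * 60
SLO_TARGET = 0.999
downtime_minutes = 30.0

# SLI: the fraction of the window in which the service was available.
availability_sli = 1 - downtime_minutes / WINDOW_MINUTES

# Error budget: the downtime the SLO allows over the same window.
error_budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)
budget_remaining = error_budget_minutes - downtime_minutes

slo_met = availability_sli >= SLO_TARGET

print(f"SLI: {availability_sli:.4%}")
print(f"Error budget: {error_budget_minutes:.1f} min, remaining: {budget_remaining:.1f} min")
print(f"SLO met: {slo_met}")
```

A dashboard panel showing the remaining budget gives teams a simple signal for when to slow down releases and focus on reliability work.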


A monitoring tool for your people

By feeding your incident response and on-call data back into Splunk, you’ve essentially built a monitoring tool for the people on your team. You can monitor the hours spent on-call for each user and see the number of incidents they’re responding to. Additionally, you can see the criticality of each incident they’re working on and the number of hours being taken away from other work (e.g. writing code, hosting game days, running tests or QA). At a very granular level, you can see when issues are popping up, where problems are recurring, and which tools and services are firing off frequent alerts – helping you prioritize where you need to look to reduce alert fatigue and fix the biggest reliability concerns.
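The per-person view described above boils down to a few aggregations over incident records. The sketch below works on made-up data with assumed field names, and the 50% threshold for flagging an uneven rotation is an arbitrary choice, not a recommendation.

```python
from collections import Counter, defaultdict

# Made-up incident records; "minutes" is time spent responding.
pages = [
    {"user": "alice", "service": "checkout-api", "severity": "critical", "minutes": 45},
    {"user": "alice", "service": "checkout-api", "severity": "critical", "minutes": 30},
    {"user": "bob",   "service": "search",       "severity": "warning",  "minutes": 15},
    {"user": "alice", "service": "billing",      "severity": "critical", "minutes": 60},
    {"user": "carol", "service": "checkout-api", "severity": "warning",  "minutes": 20},
    {"user": "bob",   "service": "checkout-api", "severity": "critical", "minutes": 25},
]

# Incidents handled and minutes of unplanned work, per user.
incidents_per_user = Counter(p["user"] for p in pages)
minutes_per_user = defaultdict(int)
for p in pages:
    minutes_per_user[p["user"]] += p["minutes"]

# Critical incidents per user, and alert volume per service.
critical_per_user = Counter(p["user"] for p in pages if p["severity"] == "critical")
alerts_per_service = Counter(p["service"] for p in pages)

# Flag anyone carrying more than half of the total response time.
total_minutes = sum(minutes_per_user.values())
overloaded = [u for u, m in minutes_per_user.items() if m / total_minutes > 0.5]

noisiest_service, noisiest_count = alerts_per_service.most_common(1)[0]

print(dict(minutes_per_user))
print("Overloaded:", overloaded)
print("Noisiest service:", noisiest_service, noisiest_count)
```

The same groupings map naturally onto dashboard panels: one per responder for workload, one per service for alert volume.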

In this capacity, Splunk truly is the Data-to-Everything Platform. You’re now able to drive faster incident response without creating a culture of on-call burnout. Engineering managers now know when engineers may feel undervalued and can see exact metrics around why their team might be feeling fatigued. Based on who’s looking for what data, you can build different dashboards for different personas – allowing for as much transparency across teams as you’d like. And, with Splunk on top of built-in VictorOps reports like the Incident Frequency Report, you can really narrow down where problems are coming up and use the data to tackle the highest priority initiatives first.

VictorOps Incident Frequency Report Screenshot

Data-driven incident management and response

Data-driven incident management and response isn’t only about using metrics to more accurately detect errors and anomalies in your applications and infrastructure. Understanding how an unstable system makes development and IT teams less productive can also help you establish a more efficient product planning schedule. With Splunk analytics and data visualizations alongside VictorOps, you can reduce alert noise, identify when (and why) certain incidents occur, better correlate similar incidents and build a much less reactive on-call workforce. The data can lead to improved runbooks, alert annotations and post-incident reviews – helping you fix future problems faster.

Along with data collection and visualization, Splunk offers machine learning and AI capabilities in tools such as IT Service Intelligence so you can proactively build more resilient applications and infrastructure while simultaneously facilitating a better on-call lifestyle. The more data you feed into your machine learning algorithms and the engineering teams building your products, the more productivity you’ll gain from the entire system.

Driving continuous improvement

Continuous improvement is at the core of any good, DevOps-oriented organization. Splunk dashboards can help you see which on-call notification methods are most effective and which types of alerts are acknowledged fastest – helping you inform your alert routing rules and escalation policies. Through APIs, webhooks and database connections with Splunk, you can ingest data from nearly any source and create dashboards from that data. These insights give DevOps and IT teams of any shape or size an avenue for continuous improvement.
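One way to compare notification methods is to group acknowledgement times by the method that reached the responder and look at the median for each. The sketch below uses made-up numbers and assumed field names. The median is used instead of the mean so that a single slow acknowledgement doesn't dominate a small sample.

```python
from collections import defaultdict
from statistics import median

# Made-up records: which notification method was acknowledged, and how fast.
acks = [
    {"method": "push",  "ack_minutes": 2},
    {"method": "push",  "ack_minutes": 3},
    {"method": "push",  "ack_minutes": 4},
    {"method": "sms",   "ack_minutes": 5},
    {"method": "sms",   "ack_minutes": 7},
    {"method": "phone", "ack_minutes": 1},
    {"method": "phone", "ack_minutes": 9},
    {"method": "phone", "ack_minutes": 2},
]

# Collect acknowledgement times per notification method.
by_method = defaultdict(list)
for record in acks:
    by_method[record["method"]].append(record["ack_minutes"])

# Median time to acknowledge per method, fastest first.
median_ack = {method: median(times) for method, times in by_method.items()}
ranked = sorted(median_ack.items(), key=lambda item: item[1])

for method, minutes in ranked:
    print(f"{method}: median ack {minutes} min over {len(by_method[method])} pages")
```

A ranking like this can feed directly into escalation policy decisions, such as which method to try first for high-severity alerts.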

You can mix in data from CI/CD and configuration management tools like Jenkins or Puppet, as well as comprehensive monitoring data from APM, database monitoring and infrastructure monitoring tools like SignalFx, Prometheus and Amazon CloudWatch. Most teams aren’t currently using just one tool for all of their observability needs – but Splunk makes it easier for engineers to at least digest all of this data in one single location and take action. Over time, you’ll improve not just one part of the software delivery lifecycle, but all aspects of software development, release management and incident response.

Combining the power of Splunk + VictorOps

Creating alert rules based on Splunk data and routing these notifications through VictorOps’ collaborative on-call network is a great start to more efficient incident management and response. But it’s only one part of a much larger equation. With all the data ingested into VictorOps, you should be able to make the proactive process and tooling changes that can drastically improve the efficiency of on-call teams. Alongside modern advancements in AI and machine learning for IT operations and DevOps, you’ll likely be able to lower MTTA and MTTR substantially and, in the best case, catch and fix problems before they ever turn into incidents.

Separately, Splunk and VictorOps provide a lot of value in their respective arenas of observability and alerting. But, together, when working bidirectionally, you’re able to build out a complete, data-driven system for observability, incident response, agile development and service resilience.

See the tangible benefits of a Splunk + VictorOps on-call lifestyle for yourself. Sign up for a 14-day, free trial or reach out to us for a personalized demo to learn more.

Let us help you make on-call suck less.
