This is a guest article by Freyja Spaven from Raygun – an error, crash, and performance monitoring tool for web and mobile applications. The Raygun team are experts in comprehensive application monitoring and surfacing actionable incident insights.
You spent months redesigning your on-call schedule, researched best practices, and looked at strategies from the most prominent tech leaders. But your MTTR (mean time to resolve) is still too long, and the pressure is on to shave off minutes, even hours.
When you ask your on-call developers where the biggest time-drain is, you find that a majority of the resolution process is spent looking through log files for the root cause. To see the biggest short-term reductions in response time, you start looking for solutions.
Let’s look at an overview of an incident response plan and how you can leverage insights from error monitoring tools to respond to issues faster.
With so many tools available for incident detection, it can be hard to know which will cause data overload and which will give actionable metrics. You can quantify whether your system is healthy by setting up priority guidelines, which you can then track with monitoring tools. For example, you can say, “We have a server error that has resulted in a critical outage, which we need to resolve immediately.” Or, “A redundant system fell over and failed to recover, but it has no visible impact on customers, so let’s wait to alert someone until office hours.”
Monitoring tools put metrics around these priorities and give visibility into the severity of the issue, so you can set up meaningful alerts that are actionable and triaged, rather than just noise.
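As a rough illustration, the two example guidelines above could be encoded as a simple rule. This is only a sketch: the `Incident` fields and the `Priority` names are hypothetical and do not come from Raygun or VictorOps.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    # Hypothetical priority levels mirroring the two example guidelines.
    CRITICAL = "page the on-call engineer immediately"
    LOW = "notify someone during office hours"


@dataclass
class Incident:
    description: str
    customer_impact: bool  # assumption: you can tell whether customers are affected


def classify(incident: Incident) -> Priority:
    """Map an incident to a priority using visible customer impact alone."""
    if incident.customer_impact:
        return Priority.CRITICAL
    return Priority.LOW


outage = Incident("server error causing a critical outage", customer_impact=True)
failover = Incident("redundant system failed to recover", customer_impact=False)
print(classify(outage).value)
print(classify(failover).value)
```

Real guidelines usually weigh more than one signal, but even a single explicit rule like this makes it clear which alerts should wake someone up.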
If you have error monitoring integrated with alerting software like VictorOps, alerts are raised to VictorOps based on configured thresholds for new errors or recurring errors (e.g., an error that was previously ignored occurring again). You can also set a threshold on the number of errors over a given time period, so that an alert is raised whenever that threshold is reached.
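To make the idea of an error-count threshold concrete, here is a minimal sliding-window sketch. The `ErrorRateAlert` class and its parameters are illustrative assumptions, not how Raygun or VictorOps implement thresholds, and in this sketch the alert fires once the count goes above `max_errors`.

```python
from collections import deque


class ErrorRateAlert:
    """Signal an alert when more than `max_errors` occur within `window_seconds`."""

    def __init__(self, max_errors: int, window_seconds: float) -> None:
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps = deque()  # timestamps of errors still inside the window

    def record(self, timestamp: float) -> bool:
        """Record one error; return True if the threshold is now exceeded."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop errors that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors


alert = ErrorRateAlert(max_errors=2, window_seconds=60)
for t in (0, 10, 20, 100):
    print(t, alert.record(t))
```

Timestamps are plain seconds here so the example stays deterministic. In practice they would come from the error events themselves.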
The next place to look while trying to save time is the incident response phase. Firstly, are you getting the right people involved quickly? As Dan Holloran describes, the key to improving incident response time is a deeper level of transparency and collaboration across both humans and technology, which includes keeping affected customers in the loop about outages.
There’s no need to risk a wider loss in confidence in your community when something happens.
When experiencing an outage, people affected often head straight to social media in an attempt to avoid support queues. This is often a problem for software teams, as team members who monitor social media channels still need to find answers with internal teams before they can respond, yet customers still want to cram detailed support requests into 140 characters.
Error and crash reporting software tracks which users were affected by specific errors, crashes, or poor user experiences, allowing you (or your support team) to contact only the users who have been affected by a bug or underlying issue.
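A small sketch of the underlying idea, assuming error events are available as dictionaries with hypothetical `error` and `user` keys (a real tool exposes this through its own dashboard or API):

```python
from collections import defaultdict


def affected_users(events):
    """Group error events by error name and collect the distinct users hit by each."""
    users_by_error = defaultdict(set)
    for event in events:
        users_by_error[event["error"]].add(event["user"])
    return dict(users_by_error)


# Made-up sample events, for illustration only.
events = [
    {"error": "CheckoutTimeout", "user": "ana@example.com"},
    {"error": "CheckoutTimeout", "user": "ben@example.com"},
    {"error": "CheckoutTimeout", "user": "ana@example.com"},
    {"error": "ProfileCrash", "user": "cy@example.com"},
]

for error, users in affected_users(events).items():
    print(error, sorted(users))
```

With a mapping like this, a support team can message only the people who actually hit `CheckoutTimeout` rather than broadcasting to everyone.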
The remediation phase is often where teams can save the most time, especially when you are relying on error logs. The first problem with relying on logs is that you usually need someone skilled in log searching to be on call so they can narrow down the error quickly. The second is that with logs, it’s hard to tell the severity of the issue without detailed cross-examination. Not ideal at 3 AM!
Fixing the issue is easy if you know where it is. That is why many dev teams let an error monitoring tool do the heavy lifting of identifying the root cause of an issue, so that it’s surfaced inside the alert itself. From there, it’s much easier for the developer to investigate further. They might want to understand how many people the issue affected, or whether any recent deployments caused the issue.
If you’ve set up error monitoring well, a developer can quickly see these metrics and understand the context. Any description attached will be enough to understand what’s going on.
Jason Hand, the ex-DevOps Evangelist at VictorOps and author of the O’Reilly ebook Post-Incident Reviews, places a particular focus on “blameless culture” in successful on-call teams. He reiterates in this article that removing blame gets to the heart of issues much faster, and as a result, nobody withholds information.
Error monitoring software can speed up incident analysis further by providing an organized timeline of what happened and what actions were taken to resolve the issue, which makes it easier to identify improvements. The timeline is automatically logged in integrations with Slack or HipChat, helping your team to shave off vital minutes if the issue arises again.
Don’t set up monitoring tools and then forget them, as this can leave you with a purely reactive response. Monitoring errors in production proactively is one of the biggest benefits of error monitoring software and allows you to be better prepared.
For example, by using the alert history from the incident analysis phase, your team can spot trends before an alert is triggered. Going back and looking for trends in errors helps you to understand previous behaviors triggering errors, which in turn, speeds up the discovery of the root cause.
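One simple way to look for such trends is to compare how often each error appeared before and after a chosen point in time. The sketch below assumes the alert history is a list of `(timestamp, error_name)` pairs. The function name, the `factor` parameter, and the sample data are all hypothetical.

```python
from collections import Counter


def rising_errors(history, split_time, factor=2.0):
    """Return error names whose count at or after `split_time` is at least
    `factor` times their count before it (errors unseen before count as 1)."""
    before = Counter(name for ts, name in history if ts < split_time)
    after = Counter(name for ts, name in history if ts >= split_time)
    return sorted(
        name for name, count in after.items()
        if count >= factor * max(before[name], 1)
    )


# Made-up history: (timestamp, error name)
history = [
    (1, "Timeout"), (2, "Timeout"), (3, "NullRef"),
    (11, "Timeout"), (12, "Timeout"), (12, "NullRef"),
    (13, "Timeout"), (14, "Timeout"),
]

print(rising_errors(history, split_time=10))
```

Here `Timeout` doubles from two occurrences to four and is flagged, while `NullRef` stays flat and is not. A real analysis would likely use proper time buckets, but the comparison is the same idea.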
You can save a lot of time in your on-call process by removing the need for developers to search through error logs. For a more efficient alerting process, shorten the time spent at each of the phases in the incident management lifecycle.
Removing ambiguity in the discovery phase is one of the fastest ways to remove confusion and noise from errors. If you are spending too long in the discovery phase, consider supporting your error logs with a more sophisticated crash reporting tool, so your developers can get on with what they love to do most—write code and ship features.
Raygun helps DevOps and IT teams identify application errors and performance issues faster. Learn more by checking out their free eBook, Actionable APM: Your Guide to Modern Application Performance Monitoring, to make the most of your own monitoring and alerting stack.
Freyja Spaven is a digital marketing specialist at Raygun, a software intelligence platform that helps development teams create error-free experiences for users. Freyja has authored many articles on software quality and performance, driven by a passion for sharing knowledge and best practices to enable the success of others.