VictorOps is now Splunk On-Call.
Due to the increasing complexity of IT architectures and digital delivery chains, providing an amazing end user experience is more challenging than ever. Traditional IT departments took a siloed approach to monitoring (the application team monitored the application, the network team monitored the network, and so on), but that strategy is no longer viable.
Modern websites and applications are composed of numerous layers and components, many of which lie completely outside of your control. Third-party services like content delivery networks (CDNs), DNS providers, cloud delivery systems, and external ISP networks make it hard to get true insight into the end user’s experience, yet an issue with any one of them can impact your customers without you ever knowing it.
An efficient alerting and incident management tool is a crucial part of any Digital Experience Monitoring (DEM) strategy, but the quality of those alerts is only as good as the data that your monitoring tool provides. For example, if you’re relying on a cloud provider to deliver your application but also running synthetic tests from that same cloud network, you’re going to have significant visibility gaps in all the other layers of the delivery chain.
What’s needed is a DEM tool that provides a complete outside-in view of the end user experience and collects data from every layer of the delivery chain, and from the locations where your users actually are. This way, the mean time to detect (MTTD) can be shortened, and alerts can be routed to the appropriate team or third-party provider before the investigation takes place. After all, waking everybody up in the middle of the night only for them to learn that the problem isn’t their team’s responsibility is a great way to burn out your employees.
Catchpoint recently released the 2019 SRE Report, which revealed that 79% of individuals working in SRE, DevOps, and IT Ops teams experience stress due to incident response on a regular basis. Just imagine how much lower their stress levels would be if they never had to waste hours in IT war rooms trying to figure out who’s actually responsible for remediating an issue.
One of the biggest strains on incident response teams is responding to alerts triggered by faulty data or by issues with the testing agent itself.
This can be overcome with a two-pronged approach to monitoring and alerting:
The synthetic testing agents deployed by Catchpoint across hundreds of cities and ISPs around the world are stateless. This means that if there’s an issue with the testing environment, the test will not run or deliver data.
The Catchpoint platform automatically runs diagnostics such as dig, dig +trace, and similar commands, taking the most exhaustive and mundane tasks out of the hands of the operations team and allowing them to focus on investigating the root cause and ultimately fixing the issue in a timely manner.
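To make the idea concrete, here is a minimal sketch of what automating those DNS diagnostics might look like. This is not Catchpoint's implementation or API; the function names and the 30-second timeout are illustrative assumptions, and it simply shells out to the standard `dig` utility mentioned above:

```python
import subprocess

def dns_debug_commands(hostname: str) -> list[list[str]]:
    # The two diagnostics named in the text: a direct lookup against the
    # configured resolver, and a +trace walk of the delegation path from
    # the root servers down.
    return [
        ["dig", hostname],
        ["dig", "+trace", hostname],
    ]

def run_dns_debug(hostname: str) -> dict[str, str]:
    # Capture each command's output so it can be attached to the alert,
    # sparing the on-call engineer the manual legwork.
    results = {}
    for cmd in dns_debug_commands(hostname):
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        results[" ".join(cmd)] = proc.stdout
    return results
```

Attaching this output to the alert itself is what lets responders skip straight to root-cause analysis.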
The most obvious alerts are for basic metrics like availability (is the site up or down?) and page load speed (did the page take more than XX seconds to load?). These are tied to what we typically think of as an “outage” (i.e. when there are widespread problems and users are completely unable to access the site or application).
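A basic check along those lines can be sketched as follows; the 3-second default threshold is an assumed example value, not a recommendation:

```python
def evaluate_basic_alerts(is_up: bool, load_time_s: float,
                          max_load_s: float = 3.0) -> list[str]:
    # Availability first: a down site makes the load-time check moot.
    alerts = []
    if not is_up:
        alerts.append("availability: site is down")
    elif load_time_s > max_load_s:
        alerts.append(
            f"performance: page loaded in {load_time_s:.1f}s, "
            f"over the {max_load_s:.1f}s threshold"
        )
    return alerts
```

Checks like these catch the classic full outage, but as the next paragraphs explain, they are only the starting point.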
However, we also see significant user experience problems in the form of micro-outages and latency – which often go completely undetected, yet still have a negative effect on your brand or bottom line. Micro-outages occur either when users in a specific geography or on a certain network are experiencing issues, or when they are unable to engage with a specific part of your site (e.g. an ‘Add to Cart’ button is not displayed on an e-commerce product page).
Latency, on the other hand, could mean that everything on the site or app is functioning exactly as it’s supposed to, but delivery issues in the external networks are causing performance issues for the end users.
In cases like this, alerting for the most obvious metrics is likely not enough to provide the kind of user experience that modern customers have come to expect. Your alerts must be tied to more granular metrics that reflect the performance of all the different layers of the delivery chain to ensure that you’re aware when your customer experience – regardless of geography or network – has suffered.
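One way to surface the micro-outages described above is to evaluate synthetic test results per geography or network rather than globally. The sketch below is an illustrative assumption about how such a check could work, with a hypothetical 25% failure-rate threshold:

```python
from collections import defaultdict

def detect_micro_outages(samples, error_rate_threshold=0.25):
    # samples: iterable of (region, succeeded) pairs from synthetic test
    # runs. Flags regions whose failure rate crosses the threshold, even
    # when the aggregate failure rate across all regions looks healthy.
    by_region = defaultdict(lambda: [0, 0])  # region -> [failures, total]
    for region, ok in samples:
        by_region[region][1] += 1
        if not ok:
            by_region[region][0] += 1
    return sorted(
        region
        for region, (failed, total) in by_region.items()
        if failed / total > error_rate_threshold
    )
```

In the test case below, a three-quarters failure rate in one region is flagged even though the global failure rate sits at only 25% — exactly the kind of regional problem a site-wide availability check would miss.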
One of the most effective ways to ensure your site or application is performing as expected is to track performance over a long period of time, establish a benchmark, and then alert when performance degrades past a certain level.
Trend shifts are a common basis for performance monitoring and alerting, as they notify teams if performance degrades in comparison to historical data, rather than specifying a value for a certain metric.
However, if you’re not careful about what you’re monitoring and which metrics your alerts are tied to, you can miss gradual performance degradations, never realizing there’s a problem until it’s too late and your end users’ experience has already suffered severely.
Therefore, in addition to trend shift-based alerts, it’s important to also have some that are tied to hard-and-fast KPIs powered by historical data. This is critical to avoiding the “performance creep” described above, as it provides both the long-term perspective and the threshold(s) that are necessary for detecting those issues – even without a dramatic spike in load times or availability.
This chart shows one of those long-term performance creeps and how it can get away from you. The metric nearly doubled over a six-month period; had it not had an alert tied to it, the degradation would have gone unnoticed until customers started being affected.
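The two complementary alert styles described above can be sketched together in a few lines. The 1.5× shift factor and the specific KPI values are illustrative assumptions; the point is that a slow creep can pass the trend-shift check at every step while still drifting past a fixed, historically informed ceiling:

```python
from statistics import mean

def performance_alerts(history_ms, recent_ms, kpi_ms, shift_factor=1.5):
    # Trend-shift check: fires when recent performance degrades sharply
    # relative to the historical baseline.
    # KPI check: fires when performance creeps past a fixed ceiling, even
    # if no single step was big enough to register as a trend shift.
    baseline = mean(history_ms)
    current = mean(recent_ms)
    alerts = []
    if current > baseline * shift_factor:
        alerts.append("trend shift: recent average far above historical baseline")
    if current > kpi_ms:
        alerts.append("KPI breach: recent average above the fixed threshold")
    return alerts
```

A 1000 ms baseline that drifts to 1200 ms never trips a 1.5× trend-shift check, but a fixed 1100 ms KPI catches it — which is the gap the fixed-threshold alerts close.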
While trustworthy data and advanced alert triggers are important, a complete DEM strategy will also likely include other tools that must all work cohesively together. For example, a separate alerting and incident management tool like VictorOps can be integrated with your DEM platform to disseminate all of the alerts across teams and channels within your organization (e.g. email, SMS, messaging tools, etc.).
Ultimately, your end user experience monitoring solution needs to be optimized not only to detect and fix issues as quickly as possible, but also to communicate them to everyone inside or outside your organization who needs to know. The first step in that process is alerting the proper individuals right away with accurate, trustworthy data; from there, they can bring in anyone else who needs to be involved.
See how a single-pane-of-glass view into monitoring, alerting and collaboration can drive efficient incident response and remediation. Sign up for Death to Downtime – our latest free webinar with Catchpoint to see how DevOps, SRE and IT operations teams are mitigating downtime and driving better customer experiences.