In the DevOps, IT and SRE world, monitoring and observability are core to highly reliable applications and infrastructure. The most performant engineering teams are standardizing observability across disparate services and making metrics easier to digest for all teams. When the right information is monitored and surfaced to the right people, sysadmins and developers alike can take action faster.
Setting up effective monitoring and alerting can be a daunting task when you’re starting from scratch. SRE’s four golden signals of monitoring are a great place to start: latency, traffic, errors and saturation are the key monitoring metrics for DevOps and IT. But, in this post, we’ll only take a look at one key monitoring metric – latency.
Latency can refer to different processes in networks, servers and applications and can, therefore, be monitored in slightly different ways. But, essentially, latency is used to measure the time it takes for a system’s request to be sent out and a response to be received.
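To make that definition concrete, here is a minimal Python sketch that times a single request from the moment it is sent until the response comes back. The names `timed_call` and `fake_request` are hypothetical and exist only for this illustration; in practice, `fake_request` would be replaced by a real HTTP call, database query or similar operation.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_call(request: Callable[[], T]) -> Tuple[T, float]:
    """Run `request` and return (result, latency in milliseconds)."""
    start = time.perf_counter()  # monotonic clock intended for measuring intervals
    result = request()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_request() -> str:
    """Stand-in for a real request (assumption: roughly 50 ms of work)."""
    time.sleep(0.05)
    return "ok"


response, latency_ms = timed_call(fake_request)
print(f"response={response!r} latency={latency_ms:.1f} ms")
```

Measuring at the caller captures everything the requester waits for, including network time, which is usually the number that matters for the user experience.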
So, how do you measure a healthy baseline of latency? What does it mean for your service’s latency to be “good”? Well, the answer to that question is somewhat dependent on the type of service you’re monitoring. But, luckily for you, DevOps and IT teams have shared some of their own findings about latency and what they’ve learned.
So, let’s get into the nitty-gritty and look at how you can define latency benchmarks and use that information to improve service resilience and incident response.
Along with other important incident management KPIs, latency monitoring can show performance issues as well as outright downtime. It’s a way to ensure your service works quickly and responds appropriately to customer expectations. As with most things in monitoring, latency falls on a gradient scale. So, defined benchmarks for healthy latency across all systems and services are necessary to truly understand how customers see your application when it’s under stress.
Here, we’ll look at some common areas of applications and infrastructure where teams monitor latency and how they use that information to build more reliable systems. Also, we’ll dive into how DevOps and IT teams take action to fix issues faster based on the latency benchmarks they establish.
Latency across the entire application and UI will vary based on the specific experience we’re looking at. Also, as we cover these metrics, keep in mind that these are baselines for nearly any industry but could vary quite a bit based on the type of application or service we’re talking about. For example, anything involving large datasets such as a database visualization or analysis tool might need higher thresholds for latency.
The following latency benchmarks have been studied and established as healthy baselines by the team at Red Hat on this GitHub page.
For most minor actions, the response to a user action should normally take less than 1 second. A response within 0.2 seconds feels immediate, and anything over 1 second becomes very noticeable to end-users. Today, it should take no more than 10 seconds (at the very most) for a user to open an application and start using it.
Batch report requests should take less than 30 seconds. Anything that takes longer than 5 seconds should be accompanied by feedback. Users need to know how long they should expect an operation to take to be complete, otherwise, they’ll think something isn’t working. Anything longer than 10 seconds will cause users to start looking for solutions on their own – causing them to close windows or restart applications. These longer delays require specific feedback indicating that a certain process takes longer than most normal operations.
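The UI thresholds above can be sketched as a small lookup in Python. The function name, the labels and the handling of exact boundary values (for example, whether exactly 1 second counts as acceptable) are assumptions made for this illustration; only the 0.2, 1, 5 and 10 second cut-offs come from the benchmarks described here.

```python
def classify_ui_response(seconds: float) -> str:
    """Map a UI response time to the experience described in the benchmarks."""
    if seconds < 0:
        raise ValueError("response time cannot be negative")
    if seconds <= 0.2:
        return "immediate"
    if seconds <= 1.0:
        return "acceptable"
    if seconds <= 5.0:
        return "noticeable"
    if seconds <= 10.0:
        return "show feedback"
    # Beyond 10 seconds, users may start closing windows or restarting,
    # so tell them this operation takes longer than normal.
    return "show feedback with expected duration"


for sample in (0.1, 0.8, 3.0, 7.0, 25.0):
    print(f"{sample:>5.1f} s -> {classify_ui_response(sample)}")
```

A check like this could run against timing data from real user sessions to flag screens that need a progress indicator.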
Internal alerts from monitoring tools should have very little latency – you want to make sure you see these alerts quickly. Any alerts being served to end-users should be visible in less than 10 seconds. And, if they take longer, they shouldn’t exceed 60 seconds overall. No matter what the error is, alerts should have some sort of timeout functionality associated with latency in order to make sure customers receive some type of notification in a timely manner.
Network latency involves both the application and the infrastructure. Ensuring a seamless connection between hardware and software takes constant monitoring and action. Anything less than 150 ms is an excellent user experience, anything between 150 ms and 300 ms is good, and anything over 300 ms should be designated as degraded service. Latency between UI and infrastructure should always be less than 50 ms. Higher latency results in slower, though likely less error-prone, performance, which can still be effective for larger deployments or stretched configurations. Keeping network latency at a good level is all about balancing speed and reliability.
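As a rough illustration, the three network tiers above could be encoded like this in Python. The enum and function names are hypothetical, and treating exactly 150 ms and exactly 300 ms as "good" is an assumption, since the benchmarks don't say which side the boundaries fall on.

```python
from enum import Enum


class NetworkTier(Enum):
    EXCELLENT = "excellent"
    GOOD = "good"
    DEGRADED = "degraded"


def classify_network_latency(latency_ms: float) -> NetworkTier:
    """Bucket a network latency sample into the tiers described above."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms < 150:
        return NetworkTier.EXCELLENT
    if latency_ms <= 300:
        return NetworkTier.GOOD
    return NetworkTier.DEGRADED


for sample_ms in (40, 150, 299, 450):
    print(f"{sample_ms} ms -> {classify_network_latency(sample_ms).value}")
```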
Server latency can occur due to the distance between servers and users, modem or router capacity, high levels of traffic, internet bandwidth or the way information is transmitted (e.g. optic fiber, wireless connections, etc.). Tracking latency between users and servers can help you see whether latency is being caused by server-side processes or client-side operations.
Continuous monitoring of the above metrics can show DevOps and IT teams exactly where latency issues lie. You can set benchmarks for healthy levels of latency and attach thresholds to alert on-call responders in case latency surpasses certain benchmarks. This way, you can continuously keep systems in check and fix issues as they come up in real-time.
Latency can fluctuate quickly and often. So, to avoid alert fatigue, it might be smart to only alert responders if latency stays at dangerous levels for a certain period of time. Customers not only expect constant uptime in today’s digital environment, they expect optimal performance and experiences. With less friction, end-users have better experiences with your applications – making them more likely to either purchase your product or suggest it to someone else.
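One way to alert only on sustained latency is to track when a breach started and fire once it has lasted long enough. The Python sketch below assumes a hypothetical `SustainedLatencyAlert` class, a 300 ms threshold and a 60 second hold period; none of these come from a specific monitoring tool, and most monitoring platforms offer an equivalent setting out of the box.

```python
from typing import Optional


class SustainedLatencyAlert:
    """Fire only when latency stays above a threshold for `hold_seconds`."""

    def __init__(self, threshold_ms: float, hold_seconds: float) -> None:
        self.threshold_ms = threshold_ms
        self.hold_seconds = hold_seconds
        self._breach_started: Optional[float] = None

    def observe(self, timestamp: float, latency_ms: float) -> bool:
        """Record one sample; return True if responders should be alerted."""
        if latency_ms <= self.threshold_ms:
            # Latency recovered, so any breach in progress is cleared.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.hold_seconds


alert = SustainedLatencyAlert(threshold_ms=300, hold_seconds=60)
samples = [(0, 350), (30, 400), (60, 320), (70, 100), (80, 500)]
for ts, ms in samples:
    print(f"t={ts:>3}s latency={ms} ms alert={alert.observe(ts, ms)}")
```

In this run only the sample at t=60 triggers an alert: the breach has lasted a full 60 seconds by then, while the brief spike at t=80 starts a new breach window and stays quiet.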
Once an alert is sent out based on reported latency, on-call teams need an efficient way to find applicable information and act on it. Alerts based on latency should have related monitoring data attached directly to them and served automatically to the team owning that area of the application or infrastructure. This way, the on-call responders know exactly what’s wrong and know where to look for remediation instructions.
Once latency has been restored to a healthy level and customers are no longer experiencing performance issues, the team can conduct post-incident reviews. What caused the latency spike? Was it a human error or system error? What kinds of redundancies or failsafe options can be installed to ensure this doesn’t happen again? Do you simply need more server capacity?
Post-incident reviews can help you identify the real root cause(s) of an incident – and it’s likely not one single thing. Latency can build up over time due to human, process and technology inefficiencies. Looking into an incident’s details and the collaboration that ensued during incident response, the team can learn from their mistakes and build more reliable customer experiences in the future.
Healthy benchmarks for latency can result in happier customers and employees. Greater visibility into server, network and application health will result in more reliable services and faster incident response when something does pop up. While latency is only one small part of analyzing a service’s health, it has important ramifications for understanding the way customers experience your service. You can even track average levels of latency over time at different levels of the system to see if you’re constantly improving the performance of your overall applications and infrastructure.
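Tracking average latency over time can be as simple as keeping a rolling window of recent samples. The Python sketch below is one possible approach, not a prescribed one; the `RollingLatency` class and the window size of three samples are hypothetical choices made to keep the example short.

```python
from collections import deque
from statistics import mean
from typing import Deque, Optional


class RollingLatency:
    """Keep the most recent `window` latency samples and report their mean."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # A bounded deque drops the oldest sample automatically when full.
        self._samples: Deque[float] = deque(maxlen=window)

    def add(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)

    def average(self) -> Optional[float]:
        """Return the mean of the current window, or None if it is empty."""
        if not self._samples:
            return None
        return mean(self._samples)


tracker = RollingLatency(window=3)
for sample_ms in (100, 200, 300, 400):
    tracker.add(sample_ms)
    print(f"added {sample_ms} ms, rolling average = {tracker.average()} ms")
```

Keeping one tracker per layer (server, network, application) makes it easier to see which part of the system is trending up or down.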
Every organization needs to define benchmarks for latency and communicate those internally across applicable engineering teams. DevOps and IT teams have a vested interest in maintaining uptime and building flexible, resilient systems that stand up to the toughest of tests. Defined latency benchmarks can show the performance of the overall engineering team, add visibility to customer experiences and communicate reliability concerns to stakeholders.
See how an integrated system for monitoring, alerting and on-call scheduling can help you fix latency issues faster. Try out a 14-day free trial of VictorOps or request a free personalized demo to learn more about detecting incidents faster, getting information into the right people’s hands, building more performant systems and maintaining higher levels of uptime.