Application health is at the core of your business. Application or service downtime brings lost revenue and negative customer experiences. Regular application health checks and effective application monitoring allow you to detect issues before they become full-fledged outages. If you’re not focused on maintaining resilient applications, you’re not focused on the business’s bottom line. DevOps and IT managers need to work together to proactively monitor and test applications and their dependencies from end to end, leading to more robust software.
Application health checks need to span across the entire system – from frontend to IT infrastructure. Additionally, you can leverage monitoring in production and throughout the release management pipeline to detect incidents before they affect customers. A combination of manual application health monitoring and automated checks will lead to the most efficient system for incident detection, response and remediation. Not only should you use application health monitoring as a way to identify issues but you should also be building an action plan in case an incident does pop up.
So, let’s take a look at application health monitoring and application health checks to see how they work together and how teams are proactively building more robust systems.
Application health monitoring is the practice of tracking the inputs and outputs of an application based on key metrics, logs and traces in order to watch how an application performs over time.
Application health checks define “healthy” parameters for the metrics you monitor across your application and run at regular intervals to confirm the system is performing the way it’s expected to.
Naturally, before you can implement thorough application health checks, you need to ensure you’ve built a comprehensive monitoring plan. In addition to application monitoring, you need to effectively monitor IT infrastructure such as servers and networks to track a service’s overall health. Even a “healthy” application could still experience downtime if the supporting infrastructure has its own issues. So, you’ll need to combine application monitoring with IT monitoring to conduct holistic health checks and deepen observability across all systems.
When it comes to application monitoring, there’s no single way to do it. Every DevOps or IT team has built its systems differently and therefore needs to approach monitoring and health checks differently. So, best practices in application health monitoring typically refer to common things you should look at when building your own strategy:
Many teams will track latency and apply a simple formula to track overall user satisfaction. Define a goal for “health” when it comes to the length of time it should take for web requests or transactions to execute in your application. Then, measure those on a scale between great (fast), good (okay), bad (slow) and failed. You can then assess the overall success of transactions and requests across your entire application and determine how customers really experience your services.
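One common version of this "simple formula" is an Apdex-style score. The sketch below is a minimal illustration, not a definitive implementation: the `Request` type, the 0.5-second target and the sample data are all assumptions made for the example. It treats "great" requests as satisfied, "good" ones as tolerated (counted at half weight), and both "bad" and failed requests as frustrated.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float      # how long the request took, in seconds
    failed: bool = False   # True if the request ended in an error


def apdex_style_score(requests, target_s=0.5):
    """Return a satisfaction score between 0.0 and 1.0, or None with no data.

    target_s is a hypothetical "healthy" response time; pick your own.
    """
    if not requests:
        return None
    satisfied = tolerating = 0
    for r in requests:
        if r.failed:
            continue                      # failures count as frustrated
        if r.duration_s <= target_s:
            satisfied += 1                # "great": at or under the goal
        elif r.duration_s <= 4 * target_s:
            tolerating += 1               # "good": slower, but acceptable
        # anything slower is "bad" and adds nothing to the score
    return (satisfied + tolerating / 2) / len(requests)


sample = [
    Request(0.2),
    Request(0.4),
    Request(1.2),
    Request(3.0),
    Request(0.1, failed=True),
]
score = apdex_style_score(sample)
print(score)
```

A score close to 1.0 means most requests met the goal. Tracking the same score per service or per endpoint shows where customers are having the worst experience.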
As part of user satisfaction, you generally just need to track response times across the entire ecosystem. Where are there delays and why would that be happening? Identifying benchmarks for latency and tracking it through the entire architecture will show where there are lags in performance and how it could potentially carry through the system into customer experiences.
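To make the idea of a latency benchmark concrete, here is a small sketch that compares the 95th-percentile response time of each hop in a request path against one benchmark. The hop names, the sample timings and the 250 ms benchmark are hypothetical, and the nearest-rank percentile is only one of several reasonable definitions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, grouped by where they were measured.
latencies_ms = {
    "frontend": [40, 45, 50, 55, 60],
    "api": [80, 90, 100, 110, 400],
    "database": [20, 22, 25, 30, 35],
}

BENCHMARK_MS = 250  # assumed "healthy" ceiling for p95 latency

p95_by_hop = {hop: percentile(values, 95) for hop, values in latencies_ms.items()}
slow_hops = [hop for hop, p95 in p95_by_hop.items() if p95 > BENCHMARK_MS]

print(p95_by_hop)
print(slow_hops)
```

Percentiles tend to be more useful than averages here, because a single slow outlier (the 400 ms API call above) is exactly the kind of lag that carries through to customers.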
How much traffic is going through your system at a given time? Are there predictable times of day where the traffic will spike? How do you scale the application and infrastructure to handle peak traffic? You can track the rate of requests and transactions in your application to understand when you’re getting the most traffic and prioritize where you need to improve scalability and add flexibility.
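As a rough sketch of tracking request rate, the snippet below buckets request timestamps by hour of day and reports the busiest hour. The timestamps are made-up sample data; in practice they would come from your access logs or metrics store.

```python
from collections import Counter
from datetime import datetime


def requests_per_hour(timestamps):
    """Count requests for each hour of the day (0-23)."""
    return Counter(ts.hour for ts in timestamps)


def peak_hour(timestamps):
    """Return (hour, request_count) for the busiest hour."""
    counts = requests_per_hour(timestamps)
    return max(counts.items(), key=lambda item: item[1])


# Hypothetical request timestamps.
sample = [
    datetime(2020, 3, 2, 9, 5),
    datetime(2020, 3, 2, 9, 40),
    datetime(2020, 3, 2, 13, 15),
    datetime(2020, 3, 2, 13, 20),
    datetime(2020, 3, 2, 13, 55),
    datetime(2020, 3, 2, 22, 30),
]

hour, count = peak_hour(sample)
print(f"Peak hour: {hour}:00 with {count} requests")
```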
How often do requests fail? How many logged errors are coming from the application and what’s the number of exceptions being thrown? Tracking error rates over time across disparate services can create a more observable application and allow you to more easily report on overall performance. Identifying frequent errors in production quickly can lead to faster incident response and remediation – fixing problems before customers even notice them.
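One simple way to track an error rate over time is a sliding window over the most recent requests. This is a minimal sketch under that assumption; the class name and the window size are illustrative, and a production system would more likely rely on its monitoring tool's own rate calculations.

```python
from collections import deque


class ErrorRateWindow:
    """Error rate over the most recent `size` requests."""

    def __init__(self, size=100):
        # A bounded deque drops the oldest outcome once it is full.
        self.outcomes = deque(maxlen=size)

    def record(self, ok):
        """Record one request outcome: True for success, False for failure."""
        self.outcomes.append(ok)

    def error_rate(self):
        """Fraction of recorded requests that failed (0.0 when empty)."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)


window = ErrorRateWindow(size=10)
for ok in [True] * 8 + [False] * 2:
    window.record(ok)
rate_during_incident = window.error_rate()

# Ten more successful requests push the failures out of the window.
for _ in range(10):
    window.record(True)
rate_after_recovery = window.error_rate()

print(rate_during_incident, rate_after_recovery)
```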
How often are you spinning up new instances of your application? What are the overall effects on the infrastructure and what can you do to ensure more performant auto-scaling? You can track certain metrics such as number of server instances, CPU usage or ETL and associate auto-scaling rules with those metrics to ensure the system automatically grows as the business grows. Automation in scalability and cloud application health monitoring is about finding the balance between controlling the costs of auto-scaling cloud infrastructure such as AWS or GCP and still meeting customer expectations.
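To illustrate what an auto-scaling rule tied to a metric might look like, here is a sketch of a proportional rule: scale the instance count by the ratio of observed CPU usage to a target, then clamp the result. This is not the AWS or GCP API. The function name, the 60% target and the instance limits are assumptions chosen for the example.

```python
import math


def desired_instances(current, cpu_percent, target_cpu_percent=60.0,
                      min_instances=2, max_instances=20):
    """Suggest an instance count that would bring average CPU near the target.

    min_instances protects availability; max_instances caps cost.
    """
    raw = math.ceil(current * cpu_percent / target_cpu_percent)
    return max(min_instances, min(max_instances, raw))


print(desired_instances(4, 90))    # running hot: scale out
print(desired_instances(4, 15))    # mostly idle: scale in, but not below the floor
print(desired_instances(10, 150))  # overloaded: scale out, but not past the cap
```

The floor and the cap are where the cost-versus-experience balance shows up: the cap limits cloud spend, while the floor keeps enough capacity that customers don't feel a sudden spike.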
As mentioned above, CPU usage is commonly tracked in application health monitoring, and so is disk usage. You can set thresholds around those metrics and run regular application health checks to make sure they’re consistently at healthy levels.
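A threshold-based health check can be as small as the sketch below. The threshold values and metric names are hypothetical. `shutil.disk_usage` is part of the Python standard library, while the CPU figure is assumed to come from whatever monitoring agent you already run.

```python
import shutil

# Hypothetical "healthy" ceilings; tune these for your own systems.
THRESHOLDS = {"cpu_percent": 85.0, "disk_percent": 90.0}


def disk_percent_used(path="/"):
    """Percentage of the disk at `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_health(metrics, thresholds=THRESHOLDS):
    """Compare each metric with its threshold and report any breaches."""
    breaches = {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    }
    return {"healthy": not breaches, "breaches": breaches}


# Fixed sample values keep the example repeatable; in a real check you
# might pass disk_percent_used() for the disk figure instead.
result = check_health({"cpu_percent": 40.0, "disk_percent": 95.0})
print(result)
```

Running a check like this on a schedule, and recording each result, gives you the "consistently at healthy levels" signal rather than a one-off snapshot.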
At the end of the day, though, customers simply expect the application to work. So, you should track application availability and uptime over time and use that information to determine service level agreements (SLAs) with customers. An SLA can tell a customer how often they can expect the application to be available (e.g. 99.9% of the time). If the service ever drops beneath a given SLA, many companies will offer some sort of credit to the customer. Availability is the be-all and end-all metric for showing the overall reliability of your product at a single glance.
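It helps to translate an SLA percentage into actual minutes. The sketch below works through the 99.9% example for a 30-day period. The 30-day window and the 60 minutes of recorded downtime are assumptions for illustration; your SLA may define its measurement period differently.

```python
MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def availability_percent(total_minutes, downtime_minutes):
    """Share of the period during which the application was available."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(sla_percent, total_minutes):
    """Downtime budget implied by an SLA over the given period."""
    return total_minutes * (1 - sla_percent / 100.0)


budget = allowed_downtime_minutes(99.9, MINUTES_IN_30_DAYS)
measured = availability_percent(MINUTES_IN_30_DAYS, downtime_minutes=60)
sla_met = measured >= 99.9

print(f"Downtime budget: {budget:.1f} minutes")
print(f"Measured availability: {measured:.2f}%")
print(f"SLA met: {sla_met}")
```

Under these assumptions, a 99.9% SLA leaves roughly 43 minutes of downtime per 30 days, so a single hour-long outage is enough to miss it.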
With better monitoring, you can begin to run better application health checks. You can determine thresholds based on the metrics and logs you track and automate health checks around those numbers. It’s important to spend time reviewing the importance of individual services and how often health checks should run through the system and report back into a dashboard of some kind. As with most other DevOps disciplines, automation in the application health monitoring and health checks process can create more efficient teams and surface more information faster.
Once you’re running regular application health checks and consistently monitoring the right parts of your applications and infrastructure, you can link up your alerting and on-call incident response workflows. If an application health check fails and a metric surpasses a certain threshold, you can automatically notify an on-call responder to jump onto the issue. And, with more automation, you can automatically serve that alert to the person or team that owns the given service.
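The routing step described above can be sketched roughly as follows. Everything named here is hypothetical: the ownership map, the team names and the `notify` callback stand in for whatever alerting tool you use, and none of it represents a real VictorOps API.

```python
# Hypothetical ownership map: which team is on call for which service.
OWNERS = {"checkout": "payments-team", "search": "search-team"}
DEFAULT_ON_CALL = "platform-on-call"  # fallback when a service has no owner


def route_alert(service, metric, value, threshold, notify):
    """Notify the owning team if a metric breaches its threshold.

    Returns the recipient that was notified, or None if the check passed.
    `notify` is any callable taking (recipient, message).
    """
    if value <= threshold:
        return None
    recipient = OWNERS.get(service, DEFAULT_ON_CALL)
    message = f"{service}: {metric}={value} exceeded threshold {threshold}"
    notify(recipient, message)
    return recipient


sent = []  # collects notifications so the example has no side effects


def record_notification(recipient, message):
    sent.append((recipient, message))


route_alert("checkout", "error_rate", 0.12, 0.05, record_notification)
route_alert("search", "error_rate", 0.01, 0.05, record_notification)
route_alert("reports", "cpu_percent", 97.0, 85.0, record_notification)

for recipient, message in sent:
    print(recipient, "<-", message)
```

The fallback recipient matters: a failing check on a service nobody has claimed should still reach a human.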
Dashboards are essential to application health monitoring and knowing when health checks are failing or succeeding. Visualizations will help on-call engineers quickly digest information, see exactly where problems lie and identify when they first started. Dashboards allow you to extend visibility into disparate services as well as overall system health in a centralized place. So, more people have access to pertinent information and can efficiently escalate issues and collaborate cross-functionally with the proper engineering and IT teams.
Regular application health checks and comprehensive monitoring allow DevOps and IT teams to build more resilient systems. You can proactively detect issues in production and report them to the person who can fix the problem. With automation, all of this can be done in seconds – often allowing engineering teams to remediate incidents before they can affect customers. Fewer customer-impacting incidents lead to higher availability, happier customers and more revenue.
Integrate a centralized on-call alerting and incident response tool with your regular application health monitoring systems to fix issues even faster. Sign up for a 14-day free trial of VictorOps or request a demo to learn how we make on-call suck less and how we can help you make the most of your application health checks.