Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 3 in a series on system visibility, the Detection and Analysis part of the Incident Management Lifecycle. If you missed them, read Part 1 and Part 2 first.
Ok, logging improvements: great for our team and codebase(!!), but there’s a bigger world out there for monitoring and instrumentation. We’re talking about observability here. It’s a world where we not only have the necessary information for troubleshooting and debugging, but also a detailed understanding of how the running system is performing, with alerts for when things go wrong.
Metrics help form a well-rounded approach to the monitoring and instrumentation of your systems. They focus on quantitative measurements of success, responsiveness, demand, saturation, etc. If log statements are for debugging your running system, then metrics are for picturing how the system is running and indicating when the system needs attention. Metrics are the system’s heartbeat.
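To make that concrete, here’s a minimal sketch of what “heartbeat” metrics can look like in code. It assumes the Prometheus Python client purely as an example (we’re not prescribing a particular stack), and the metric names and request handler are hypothetical:

```python
# A minimal sketch of "heartbeat" metrics, assuming the Prometheus Python client.
# Metric names and the request handler are hypothetical examples.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests handled")
ERRORS = Counter("app_request_errors_total", "Requests that failed")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request(work):
    start = time.time()
    try:
        work()                              # the real request handling goes here
        REQUESTS.inc()
    except Exception:
        ERRORS.inc()
        raise
    finally:
        LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(9100)                 # expose /metrics for a scraper to collect
    handle_request(lambda: time.sleep(0.02))
```

The point isn’t the library; it’s that every request leaves behind numbers (counts and latencies) that can be graphed and alerted on without a human reading anything.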
When I think of instrumentation, I think of a person (the “system”) with advanced hypertension who needs angioplasty to open several clogged blood vessels. After a stressful, harrowing stay in critical care, the team performs the surgery. Fortunately it’s successful, but the patient is weak and has a few weeks of recovery ahead.
But what if preventative health visits and monitoring (system metrics) had detected his high blood pressure earlier? And what if those earlier results were accompanied by a new diet, exercise, and medication plan? Maybe this patient wouldn’t have needed risky, reactive surgery at all.
Obviously this isn’t a direct correlation, but it’s a useful analogy for thinking about the visibility that we have into our own running systems. Do we know the request latency and error rate? The demand for and saturation of given components?
Now that we’re completing the big picture of monitoring and instrumentation using multiple methods, let’s examine how metrics complement a good log portfolio.
Logs are strings: Strings need to be interpreted. Interpretation typically requires humans. Humans are slow to respond.
Metrics are numbers: Numbers are immediately actionable. Actions can even be taken automatically.
Metrics have metadata: Metadata describes multiple dimensions. Dimensionality exposes additional trends, outliers, and correlations.
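To illustrate the metadata point, here’s a small sketch of a dimensional metric, again assuming the Prometheus Python client as an example; the label names and values are illustrative, not our actual schema:

```python
# Sketch of a dimensional (labeled) metric; label names and values are illustrative only.
from prometheus_client import Counter

HTTP_REQUESTS = Counter(
    "http_requests_total",
    "HTTP requests by endpoint, status class, and region",
    ["endpoint", "status", "region"],
)

# Each label combination becomes its own time series, so the same counter can be
# sliced by endpoint, status class, or region to spot trends and outliers that a
# single aggregate number would hide.
HTTP_REQUESTS.labels(endpoint="/incidents", status="2xx", region="us-east").inc()
HTTP_REQUESTS.labels(endpoint="/incidents", status="5xx", region="us-east").inc()
HTTP_REQUESTS.labels(endpoint="/login", status="2xx", region="eu-west").inc()
```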
We’re still iterating on our metrics infrastructure and portfolio, and through that process we’re identifying more KPIs (Key Performance Indicators) for our systems. For quite a while we’ve employed black-box style metrics, a.k.a. “show me what the customer is experiencing.”
Now we’re trying to answer deeper questions, like: “Now that I know the error rate is too high, which Service Level Indicators (SLIs) would help paint a more complete picture of my problem?” Our next step is to explore the white-box SLIs that answer these types of internal system questions. Overall, we want a mixture of black-box and white-box metrics, where the former alert on customer-experience issues and the latter point us to the subset of the system that needs troubleshooting or auto-remediation.
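As a back-of-the-envelope illustration, an error-rate SLI can be as simple as a ratio of counts compared against a target; the numbers and the 0.1% target below are hypothetical:

```python
# Hypothetical sketch: computing an error-rate SLI over a window and comparing it to a target.
def error_rate_sli(error_count, total_count):
    """Fraction of requests that failed over the measurement window."""
    return 0.0 if total_count == 0 else error_count / total_count

# Example: 240 errors out of 120,000 requests in the last 5 minutes.
sli = error_rate_sli(error_count=240, total_count=120_000)
slo_target = 0.001            # e.g. "no more than 0.1% of requests may fail"

if sli > slo_target:
    print(f"error rate {sli:.4%} exceeds target {slo_target:.2%} -- investigate")
```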
The goal is to auto-remediate issues that can be safely addressed without human involvement and alert on all remaining issues. So, when instrumentation points to a known problem, that problem is auto-remediated by some script or application and people don’t need to get involved in resolving the problem. This gives us response times well within seconds, and no human could both respond to and remediate the issue in a comparable timespan. For the remainder of scenarios, where human intervention is truly necessary, we want to alert on the problem as early as possible and, ideally, even before it becomes a problem.
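A sketch of what that routing might look like (the alert names, remediation commands, and paging stub are all hypothetical placeholders):

```python
# Hypothetical sketch: route known alerts to automatic remediation, page a human otherwise.
# The alert names, remediation commands, and paging stub are placeholders.
import subprocess

def restart_ingest_worker(alert):
    # A real remediation should be idempotent, carefully scoped, and well tested.
    subprocess.run(["systemctl", "restart", "ingest-worker"], check=True)

def clear_stuck_queue(alert):
    subprocess.run(["/opt/scripts/drain_queue.sh", alert["queue"]], check=True)

def page_on_call(alert):
    print(f"paging on-call for {alert['name']}")   # stand-in for a real paging integration

REMEDIATIONS = {
    "ingest_worker_wedged": restart_ingest_worker,
    "queue_backlog_stuck": clear_stuck_queue,
}

def handle_alert(alert):
    action = REMEDIATIONS.get(alert["name"])
    if action is not None:
        action(alert)          # known problem: remediated in seconds, no human involved
    else:
        page_on_call(alert)    # unknown problem: get a person looking as early as possible
```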
Alerts and auto-remediation are great, but it’s also helpful to have a visualization of your running system. Well-built, intuitive dashboards are great tools for a first responder to troubleshoot and pinpoint a problem by scanning for anomalies. Specifically, we’re talking about dashboards that make anomalies easy to identify without needing a full set of tribal knowledge or a PhD. So, please, be careful what you do with all of that valuable insight into your system. Not every metric needs an associated alert (or you’ll soon end up with alert fatigue from too much noise). Aim for intelligent alert thresholds and include them in your dashboards.
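One simple way to keep thresholds “intelligent” is to require that a metric stays over its threshold for a sustained window before firing, which cuts down on flapping. A rough sketch, with hypothetical values:

```python
# Rough sketch: only alert when a metric stays above its threshold for a full window,
# which cuts down on noisy, flapping alerts. Values are hypothetical.
from collections import deque

class SustainedThreshold:
    def __init__(self, threshold, window):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def observe(self, value):
        """Record a sample; return True only if every sample in the window breaches."""
        self.recent.append(value)
        return (len(self.recent) == self.recent.maxlen
                and all(v > self.threshold for v in self.recent))

# e.g. alert only if p99 latency exceeds 750 ms for 5 consecutive 1-minute samples
p99_latency_ms = SustainedThreshold(threshold=750, window=5)
```

The same threshold can then be drawn as a line on the corresponding dashboard panel, so responders can see at a glance how close the system is to alerting.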
Be careful, however: dashboards require a human’s attention in order to be useful…just like logs (at least the ones without alerts). It’s both a waste of your time and highly error-prone to assume that each person or shift staring at that dashboard will know how to accurately interpret each panel and every time series displayed. This means that care must be taken not only to measure your metrics but to clearly visualize and indicate their value, both via dashboards and alerts.
Ideally, we want to build out preventative health care that’s not just packed full of instrumentation, but full of useful indicators of system health: the KPIs. If a metric isn’t used by the team through alerts or dashboards, it’s likely not useful. If a metric never makes it to a dashboard or alert for visibility’s sake, it will likely become a metric you have to clean up later. These metrics should be removed, and we should be cognizant (but not fearful) of new metrics falling into the same bucket.
Want to kick it up a few notches? Look into systems that can perform anomaly detection for your teams. This allows you to focus on more complex problems, build auto-remediation, fix newly identified issues, etc.
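If you want a feel for the idea before adopting a tool, a toy version is a rolling z-score that flags samples far from the recent mean; this is illustrative only, not production-grade anomaly detection:

```python
# Toy anomaly detector: flag samples far from the rolling mean (in standard deviations).
# Illustrative only; real anomaly detection handles seasonality, trends, and sparse data.
import statistics
from collections import deque

class RollingZScore:
    def __init__(self, window=60, threshold=3.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def is_anomaly(self, value):
        anomalous = False
        if len(self.samples) >= 10:                  # need some history first
            mean = statistics.mean(self.samples)
            stdev = statistics.pstdev(self.samples) or 1e-9
            anomalous = abs(value - mean) / stdev > self.threshold
        self.samples.append(value)
        return anomalous

detector = RollingZScore()
for latency_ms in [20, 22, 19, 21, 23, 20, 22, 21, 19, 20, 400]:
    if detector.is_anomaly(latency_ms):
        print(f"anomalous sample: {latency_ms} ms")
```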
Want to move into uncharted territory? Give sonification a try, or at least check out this cool podcast from Science Friday: Listening in on scientific data. “When it comes to analyzing scientific data, there are the old standbys, plots and graphs. But what if instead of poring over visuals, scientists could listen to their data—and make new discoveries with their ears?”
Although we won’t dive into the topics of distributed tracing and eventing, it’s important to mention how they can add additional viewports into your system. Tracing breaks down the request path through your system by describing both the path and the timing at each stage. Eventing allows you to perform offline analysis of events that occurred in your system, in full detail. This is similar to logs, but intended for computational processing rather than human processing.
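To give a flavor of tracing without pulling in a real tracer, a toy span just records a name, a parent, and timing along the request path; everything here is a simplified stand-in:

```python
# Toy illustration of a trace span: records where time went along the request path.
# Real systems would use a tracing library and propagate context across services.
import time
import uuid
from contextlib import contextmanager

@contextmanager
def span(name, trace_id, parent=None):
    span_id = uuid.uuid4().hex[:8]
    start = time.time()
    try:
        yield span_id
    finally:
        elapsed_ms = (time.time() - start) * 1000
        print(f"trace={trace_id} span={span_id} parent={parent} {name}: {elapsed_ms:.1f} ms")

trace_id = uuid.uuid4().hex[:8]
with span("handle_request", trace_id) as root:
    with span("db_query", trace_id, parent=root):
        time.sleep(0.02)
    with span("render_response", trace_id, parent=root):
        time.sleep(0.01)
```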
I’ve covered how logs and metrics play huge parts in adding observability to your systems–with tracing and eventing also fitting in to the big picture. Logs provide you with detailed diagnostic information and sometimes a clear enough indication of a problem that you can create alerts off them. Metrics provide you with quantitative insights into the running system and are also an ideal source of alerts. Remember: don’t take on the hard work of adding observability, logs, dashboards, alerts, and anomaly detection to your platform without making these things truly useful or actionable.
Finally, don’t forget to pay attention to those annoying and often below-the-radar gaps in your observability portfolio, like the logging issue we had in our backend systems at VictorOps. You might find that alleviating these annoyances leads to happier, more enthusiastic teams with a renewed stake in creating better systems.