In a world of rapid software delivery and CI/CD (continuous integration and delivery), things break. Servers go down, third-party services fail, and new code in production can cause unforeseen incidents. So, effective monitoring and alerting are imperative to maintaining highly reliable services. With context appended to alerts, on-call responders can quickly identify which services are having issues and route those alerts to the right people. Then, you can continuously improve operations through post-incident reviews – creating a proactive system for monitoring, alerting and incident response.
Armed with the right information, highly collaborative on-call teams can rapidly respond to incidents and fix issues faster. By learning how the system functions over time and improving the way the team works together, you can start proactively addressing reliability concerns and provide useful context to responders more quickly. Then, you can build out an incident response plan that works for nearly any incident that pops up for your DevOps or IT team.
So, we wanted to dive into some monitoring and alerting best practices, how you can add context to alerts, and some methods for proactive incident response.
Because every system is built on a different stack and no two DevOps or IT teams are built the same, there’s no single way to implement a monitoring and alerting strategy. But, there are general monitoring and alerting best practices that teams can look at to ensure a highly reliable release pipeline. So, let’s take a look at some practices that, when adopted by any team, will result in better monitoring, alerting and incident response.
Before tracking any metrics or logs, you first need to determine which information is most important to your business. Then, you should prioritize these metrics and categorize them based on internal and external information.
Internal monitoring metrics include things such as throughput, success and error rates, and overall performance of internal systems. These metrics show how technical systems are interacting, their performance, and the overall health of your systems and networks. Internal metrics should show you the availability of your services and whether or not they’re actively doing what they’re supposed to.
External metrics are metrics influenced by external factors (most likely customers or integrated third-party services). Monitoring external metrics like latency, saturation, traffic and errors (SRE’s four golden signals) can help you see a detailed picture of a system’s state before, during and after an incident.
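As a sketch of what tracking those signals might look like, here is a minimal, self-contained Python recorder for the four golden signals. All names are illustrative, and the fixed `capacity` used as a saturation proxy is an assumption – in practice you would export these through a monitoring tool rather than roll your own:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenSignals:
    """Toy in-memory tracker for latency, traffic, errors and saturation."""
    latencies_ms: list = field(default_factory=list)
    requests: int = 0          # traffic: total requests seen
    errors: int = 0            # failed requests
    in_flight: int = 0         # current concurrent requests
    capacity: int = 100        # assumed concurrency limit (illustrative)

    def record(self, latency_ms: float, ok: bool) -> None:
        self.requests += 1
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def p95_latency_ms(self) -> float:
        # Nearest-rank 95th percentile over everything recorded so far.
        s = sorted(self.latencies_ms)
        if not s:
            return 0.0
        return s[min(len(s) - 1, int(0.95 * len(s)))]

    def saturation(self) -> float:
        return self.in_flight / self.capacity

sig = GoldenSignals()
sig.record(120.0, ok=True)
sig.record(340.0, ok=False)
print(sig.error_rate())  # 0.5
```

In a real system, these numbers would come from instrumentation emitted by your services and be aggregated by your monitoring stack, not computed by hand like this.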
Combined with log monitoring, internal and external metrics can help teams build a fully observable system. With both black-box and white-box monitoring metrics, DevOps and IT teams are able to see exactly where issues pop up – whether they’re in applications, infrastructure or networks. Then, the team needs an incident response plan so that, in an emergency, it can recover rapidly from failures or downtime.
Along with determining what to monitor, teams need to know which services are most important. How many other features are integrated with a certain tool or service? What are the implications if an application, tool or service fails? Determining which aspects of your system are most important to maintaining a reliable product will help you determine how intense your monitoring needs to be. Then, you can start to assess the priority of different alerts and how/when those alerts need to go out to on-call responders.
Knowing the most important parts of your system will help you prioritize alerts and determine the severity of incidents in production. Then, you can make sure notifications are sent to the right on-call teammates at the right times – avoiding unnecessary 4 AM wake-up calls. The more time you spend tuning monitoring and alerting tools and assessing the importance of different pieces of your system, the less on-call will suck – without harming service reliability.
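To make that concrete, here is a hedged sketch of a severity- and time-aware routing rule. The channel names and business hours are assumptions for illustration, not a real alerting API:

```python
from datetime import time as dtime

# Illustrative routing rule: page a human only for high-severity alerts;
# lower-severity alerts outside business hours wait in a queue instead
# of waking someone up at 4 AM.
BUSINESS_START, BUSINESS_END = dtime(9, 0), dtime(18, 0)

def route(severity: str, alert_time: dtime) -> str:
    if severity == "critical":
        return "page-oncall"       # always notify the on-call responder
    if BUSINESS_START <= alert_time <= BUSINESS_END:
        return "chat-channel"      # visible but not disruptive
    return "morning-queue"         # review when the workday starts

print(route("critical", dtime(4, 0)))  # page-oncall
print(route("warning", dtime(4, 0)))   # morning-queue
```

Real alerting tools express the same idea declaratively as routing rules keyed on severity, schedule and team.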
Setting up monitoring tools and classifying the severity of alerts is just the first step. Now, you need to find ways to make overall system health visible and offer transparency into useful information. Attaching runbooks, charts, logs, metrics and traces to automated alerts from your monitoring tools can offer useful context to on-call responders. This way, the on-call person or team can see exactly what’s wrong and quickly share information across multiple other teams. Transparency into system health and collaborative human workflows can lead to rapid on-call alerting and incident response.
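As one illustration of attaching that context programmatically, the sketch below enriches a raw alert with a runbook link, a dashboard link and recent log lines before it reaches a responder. Everything here – the `enrich_alert` helper, the lookup tables and the URLs – is hypothetical:

```python
# Hypothetical lookup tables mapping a service to its operational context.
RUNBOOKS = {
    "payments-api": "https://wiki.example.com/runbooks/payments-api",
}
DASHBOARDS = {
    "payments-api": "https://grafana.example.com/d/payments",
}

def enrich_alert(alert: dict, recent_logs: list) -> dict:
    """Attach runbook, dashboard and log context to a raw alert payload."""
    service = alert["service"]
    return {
        **alert,
        "runbook": RUNBOOKS.get(service),
        "dashboard": DASHBOARDS.get(service),
        "recent_logs": recent_logs[-5:],  # last few log lines for context
    }

raw = {"service": "payments-api", "severity": "critical",
       "summary": "error rate above 5%"}
enriched = enrich_alert(raw, ["timeout calling card processor"])
print(enriched["runbook"])
```

The point isn't this particular payload shape – it's that a responder should never have to hunt for the runbook, dashboard or logs after being paged.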
With transparency into metrics, system health and workflows, it’s easier for DevOps and IT teams to identify potential issues – often before they happen. Also, the team can see which services are likely to self-heal over time and can optimize alerts accordingly. If a service is likely to self-correct in thirty minutes, the on-call team probably shouldn’t get notified the second a threshold is surpassed. If the metric remains out-of-whack for more than thirty minutes, then you can alert an on-call user.
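The "wait before alerting" idea above can be sketched as a sustained-breach gate: fire only if the metric has stayed above threshold for the full hold period. Monitoring tools such as Prometheus support this natively via a hold duration on alerting rules; this standalone Python version is purely illustrative:

```python
class SustainedAlert:
    """Fire only when a metric stays above threshold for a hold period,
    giving self-healing services a chance to recover quietly."""

    def __init__(self, threshold: float, hold_seconds: float):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.breach_started = None  # timestamp when the current breach began

    def observe(self, value: float, now: float) -> bool:
        """Return True when an alert should fire."""
        if value <= self.threshold:
            self.breach_started = None   # recovered: reset the clock
            return False
        if self.breach_started is None:
            self.breach_started = now    # breach just started
        return now - self.breach_started >= self.hold_seconds

# 5% error-rate threshold, 30-minute hold (values are illustrative).
gate = SustainedAlert(threshold=0.05, hold_seconds=30 * 60)
print(gate.observe(0.08, now=0))        # False: breach just started
print(gate.observe(0.08, now=31 * 60))  # True: sustained for 31 minutes
```

Note that a single dip below the threshold resets the clock – exactly the behavior you want for services that routinely self-correct.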
This thought process results in an on-call culture that doesn’t suck – one where people are only notified when it’s absolutely necessary. Constant monitoring allows you to see which services are frequently notifying on-call users and which ones are more stable – helping you prioritize production issues and fix major problem areas first – creating more resilient architecture.
Without thorough analysis after the fact, you can’t continuously improve your monitoring, alerting and incident response processes. By regularly conducting post-incident reviews, you can learn how people, processes and technology interact and continuously improve operations. You can find ways to improve collaboration between developers and IT operations while simultaneously ensuring monitoring and alerting stays up-to-date. Post-incident reviews need to surface actionable tasks such as updating runbooks, changing an on-call schedule, adjusting alert routing rules or tweaking a monitoring tool.
Think of post-incident reviews as a way to analyze everything from technical systems to human interactions. A thorough post-incident review will shed light not only on technical monitoring and alerting blind spots but on workflow transparency issues. It’s also important that post-incident reviews are not focused on blame but are focused on exposing information and creating actionable takeaways.
With all of your learnings and the implementations of the above monitoring and alerting best practices, you can now put together a cohesive incident response plan. From incident detection to post-incident analysis, the team needs to know where to find information and how they should collaborate. DevOps-centric teams are great at automating manual tasks, opening lines of communication across multiple teams and improving transparency across all operations. Building an effective response plan is about asking the right questions to understand the way your systems and people function.
How do you communicate with internal and external stakeholders during a firefight? How do you communicate internally? Where are you recording the work that happens during a firefight – are you only tracking human communication or are you also tracking technical system changes? What can you do to surface context to on-call responders faster? Building a collaborative incident response plan will take time – and it’s never perfect – but if you dedicate time to continuous improvement then you’ll eventually have more resilient systems and a cohesive plan for incident response.
Contextualizing monitoring and alerting data is about more than simply serving metrics to people when they need them. Context is also about showing what the system data means. Runbooks and other instructions can show on-call responders exactly what an incident means to the system as a whole – and what it means to the people currently on-call. Instead of spending time digging through multiple tools, log files and metrics dashboards, the on-call team will see what’s wrong faster – allowing the team to fight fires, not find them.
A combination of preparedness and analysis leads to a proactive incident management process. Instead of fighting fires, the team now spends more time developing new features and services. A combination of transparency, automation, collaboration and analysis leads to a flexible, robust system and a team ready for anything. Armed with more information and a more resilient system, you can now build SRE and DevOps teams that can dedicate time to building reliability features and proactively running application and infrastructure tests.
Shortened feedback loops between developers and IT teams lead to more time writing and deploying new code. Instead of reviewing deployments, fixing production issues, re-writing code and running manual QA tests, DevOps teams are automating numerous tests and functions throughout the SDLC (software development lifecycle). And, when a process can’t be automated, DevOps organizations are constantly finding ways to improve collaboration and visibility between people.
But, DevOps doesn’t stop at engineering and IT. DevOps principles feed into marketing, sales and customer support teams – leading to better business decisions and more transparency across all business units. While DevOps improves the release speed and service reliability of software delivery for engineering teams, it also exposes more opportunities to the business teams. With rapid deployments and more transparency into features being shipped to production environments, business teams can deliver value to customers faster and more effectively.
A collaborative DevOps culture combined with a comprehensive monitoring and alerting strategy leads to proactive incident response and faster, more reliable software development. Get a free copy of our latest eBook with Catchpoint, 6 Ways to Transform Your Monitoring and Incident Response, and see exactly how you can build your own proactive monitoring and incident response plan.