
Back to Basics: Current Monitoring and Alerting Best Practices

Dan Holloran March 13, 2018


Most DevOps and IT teams know the importance of system monitoring. However, there is still very little standardization when setting up monitoring and alerting systems. Every company, team, and individual operates differently. The responsibilities of a monitoring system include collecting the right data, detecting symptoms, and alerting on what matters. As companies approach the daunting task of setting up a system that meets their needs, it quickly becomes clear that there isn’t really a one-size-fits-all solution.

Understanding your software takes many forms, and sometimes the best place to start is not at the beginning but at the end. That is to say, first figure out what you would like to know about your system, then set up your monitoring and alerting tools to answer those questions. These are the current best practices for organizations looking to be less reactive and more proactive with monitoring and alerting:

You can check out all of the amazing system monitoring tools that integrate with VictorOps here.

Identify Current and Potential Problems

System monitoring begins with understanding what’s currently happening and knowing what could potentially go wrong. You need to know what’s going on with your system right now, but you also need to know the possibilities for what might happen within your system and how this will affect your team and, ultimately, your customers.

The responsibilities of a monitoring system don’t end with symptom detection. Good monitoring practice will not only notify you of a problem, it will also help you diagnose it and trace it to the likely point of failure. With this knowledge, your team can also implement auto-remediation tools that limit human intervention and resolve incidents quickly. Applying monitoring tools that give the entire organization a transparent and observable view of your systems empowers your team to recognize both vulnerabilities and capacities. This visibility ultimately helps you optimize systems in order to make them stronger and more flexible.

Collect Actionable Metrics

As we mentioned earlier, monitoring data comes in many forms. Effective monitoring requires the ability to collect the metrics essential to system and organizational uptime. These metrics need to be collected and shared with the proper team members so they can use this information to improve the system in a timely manner. Collecting intelligent metrics gives your DevOps and IT teams the ability to identify, diagnose, and resolve an incident more quickly.

Define and categorize the metrics which are important to your organization. One recommendation would be to categorize these metrics according to internal versus external information.

  • Internal: Monitoring throughput, success, error, and overall performance of a system. These metrics are incredibly important for overall system functionality and ensure the systems or networks on which your organization depends are available and actively doing what they were set up to do.

  • External: Monitoring latency, saturation, errors, and traffic. External metrics are metrics that will be influenced by external factors (most likely your customers). Monitoring these will help you reconstruct a detailed picture of a system’s state after an incident.
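As a rough illustration of the external metrics listed above, the sketch below summarizes one observation window of request samples into traffic, error rate, latency, and saturation. The `RequestSample` type, the `summarize` function, the 95th-percentile choice, and the `capacity_rps` parameter are all assumptions made for this example; they are not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    duration_ms: float  # time taken to serve the request
    ok: bool            # True when the request succeeded


def summarize(samples: List[RequestSample],
              window_seconds: float,
              capacity_rps: float) -> Dict[str, float]:
    """Summarize one window into traffic, errors, latency, and saturation.

    Assumes window_seconds and capacity_rps are both greater than zero.
    capacity_rps is an assumed, pre-measured ceiling for the service.
    """
    total = len(samples)
    if total == 0:
        return {"traffic_rps": 0.0, "error_rate": 0.0,
                "latency_p95_ms": 0.0, "saturation": 0.0}

    failures = sum(1 for s in samples if not s.ok)
    durations = sorted(s.duration_ms for s in samples)

    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * total + 99) // 100
    p95 = durations[rank - 1]

    traffic = total / window_seconds
    return {
        "traffic_rps": traffic,                # requests per second
        "error_rate": failures / total,        # fraction of failed requests
        "latency_p95_ms": p95,                 # slow-request latency
        "saturation": traffic / capacity_rps,  # share of assumed capacity in use
    }
```

Reporting a high percentile rather than an average is one common choice here, since averages tend to hide the slow requests that customers actually notice.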

Arming engineers with accurate, organized data is the first step in incident resolution. Once your engineers are better equipped with an understanding of what’s wrong, they can use the information to take the appropriate action. Actions could include acknowledging an incident, escalating an issue, getting additional engineers involved, or re-routing the incident to a different team. Establishing and monitoring KPIs ensures engineers will take action on prioritized issues to reduce downtime and, in turn, improve your overall engineering workflow.
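The actions named above (acknowledge, escalate, involve more engineers, re-route) could be expressed as a small decision function. This is only a sketch under stated assumptions: the `Incident` fields, the severity scale where 1 is most severe, and the 15-minute escalation threshold are hypothetical, and a real team would tune the rules to its own on-call policy.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ACKNOWLEDGE = "acknowledge"
    ESCALATE = "escalate"
    ADD_RESPONDERS = "add_responders"
    REROUTE = "reroute"


@dataclass
class Incident:
    """Hypothetical incident record, for illustration only."""
    service: str
    owning_team: str             # team that actually owns the service
    assigned_team: str           # team the incident was paged to
    severity: int                # assumed scale: 1 is the most severe
    minutes_unacknowledged: int  # time since the page went out


def next_action(incident: Incident, escalate_after_minutes: int = 15) -> Action:
    """Pick the next step for an incident using simple, ordered rules."""
    # Paged to the wrong team: send it to the owners first.
    if incident.assigned_team != incident.owning_team:
        return Action.REROUTE
    # Nobody has responded within the allowed time: escalate.
    if incident.minutes_unacknowledged >= escalate_after_minutes:
        return Action.ESCALATE
    # Most severe incidents get extra engineers involved right away.
    if incident.severity == 1:
        return Action.ADD_RESPONDERS
    return Action.ACKNOWLEDGE
```

The rule order is a design choice: re-routing comes first because escalating within the wrong team would not bring the incident any closer to resolution.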

What happens after the incident? Check out our O’Reilly book on Post-Incident Reviews.

Actionable Alerts

Alerts are essential to monitoring. They allow teams to identify problems anywhere in the infrastructure and support rapid remediation, which minimizes downtime for business-critical systems.

Unfortunately, alerts don’t always communicate what they’re supposed to and aren’t always as effective as they should be. Effective monitoring and alerting comes down to painting a detailed picture of your system and giving your team contextual information, through runbooks, when they need it. Once your team has the information they need, they can set up alerts that are timely and pertinent. Your team should not be confused by the data or the alerts they receive; they should be able to take this information and turn it into solutions.
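One way to make this concrete is to treat an alert as incomplete until it carries the context a responder needs. The sketch below assumes a hypothetical `Alert` record with a symptom description, a runbook link, and an owning team; the field names and the example URL are illustrative and do not reflect any specific alerting product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    """Hypothetical alert definition, for illustration only."""
    title: str
    symptom: str = ""      # what the responder will observe
    runbook_url: str = ""  # where the diagnosis and fix steps live
    owner: str = ""        # team expected to respond


def missing_context(alert: Alert) -> List[str]:
    """Return the names of the context fields that are still empty."""
    required = ("symptom", "runbook_url", "owner")
    return [name for name in required if not getattr(alert, name).strip()]


def render(alert: Alert) -> str:
    """Format the alert as the message a responder would receive."""
    gaps = missing_context(alert)
    if gaps:
        raise ValueError("alert is not actionable, missing: " + ", ".join(gaps))
    return (f"[{alert.owner}] {alert.title}\n"
            f"Symptom: {alert.symptom}\n"
            f"Runbook: {alert.runbook_url}")


# Example usage with made-up values.
disk_alert = Alert(
    title="Disk usage above 90%",
    symptom="Writes to the log volume are close to failing",
    runbook_url="https://example.com/runbooks/disk-usage",
    owner="platform",
)
message = render(disk_alert)
```

Rejecting alerts that lack a runbook or an owner at definition time is one possible policy; it keeps confusing, context-free pages from ever reaching the on-call engineer.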

That’s why VictorOps provides you with the necessary transparency and access to the right integrations, tools, and information to bolster your incident management system. Allow your entire team to find exactly what they need, when they need it. Sign up for your free trial to see how your essential monitoring and alerting tools can work with VictorOps.

Let us help you make on-call suck less.
