Bringing non-traditional Ops folks, including developers, on-call can be a tricky process. Initial reactions tend to be highly polarized: either total pushback and refusal, or a meek acceptance coupled with fear of the unknown. For the former, understanding the root of the refusal is useful. For the latter, providing clarity and training is important.
For those unfamiliar with Incident Management, there are some common misconceptions that fuel a fear of accepting on-call responsibilities. Chief among those are:
- I’m going to be woken up for no real reason (false positives)
- I’ll miss a call and get in trouble (escalations)
- I’ll get alerts for things I can’t fix (actionable alerts)
Today we’ll tackle the first two in a working example. Using a fully configured (if simple) Incident Management system we can see how to combat false positives, and provide a safety net for our newly minted on-call team.
Anyone who has worked alongside or knows an Ops professional likely has a basic grasp of Incident Management. Something goes wrong (with a Monitored system), some other thing notices it (Monitoring system), a notification makes a phone go beep beep beep (Incident Management system), and your friend or coworker has to drop what they’re doing and respond.
Let’s dive into each of those in a bit more depth.
Monitored systems are just that - the servers, switches, routers, instances, containers, clusters, or databases being monitored. Monitoring can generally be broken into three categories: System, Application, and Business Objective (or passive).
- System monitors tend to be generic and apply to the system resource consumption: CPU, Memory, Storage usage or capacity, Network, etc. System monitors are historically the realm of the Ops team.
- Application monitors are more specific, and tend to test certain conditions or workflows in the application - Homepage returns an HTTP 200, Page Load time is < 1 s, etc. Application monitors tend to be more interesting to developers.
- Business objective monitors measure business performance as an indirect way to gauge application or infrastructure health. Some common examples are transactions/sec or new user registration/hour.
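To make the application monitor example concrete, here is a minimal Python sketch of a homepage check that applies the two conditions mentioned above (HTTP 200, page load under 1 second). The names `HomepageResult`, `probe`, and `evaluate_homepage` are hypothetical, and collapsing every failure into a single "CRITICAL" state is an assumption made for brevity, not a description of any particular monitoring product.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class HomepageResult:
    status_code: int      # HTTP status, or 0 if the request never completed
    load_seconds: float   # wall-clock time spent fetching the page


def probe(url: str, timeout: float = 5.0) -> HomepageResult:
    """Fetch the page once and record its status code and load time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code          # server answered, but with an error status
    except OSError:
        status = 0                 # connection refused, DNS failure, timeout, ...
    return HomepageResult(status, time.monotonic() - start)


def evaluate_homepage(result: HomepageResult, max_load_seconds: float = 1.0) -> str:
    """Return "OK" only for an HTTP 200 that loaded in under the limit."""
    if result.status_code != 200:
        return "CRITICAL"
    if result.load_seconds >= max_load_seconds:
        return "CRITICAL"
    return "OK"
```

Keeping the evaluation separate from the network call means the thresholds can be exercised without a live web server, for example `evaluate_homepage(HomepageResult(200, 0.4))`.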
All types of monitored system checks share some basic characteristics:
- Check frequency - how often the check runs
- Thresholds - how the check gauges health/failure
- Maximum failures - let the check fail N times before triggering an incident
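The three characteristics above can be sketched as a small data structure. This is an illustration only: the `Check` class, its field names, and the "value above threshold means failure" rule are assumptions chosen for the example, and a real monitoring system would also own the scheduling implied by the frequency.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    frequency_seconds: int         # how often a scheduler would run the check
    threshold: float               # a value above this counts as a failure
    max_failures: int              # consecutive failures needed to trigger an incident
    consecutive_failures: int = 0  # running count, reset by any healthy result

    def record(self, value: float) -> bool:
        """Record one result; return True when an incident should be triggered."""
        if value > self.threshold:
            self.consecutive_failures += 1
        else:
            self.consecutive_failures = 0
        # Fire exactly once, on the Nth consecutive failure.
        return self.consecutive_failures == self.max_failures


# Hypothetical CPU check: a brief spike does not page anyone, a sustained one does.
cpu = Check(name="cpu", frequency_seconds=300, threshold=90.0, max_failures=3)
triggered = [cpu.record(v) for v in [95, 97, 50, 92, 93, 99]]
# triggered == [False, False, False, False, False, True]
```

The healthy reading in the middle resets the counter, which is what keeps transient failures from turning into incidents.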
Of course, there is significant variance in the way systems are monitored. Active/passive checks, log-based checks, data inference, and many other approaches are used to isolate a basic truth: Things are OK, or Things are Not OK.
Monitoring systems are the systems that actually run the checks, receive the results, compare each result against the configured thresholds, and trigger an Incident. Oftentimes monitoring systems are also Monitored systems (Quis custodiet ipsos custodes?). Monitoring systems tend to be fairly complicated to configure, and offer highly flexible rule processing. Some examples include Nagios, Zabbix, Datadog, and Zenoss.
A monitoring system may include an Incident Management system, or simply integrate with one. Incident Management systems, like VictorOps, apply a secondary logical operation to an Incident. Based on configured rules, an Incident Management system:
- Determines which team should be alerted to this particular incident
- Determines who from that team is currently on-call
- Chooses which method of notification is appropriate
- Manages the state of the Incident and handles Escalations or secondary workflows
Concepts in Action
Enough words, let’s have some pictures. For our purposes, I’ve got a very basic setup including one monitored system (test-webhost), one monitoring system (Nagios), and one Incident Management system (VictorOps) with two teams.
What Nagios, or any monitoring system, enables is some built-in protection against false positives. Anyone facing on-call is rightly concerned about interruptions that “aren’t real”, or that are triggered by transient failures instead of real incidents. This is where a properly configured monitoring system is important.
In our example, Nagios is configured to check http on test-webhost. While the Nagios documentation can be intimidating, the salient pieces for my example are here:
    check_interval 5
    max_check_attempts 3
    retry_interval 1
Here’s our false positive protection: Nagios will check this service every 5 minutes (check_interval). Upon a failed check, Nagios will retry in 1 minute (retry_interval), and it will not send a notification of failure until it registers 3 consecutive failures (max_check_attempts).
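The arithmetic behind those three settings is worth spelling out. The sketch below is a simplified model written in Python, not Nagios code: the function names are hypothetical, and it assumes the first failed check counts as attempt 1 and ignores check execution time and scheduling jitter.

```python
def failure_timeline(retry_interval, max_check_attempts, first_failure_at=0):
    """Return (minute, attempt) for each failed check up to the notification."""
    return [
        (first_failure_at + retry_interval * (attempt - 1), attempt)
        for attempt in range(1, max_check_attempts + 1)
    ]


def minutes_to_notification(retry_interval, max_check_attempts):
    """Minutes between the first failed check and the notification."""
    return retry_interval * (max_check_attempts - 1)


def worst_case_detection(check_interval, retry_interval, max_check_attempts):
    """Minutes from the start of an outage to the notification, assuming the
    outage begins right after a successful regular check."""
    return check_interval + minutes_to_notification(retry_interval, max_check_attempts)


timeline = failure_timeline(retry_interval=1, max_check_attempts=3)
# timeline == [(0, 1), (1, 2), (2, 3)]
delay = minutes_to_notification(retry_interval=1, max_check_attempts=3)   # 2
worst = worst_case_detection(check_interval=5, retry_interval=1, max_check_attempts=3)  # 7
```

Under this model, a failure has to persist for about 2 minutes after it is first seen before anyone is notified, and a real outage is reported within roughly 7 minutes. That is the trade: a short delay in exchange for filtering out blips.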
Meanwhile, VictorOps is configured with two teams: devTeam and escalations. Our Nagios integration is configured to route alerts for test-webhost to the devTeam through the magic of route keys.
Pretty standard weekly rotation for a 3-member team:
Route keys simply match a value from the Monitoring System to a team within the Incident Management system:
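As a rough sketch of what routing and rotation amount to, the Python below maps a route key to a team and then picks the on-call member for a given date. Everything specific here is hypothetical: the route key `"web"`, the member names, and the rotation start date are placeholders, and this is not how VictorOps stores or evaluates its configuration.

```python
from datetime import date

# Hypothetical route key sent by the monitoring system -> team name.
ROUTE_KEYS = {"web": "devTeam"}

# Hypothetical 3-member weekly rotation, in hand-off order.
ROTATIONS = {"devTeam": ["alice", "bob", "carol"]}

# Assumed start of the first shift (a Monday).
ROTATION_START = date(2024, 1, 1)


def team_for(route_key):
    """Return the team a route key maps to, or None if nothing matches."""
    return ROUTE_KEYS.get(route_key)


def on_call(team, today):
    """Return the member whose week contains the given date."""
    members = ROTATIONS[team]
    weeks_elapsed = (today - ROTATION_START).days // 7
    return members[weeks_elapsed % len(members)]


team = team_for("web")                      # "devTeam"
person = on_call(team, date(2024, 1, 10))   # second week of the rotation -> "bob"
```

The point of the indirection is that the monitoring system only needs to know a route key; who actually gets paged is decided by the schedule at the moment the incident arrives.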
Additionally, the devTeam is configured for some escalations. Escalations can work in a variety of ways, but typically the idea is “if no action takes place in X minutes, automatically notify someone else.” Escalations are the safety net of any properly implemented Incident Management system. Ultimately humans need someone to have their back, and automated escalations are the answer.
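The “if no action takes place in X minutes” rule can be sketched as an ordered list of steps. This is a minimal Python illustration under stated assumptions: `EscalationStep`, `POLICY`, and `who_to_notify` are hypothetical names, and the 5-minute delay mirrors the value used in this walkthrough rather than any default.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    after_minutes: int   # minutes without action before this step fires
    notify: str          # who gets notified at this step


# Hypothetical policy: page the on-call person first, then the escalations team.
POLICY = [
    EscalationStep(after_minutes=0, notify="devTeam on-call"),
    EscalationStep(after_minutes=5, notify="escalations team"),
]


def who_to_notify(policy, minutes_without_action):
    """Return everyone who should have been notified by this point."""
    return [
        step.notify
        for step in policy
        if step.after_minutes <= minutes_without_action
    ]


early = who_to_notify(POLICY, 3)   # ["devTeam on-call"]
late = who_to_notify(POLICY, 5)    # ["devTeam on-call", "escalations team"]
```

A missed page therefore does not mean a missed incident: the policy simply moves on to the next step.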
Beep Beep Beep
For grins, I’ve triggered an alert by disabling http on the test-webhost. Here you can see Nagios working through the check_interval/retry_interval logic: it has marked the service “CRITICAL”, but has not triggered an incident. The 6th column shows that check 1/3 has failed:
Once the check reaches max_check_attempts (3), Nagios forwards the alert to VictorOps, where the on-call team member (me) is notified. I’ll ignore the alert to force that escalation to kick in. First, the timeline indicates I’ve been contacted:
Then we see the escalations kick in 5 minutes later. This is the safety net for unacknowledged or unresolved incidents:
At last, I ACK the alert. Note how the Ack cancels further escalations. Call off the hounds!
For an on-call team, ACK(nowledge) is shorthand for “I’ve got this”. An ACK by no means implies a fix, or a resolution of the incident. It simply communicates to the team, and the Incident Management system, that the alerted person is responsive and on the case. In many implementations, the ACK is the final word on escalations. In others, an ACK silences further escalations at first, but a separate set of escalations kicks in if the incident remains unresolved for a longer period.
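Both ACK behaviors described above fit in one small decision function. The Python below is a sketch under stated assumptions: the `Incident` fields, the `should_escalate` name, and the timing parameters are hypothetical, and times are plain minute counts to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    triggered_at: int                  # minute the incident was opened
    acked_at: Optional[int] = None     # minute someone said "I've got this"
    resolved_at: Optional[int] = None  # minute the incident was resolved


def should_escalate(incident, now, unacked_after=5, unresolved_after=None):
    """Decide whether to notify someone else at minute `now`.

    unresolved_after=None models "the ACK is the final word"; a number models
    a second set of escalations for incidents that stay open after an ACK.
    """
    if incident.resolved_at is not None:
        return False                                  # nothing left to escalate
    if incident.acked_at is None:
        return now - incident.triggered_at >= unacked_after
    if unresolved_after is None:
        return False                                  # ACK cancels escalations
    return now - incident.acked_at >= unresolved_after


ignored = Incident(triggered_at=0)
acked = Incident(triggered_at=0, acked_at=6)

should_escalate(ignored, now=5)                        # True: nobody responded
should_escalate(acked, now=60)                         # False: ACK is final
should_escalate(acked, now=70, unresolved_after=60)    # True: open too long
```

Whichever variant a team chooses, it is worth writing down, since it sets expectations for what an ACK commits the responder to.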
For those seeking closure, the check recovered (after restarting the web server).
I’ve tried to provide some basic transparency here into the ways an Incident may be triggered, and how that Incident may cause your phone to make annoying noises at 3am. For anyone joining their first rotation, having faith in the systems behind the alerts is important. Engineers, as a rule, don’t do much on faith. As such, I encourage the Ops team, or whoever maintains the alerting systems, to be active partners with the entire on-call team. Give them clear explanations of how alerting works and under what conditions escalations trigger, and set clear expectations around responsiveness.