
Practical Tactics for Improving IT Alert Automation

Dan Holloran August 31, 2018


Alert automation might sound too good to be true, especially if you’re using some type of homegrown alerting solution. IT teams need to know there’s a better solution than manually alerting via email. In a manual, email-based system, IT professionals are inundated with alerts: some actionable, but many not. Alert automation can lead to more actionable notifications, better alert prioritization, improved incident collaboration, and additional alert context.

But keep in mind: automation for the sake of automation is not helpful. Alert automation needs to help manage notification routing and severity to avoid placing unnecessary stress on your IT or DevOps team members. Practical tactics and processes for alert automation equate to a better experience for your people. When people have a better on-call experience, incident response and resolution happen faster.

So, in order to save your teams from a few headaches and sleepless nights, we’ve put together a short list of practical tactics for improving DevOps and IT alert automation:

Tactic 1: Schedules

You’ll need a manually set schedule to get started. But through alert automation and follow-the-sun rotations, you can start to organize your alerts and strategically approach your on-call calendars. Alert automation can route alerts to the proper on-call person based on the schedule, and could even route the alert based on time zone or geographic location. This way, only the IT or DevOps team(s) responsible for fixing an issue are alerted.
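To make schedule-based routing concrete, here is a minimal Python sketch of a follow-the-sun rotation. It is an illustration only, not how any particular product implements scheduling: the `Shift` class, the `on_call_for` function, the responder names, and the 9-to-5 shift hours are all hypothetical, and fixed UTC offsets are assumed for simplicity (a real schedule would need to handle daylight saving time and overrides).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Shift:
    responder: str           # hypothetical responder or team name
    utc_offset_hours: float  # fixed offset from UTC (assumption: no DST handling)
    start_hour: int          # local hour the shift starts (inclusive)
    end_hour: int            # local hour the shift ends (exclusive)


# Hypothetical three-region rotation, each covering local business hours.
ROTATION = [
    Shift("us-on-call", -6, 9, 17),
    Shift("india-on-call", 5.5, 9, 17),
    Shift("europe-on-call", 2, 9, 17),
]


def on_call_for(now_utc, rotation=ROTATION, fallback="global-fallback"):
    """Return the first responder whose local shift covers `now_utc`."""
    for shift in rotation:
        local_tz = timezone(timedelta(hours=shift.utc_offset_hours))
        local = now_utc.astimezone(local_tz)
        if shift.start_hour <= local.hour < shift.end_hour:
            return shift.responder
    # Nobody is inside business hours, so page the fallback responder.
    return fallback


alert_time = datetime(2018, 8, 31, 16, 0, tzinfo=timezone.utc)
print(on_call_for(alert_time))
```

Because the alert is matched against each region's local time, an incident raised at any hour lands with someone who is in their working day whenever one exists, and only falls back to an after-hours page when no region is covered.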

Flexible schedules linked with your alert automation will allow you to take on-call shifts from another individual, without breaking the alerting rules. Again, don’t simply think of how alert automation impacts the alert, but how it impacts a person’s incident response. Scheduling optimization, in association with automated alerting, makes on-call suck less and improves quality of life for IT professionals and DevOps engineers.

Tactic 2: Escalations

Scheduling is essential functionality, but adding automated alert escalation makes sure that alerts don’t fall through the cracks. If the on-call person doesn’t respond in a timely manner, an alert can automatically escalate to the next person or team to make sure that problems are acknowledged. Automation makes it easier for people to spend less time moving the alert around and more time working on the incident itself.
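As a rough sketch of what time-based escalation can look like, the Python below models a policy as an ordered list of steps. Everything here is hypothetical and chosen for illustration: the `EscalationStep` class, the `targets_to_page` function, the target names, and the 15- and 30-minute delays are assumptions, not settings taken from any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str         # hypothetical person or team to page
    after_minutes: int  # minutes since the alert fired before this step applies


# Hypothetical policy: page the primary at once, then widen the circle.
POLICY = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("secondary-on-call", 15),
    EscalationStep("engineering-manager", 30),
]


def targets_to_page(minutes_since_trigger, acknowledged, policy=POLICY):
    """Return everyone who should have been paged by now for an unacked alert."""
    if acknowledged:
        # Once someone has acknowledged the alert, escalation stops.
        return []
    return [
        step.target
        for step in policy
        if minutes_since_trigger >= step.after_minutes
    ]


print(targets_to_page(20, acknowledged=False))
```

The key design choice is that acknowledgment short-circuits the policy, so nobody further up the chain gets woken once a responder has picked up the incident.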

A system that allows both automated and manual escalations works best. Your automation needs to work in a way that boosts your people’s productivity and quality of life. But, when dealing with unknown unknowns, you can’t plan to automate every little thing. Adding manual escalation functionality helps for incidents that may be difficult to organize through automation.

Give the on-call person the ability to tag teammates who need to get involved, as well as the ability to manually escalate incidents in your incident response timeline. The point of effective escalations and schedules is to get the correct people involved at the right times–and the best way to achieve this is through a hybrid model of manual processes and automation.

Tactic 3: Visibility

Automation should surface issues to the proper parties. With little to no human interaction, incidents should be routed to the proper person and give visibility to any other affected parties. Anyone who needs to interact with the alert should be able to access the information they need, when they need it.

By setting monitoring thresholds to automatically send alerts when ETL lags or disk usage spikes, you get deeper visibility into your system’s inner workings. Over time, you can give IT and DevOps teams the visibility they need to identify problems and remediate larger issues, sometimes even before they occur. Both ITSM and DevOps teams can benefit from more visibility via alert automation.
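For a sense of how threshold-based alerting works, here is a small Python sketch that checks a metrics sample against warning and critical levels. The metric names (`etl_lag_seconds`, `disk_used_percent`), the threshold values, and the `Threshold` and `evaluate` names are all assumptions made up for this example; sensible values depend entirely on your own systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str      # hypothetical metric name
    warning: float   # value at or above which a WARNING is raised
    critical: float  # value at or above which a CRITICAL is raised


# Hypothetical thresholds for the two examples mentioned above.
THRESHOLDS = [
    Threshold("etl_lag_seconds", warning=300, critical=900),
    Threshold("disk_used_percent", warning=80, critical=95),
]


def evaluate(sample, thresholds=THRESHOLDS):
    """Return (metric, severity, value) tuples for every breached threshold."""
    alerts = []
    for threshold in thresholds:
        value = sample.get(threshold.metric)
        if value is None:
            continue  # metric not reported in this sample
        if value >= threshold.critical:
            alerts.append((threshold.metric, "CRITICAL", value))
        elif value >= threshold.warning:
            alerts.append((threshold.metric, "WARNING", value))
    return alerts


print(evaluate({"etl_lag_seconds": 1200, "disk_used_percent": 85}))
```

Separating a warning level from a critical level is what lets a team see a problem building before it becomes an outage, rather than only hearing about it once the disk is full.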

Tactic 4: Communication

As you’ve probably learned through a number of our other posts, collaboration is a big deal in DevOps. But, the ability to communicate well and collaborate is also a big deal for IT teams–and quite frankly, in much of life. Alert automation, within a collaborative incident management platform, allows people to spend more time communicating around issues rather than escalating and routing notifications.

Some ChatOps tools can help automate and optimize communication around alerts. ChatOps tools can allow you to tag teammates, run action scripts, or even set up automatic alert routing based on keywords, all within the chat application. Create an end-to-end collaborative incident response platform with alert automation and integrated chat tools that improve the human element of incident workflows.
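Keyword-based routing is simple enough to sketch in a few lines of Python. This is a generic illustration rather than the behavior of any particular chat tool: the keywords, team names, and the `route_by_keyword` function are hypothetical.

```python
import re

# Hypothetical keyword-to-team table; the first matching keyword wins.
KEYWORD_ROUTES = [
    ("database", "dba-team"),
    ("payments", "payments-team"),
    ("dns", "network-team"),
]


def route_by_keyword(message, routes=KEYWORD_ROUTES, default="triage"):
    """Pick a team by looking for whole-word keywords in the alert message."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    for keyword, team in routes:
        if keyword in words:
            return team
    # No keyword matched, so send the alert to a default queue.
    return default


print(route_by_keyword("Checkout latency spike on Payments API"))
```

Matching on whole words instead of substrings avoids accidental hits, and the default queue makes sure an alert with unfamiliar wording still reaches a human.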

Tactic 5: Context

A simple alert telling you something’s wrong isn’t a comprehensive solution. Adding context to your alerts is necessary for speeding up incident diagnosis and resolution. When setting up incident workflows, you can set up automated alert routing and escalation, but you can also use automation to transform payload information and make it actionable in the context of the incident.

Through alert automation functionality like our Transmogrifier, you can customize the behavior of incidents, set certain alert conditions, and establish actions to be triggered when those conditions are met. Automatically routing, escalating, and transforming alerts are the three key components of creating relevant, contextual incident workflows.
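To illustrate the general idea of condition-and-action rules, here is a generic Python sketch. It is not the Transmogrifier's actual rule syntax or API: the `Rule` class, the `apply_rules` function, the field names (`entity_id`, `routing_key`, `runbook`), and the example URL are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Rule:
    field: str        # payload field the condition inspects
    pattern: str      # shell-style wildcard pattern, e.g. "db-*"
    set_fields: dict  # fields to add or overwrite when the condition matches


def apply_rules(alert, rules):
    """Return a copy of `alert` enriched by every rule whose condition matches."""
    enriched = dict(alert)  # leave the incoming payload untouched
    for rule in rules:
        # Conditions are checked against the original payload values.
        value = str(alert.get(rule.field, ""))
        if fnmatchcase(value, rule.pattern):
            enriched.update(rule.set_fields)
    return enriched


# Hypothetical rule: route database alerts and attach a runbook link.
rules = [
    Rule(
        field="entity_id",
        pattern="db-*",
        set_fields={
            "routing_key": "dba-team",
            "runbook": "https://example.com/runbooks/db-disk",
        },
    )
]

alert = {"entity_id": "db-01/disk", "message_type": "CRITICAL"}
print(apply_rules(alert, rules))
```

A single rule here both routes the alert and attaches context, which is the point: the responder opens the incident and already has the runbook in hand.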

Craftsy and Alert Automation

Create collaborative end-to-end alerting and incident response workflows in VictorOps. Sign up for a 14-day free trial to see how IT and DevOps teams are leveraging collaborative incident management to set schedules, automate alerts, and remediate problems.

Let us help you make on-call suck less.
