The Response phase of the Incident Lifecycle is the delivery of a notification to an incident responder, via any means, and the first steps that responder takes to address the alert. This seems pretty simple: a threshold in Detection is passed, and an email/SMS/chat/phone/semaphore line/carrier pigeon is sent. Someone acknowledges receipt and opens their laptop.

When a team is focused on MTTR, every second they can reclaim has meaning. Yet that very simplicity can blind teams to the real opportunity this phase of the Incident Management lifecycle presents: focusing on their Response.

Getting the word out

Breaking this phase down, we can see that it consists of two specific parts: Notification and Response. Somehow, someone has to know that something is wrong, and they have to signal to everyone else that they are working the problem. The Response phase also includes the first few critical moments of a responder digging into the incident.

How a responder gets notified happens in any of the channels available to us today. Cellphones, SMS, email, landlines, chat systems, audible alerts, flashing lights, or the ever popular “person running down the hall screaming” are all valid methods of notification for an incident. In practice, teams implement notification across all of those channels. Increasingly, as teams adopt ChatOps tooling and workflows, they centralize communication from many sources in a chat client.

While VictorOps supports integrations with many of the popular ChatOps platforms (Google Voice, HipChat, Hubot), today I want to focus on our Slack integration. The basic setup is well described in the integration guide, but how can you extend this straightforward approach to really set your first responder up for success?

Enhanced notification

Receiving a basic notification via a Slack channel is pretty great:

There’s some basic information here, but not much to go on: an HTTP check is responding with a 503 error on a host. Each second spent in the Response phase contributes to the time-to-resolve for the incident. How can we get that responder further down the road?

An enhanced notification gives our incident responder a much better head start: a visual indication of the current state, plus links to relevant runbooks and dashboards. All the configuration, as we will see, is driven by convention, enabling this approach to scale to any alert and any system. Let’s take a look at the details.

Attachments in Slack

This additional contextual information is nicely presented via the use of Attachments in Slack. Formatting tastes vary, but attachments are the cleanest way to present a variety of information within the platform. As you saw in the baseline integration guide above, you must configure outbound webhooks in VictorOps to send the information to Slack. The provided example is fairly simple:
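As a sketch, a simple outbound webhook payload in Slack’s incoming-webhook format might look like the following. The ${{...}} variables are VictorOps alert-payload fields mentioned in this post; the exact template in your configuration may differ:

```json
{
  "channel": "${{ALERT.slack_channel}}",
  "username": "VictorOps",
  "text": "${{STATE.INCIDENT_NAME}} reported by ${{ALERT.monitoring_tool}}"
}
```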

As you can see, we’re passing dynamic information from the alert payload (${{STATE.INCIDENT_NAME}}, ${{ALERT.monitoring_tool}}, etc.), as well as the ${{ALERT.slack_channel}} field from the integration’s configuration steps, to route alerts to the proper Slack channel.

Given the increasing prevalence of infrastructure-as-code and monitoring-as-code techniques, we have a great basis to build on this simplicity. Any information present in the alert payload can be used to provide additional details to the on-call responder. Let’s take a look at the configuration that created our enhanced alert:
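Here is a representative sketch of such a payload, again in Slack’s attachment format. The runbook, Grafana, and image hosts (the example.com URLs) are hypothetical placeholders; substitute your own conventions:

```json
{
  "channel": "${{ALERT.slack_channel}}",
  "text": "${{STATE.INCIDENT_NAME}} reported by ${{ALERT.monitoring_tool}}",
  "attachments": [
    {
      "color": "danger",
      "fields": [
        {
          "title": "Runbook",
          "value": "https://wiki.example.com/runbooks/${{ALERT.alert_url}}",
          "short": true
        },
        {
          "title": "Dashboard",
          "value": "https://grafana.example.com/dashboard/db/${{STATE.HOST}}",
          "short": true
        }
      ],
      "image_url": "https://graphs.example.com/${{STATE.HOST}}/current.jpg"
    }
  ]
}
```

Because the links are assembled by convention from alert-payload fields, the same template serves every alert and host without per-alert configuration.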

As you can see, we use the “fields” parameter to create a formatted array of whatever might be useful for a responder dealing with an incident. In this case: runbooks, a link to Grafana dashboards, and an embedded image depicting current state.

We’ve used the current value of the field ${{ALERT.alert_url}} (in this case, http_check) to link to a specific runbook on what that alert means. We’ve also pulled in ${{STATE.HOST}} to land on dashboards for the associated hosts. Lastly, we use the Slack image_url parameter to link to a specific JPG depicting the current state.

Many paths to MTTR

As a team focuses on Response, many options present themselves. Focusing on specific notification channels, like SMS instead of email, is always an easy win. Ensuring consistent and well-understood escalation policies is another. Contextual data in incident messaging is a tougher nut to crack, but it presents a next-level technique for a team.

What’s most important to provide to the responder? Is it runbooks and dashboards, as I suggest here? A link to restart a service? Data from a deployment pipeline? An interface for managing infrastructure? The answer will vary by team and alert, but finding an answer for your team requires focused thinking.

Start small, and use your Post-Incident Analysis time to review how effective (or not) the new data in the alert is. An investment in ChatOps reaps rewards in more efficient Response and overall reduced MTTR. At VictorOps we recognize the key role ChatOps plays in Response and the other phases of Incident Management. Keep an eye out for more empowering features through our partnership with Slack soon!