Improving Incident Response

Tara Calihman - September 29, 2015

As I’ve mentioned before, Monitorama is an amazing conference for the quality of its content, its single-track format and the opportunity to meet lots of smart people. One of those smart people gave a talk that was all about incident management.

Curt Micol, from Simple, covered different aspects of how they handle incidents internally. It was a timely presentation as we’ve been implementing our own incident response protocol around here. He talked about how they have created their Incident Command Organization, which is made up of an Incident Commander (whoever is on-call), an Incident Signaller (a communications person) and an Incident Engineer (whoever is going to fix the problem).

By training their people to handle communication and coordination, Simple has built up their employees’ confidence and empowered them to work together to solve the problem. If you are interested in hearing what Curt has to say, I encourage you to watch his talk because it is chock-full of hard-earned wisdom.

Monitorama PDX 2015 - Curt Micol - Incident Management and the Incident Complexity Framework from Monitorama on Vimeo.

Recently, several companies have had big issues and then waited a long time to tell their customers. To prepare for incidents and prevent that kind of delay, we’ve created a Crisis Communication team. We’ve taken note of what’s worked at Simple and built our team from different units in the company, including Ops, IT and Marketing.

Here’s what else we’re doing…

 – Additional training. One of the members of our team attended a two-day course from the Institute of Crisis Management and got certified as a crisis communicator. She was able to dig into how other companies are handling events, learn about best practices and receive practical advice on talking to all stakeholders.

 – Practicing empathy. Everyone on the Crisis Communication team has been on-call before; some of us for many, many years. We’ve all been through the drills, we know how much it sucks and we can put ourselves in the shoes of our customers quite easily. This is tremendously helpful when picking the right path with our most valuable stakeholders.

 – Conducting post-mortems. We meet to talk about how we did after each incident, discussing what worked and coming up with action items to improve our process. It’s the only way to learn from your mistakes and become better.

 – Drinking our own Kool-Aid. Because the VictorOps incident timeline provides one place where you can easily get the context you need around the event, we use our own product (crazy!) to coordinate communication during a firefight. We also have a Slack room dedicated to Crisis Communication, and should our own platform go down (shudder), we can easily hop in there to keep the conversation going.

 – Creating transparent objectives. Simply put, we want to provide updates to our customers on what we know, when we know it. This is our mission statement, guiding principle and promise to everyone using VictorOps.

 – Maintaining preferred communications. Statuspage and our Twitter support handle are two simple but effective tools for updating our customers. By keeping these channels refreshed during an incident, we can tell our customers what’s happening and have a direct line of communication. You want to be talking in the places that your customers are hanging out and these two are essential for our audience.

In these days of social media ubiquity, you can’t just ignore an incident and hope that no one notices. You can either have your customers hear the news from you or they can hear it on Twitter from someone else. We’re embracing a top-down culture of transparency, which means that notifying our customers two weeks after an incident happens isn’t an option.

We’re also learning as we go, so if you have any feedback or ideas, we welcome them.