Be Prepared: Here’s Your Incident Management Checklist

Amanda Boughey March 16, 2018


When an incident occurs, regardless of severity, you need your incident management checklist ready so you can handle the issue quickly and smoothly. As with most things, the best way to resolve incidents is to plan and prepare ahead of time. Knowing what’s likely involved at every severity level before an incident occurs means you don’t waste precious remediation time.

The types of incidents occurring in your application are varied and unpredictable. The best case scenario is running into a small incident and simply needing the on-call engineer to jump in and fix the issue. Easy peasy. Put a bow on it and call it resolved.

Other issues are recurring and can be handled with auto-remediation. Depending on the structure of your alerting, these issues may or may not even produce an alert. If your application has an unstable environment, you may see notices of issues being auto-remediated as they are fixed without human involvement.
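As a rough illustration of that split, here is a minimal sketch of an alert handler that tries a known fix first and only pages a human when no fix applies. Every name in it (`Issue`, `restart_service`, `REMEDIATIONS`, `handle`) is hypothetical and not tied to any particular tool; the fix itself is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Issue:
    kind: str  # e.g. "service_down" (hypothetical issue type)
    host: str


def restart_service(issue: Issue) -> bool:
    # Placeholder fix: a real handler would call your own orchestration tooling
    # and report whether the restart actually succeeded.
    return True


# Hypothetical table of recurring issues that have a known automated fix.
REMEDIATIONS: Dict[str, Callable[[Issue], bool]] = {
    "service_down": restart_service,
}


def handle(issue: Issue, notices: List[str], pages: List[str]) -> str:
    """Try auto-remediation first; page the on-call engineer otherwise."""
    fix = REMEDIATIONS.get(issue.kind)
    if fix is not None and fix(issue):
        # Fixed without human involvement: record a notice, do not page.
        notices.append(f"auto-remediated {issue.kind} on {issue.host}")
        return "auto_remediated"
    # No known fix (or the fix failed): a human needs to look at it.
    pages.append(f"page on-call: {issue.kind} on {issue.host}")
    return "paged"
```

Whether the notice is surfaced as an alert, logged quietly, or dropped is a policy decision for your own alerting setup; the sketch only shows where that decision would sit.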

Complicated incidents take longer to resolve. Dealing with them takes you beyond the simple remediation stage of incident management, which is usually all that matters for simple or auto-remediated issues, and adds analysis and readiness activities to the mix.

To proactively prepare for incidents, you need several other tools in your toolbox. Not to mention, you’ll likely have more experts involved. With a larger incident, your firefighting team will include the incident commander, someone on the support team, and an on-call engineer (or two or three depending on who sees the issue occurring and jumps in to help).

When you run into these larger issues, you need:

  • Runbooks
  • Metric dashboards
  • Archived history
  • Devices
  • Centralized information

Runbooks

The best case scenario for an incident is having a runbook already created and attached. Runbooks help capture how a problem was solved previously, allowing you to quickly access and implement the best solution. Creating runbooks might be time-consuming, but taking the time away from your busy schedule to outline how you resolved an incident will help you and your team in the future.
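One way to picture a runbook being "attached" to an incident is a lookup that enriches each alert with a link before it reaches the responder. The sketch below assumes alerts are plain dictionaries with a `name` key and that runbooks live at wiki URLs; the `RUNBOOKS` table, the example URLs, and `attach_runbook` are all made up for illustration.

```python
from typing import Dict, Optional

# Hypothetical mapping from alert name to the runbook written after a past incident.
RUNBOOKS: Dict[str, str] = {
    "high_cpu": "https://wiki.example.com/runbooks/high-cpu",
    "disk_full": "https://wiki.example.com/runbooks/disk-full",
}


def attach_runbook(alert: dict, runbooks: Optional[Dict[str, str]] = None) -> dict:
    """Return a copy of the alert with a 'runbook_url' field added.

    The field is None when no runbook exists for this alert name.
    """
    table = RUNBOOKS if runbooks is None else runbooks
    enriched = dict(alert)  # copy so the caller's alert is left untouched
    enriched["runbook_url"] = table.get(alert.get("name"))
    return enriched
```

An alert that comes back with `runbook_url` set to `None` is a reasonable prompt to write that runbook once the incident is resolved.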

Metric Dashboards

It’s important to keep your finger on the pulse of how systems are changing while you’re trying to resolve an incident. Receiving an alert with information pointing to a metrics dashboard will help you quickly visualize whether the situation is improving or not. Being able to see how the system behaves while you’re trying to find a solution is the extra lens needed to get something up and running quickly.

Archived History

One of the easiest ways to stay sane when you’re managing an incident is to rely on history. Consider creating a wiki or saving a set of tabs in your browser so you can easily access all the resources you use during a firefight. You’ll want to have easy access to the servers you need to connect to, as well as quick access to any tools your team uses to help remediate issues. There’s a lot you’ll want saved away to simplify your incident management, so it’s smart to save it all in one location.

Devices

You can’t forget the obvious needs when you’re getting ready to fight your fire. You’ll need a VPN, your laptop and/or mobile device, and a tool to communicate and centralize what’s happening. With bigger incidents, the need to communicate can become a huge problem because you have more than one person involved in fixing the issue. Communication needs to be centralized across the entire team so information is provided to everyone involved and no one gets stuck fixing an issue that’s already being handled by another team member.

Centralized Information

Without a tool for centralized communication and information, you’re delaying the time to resolution. If you’re the incident commander and you need help in the middle of the night, you’ll likely start by texting your support or on-call engineer. Because it’s the middle of the night, that probably won’t work. So now you’re calling someone—waiting for them to wake up enough to understand the issue. You start to feel needy, don’t want to interrupt someone else’s night, and have a hard time getting the information across. In addition to this, you’re now switching back and forth between the system and the phone, so you’re losing time as you recalibrate to dive back into the incident.

Luckily, incident management tools help ease a lot of this pain. By implementing a tool like VictorOps, you’ll have quick access to everyone in an incident and you’ll get all the information in one location. You can simply tag the incident commander, assign something to support, or alert the on-call engineer without having to dig through a list of team members to see who is on-call at that time—VictorOps does that for you.
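To make the "who is on-call at that time" lookup concrete, here is a minimal sketch of resolving the current on-call engineer from a rotation schedule. It is not VictorOps’ API or implementation; `Shift` and `who_is_on_call` are hypothetical names, and the schedule is assumed to be a simple list of non-overlapping shifts with timezone-aware timestamps.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Shift:
    engineer: str
    start: datetime  # inclusive, timezone-aware
    end: datetime    # exclusive, timezone-aware


def who_is_on_call(shifts: List[Shift], at: datetime) -> Optional[str]:
    """Return the engineer whose shift covers `at`, or None if there is a gap."""
    for shift in shifts:
        if shift.start <= at < shift.end:
            return shift.engineer
    return None
```

A `None` result means the schedule has a coverage gap at that moment, which is worth catching before an incident rather than during one.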

Examine your own incident management checklist before your next incident occurs. Make sure you’re using all the right tools, processes, and methods to make your on-call life suck less.
