Will La - March 23, 2016
Let’s go over the integration between VictorOps and HipChat. The integration is bi-directional: anything entered in the VictorOps native chat shows up in HipChat, and vice versa.
It’s a DevOps best practice to have your team members chat all of their activity when triaging and resolving an IT incident. Doing so captures how the teams reasoned their way through the incident, which gives your postmortem process much more to work with. One team member may chat about which monitoring tools they dove into, while another may bring up which servers they have already reset.
If there are questions being asked between teams, it is good to have those on the record as well. This gives the teams more context as to why some actions were taken. All of this tribal knowledge is important when you are trying to improve your Incident Response Strategy.
[Note: Although face-to-face communication can speed things up, it is also hard to capture for the record. Leveraging the HipChat platform to collaborate will help teams record their activities.]
So your teams are chatting each other up in HipChat. Now what?
You’re recording! You can then put that chat log in the timeline alongside your alerts, which are already in chronological order. Fortunately, this is easy to do with the VictorOps and HipChat integration.
Timeline order you say?
Yes, let’s think about that for a moment. When your monitoring alerts and your chats are interleaved in sequential order, you accomplish two things. First, you record the technological events that occur within the stack together with the human events that occur at the same time. Second, you put these recorded events in order, which paints a better picture of what happened.
Here’s a simple example (most recent first):
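As a minimal sketch of the idea, assuming hypothetical alert and chat records (the `Event` type, the `build_timeline` helper, and the sample messages are all invented for illustration and are not part of the VictorOps or HipChat APIs), interleaving two chronological logs could look like this:

```python
from dataclasses import dataclass
from datetime import datetime
from heapq import merge


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str  # "alert" or "chat" (hypothetical labels)
    text: str


def build_timeline(alerts, chats):
    """Interleave two chronologically sorted event lists, most recent first."""
    # heapq.merge expects each input to already be sorted oldest-first.
    oldest_first = merge(alerts, chats, key=lambda e: e.timestamp)
    return list(oldest_first)[::-1]


# Invented sample data, for illustration only.
alerts = [
    Event(datetime(2016, 3, 23, 9, 0), "alert", "CRITICAL: web-01 not responding"),
    Event(datetime(2016, 3, 23, 9, 12), "alert", "RECOVERY: web-01 responding"),
]
chats = [
    Event(datetime(2016, 3, 23, 9, 3), "chat", "Checking the monitoring dashboard"),
    Event(datetime(2016, 3, 23, 9, 8), "chat", "Restarted the web service on web-01"),
]

timeline = build_timeline(alerts, chats)
for event in timeline:
    print(f"{event.timestamp:%H:%M} [{event.source}] {event.text}")
```

Read from the bottom up, the output shows the alert firing, what the humans did about it, and the recovery, all in one place.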
This is where the what meets the why. Why do we do all of this? When you have this information recorded in timeline order, you can begin to pick up feedback, insights and patterns into how you can improve the way your team responds to incidents. The ultimate goal here is to reduce downtime by responding to alerts more efficiently and effectively.
The feedback you receive comes in the form of seeing what worked and what didn’t. If a particular action never resolves the issue, you can deprioritize that action for this particular alert and save time by focusing on actions that work. If resetting the server doesn’t clear the alert, there is no point in resetting it over and over again; your team should be checking other places.
The same goes for the actions that do work. If a specific action resolves the issue, that’s great! You now have the action that led to the resolution on record, and next time you can try it first. This is the feedback cycle that helps the teams know what’s working and what’s not. Without it, you keep burning time repeating ineffective actions.
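To make that feedback cycle concrete, here is a rough sketch that tallies how often each action was followed by a resolution. It assumes someone has already pulled the actions out of the chat log into simple records; the record layout, the `rank_actions` helper, and the sample data are hypothetical.

```python
from collections import defaultdict

# Invented records: (alert name, action tried, whether the alert then resolved).
attempts = [
    ("web-01 down", "reset server", False),
    ("web-01 down", "reset server", False),
    ("web-01 down", "restart web service", True),
    ("web-01 down", "restart web service", True),
    ("web-01 down", "flush DNS cache", False),
]


def rank_actions(attempts, alert):
    """Return (action, success rate) pairs for one alert, best first."""
    tried = defaultdict(int)
    worked = defaultdict(int)
    for name, action, resolved in attempts:
        if name != alert:
            continue
        tried[action] += 1
        worked[action] += int(resolved)
    rates = {action: worked[action] / tried[action] for action in tried}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


ranking = rank_actions(attempts, "web-01 down")
for action, rate in ranking:
    print(f"{action}: resolved {rate:.0%} of the time")
```

Actions near the top of the ranking are candidates to try first the next time this alert fires; actions at the bottom are candidates to deprioritize.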
However, finding the resolution quickly is not always an option. Sometimes you don’t get a resolution, but you get clues. This is where **insights** kick in. When you find a clue, you want to chat (record) into HipChat what you did to find that clue: “I’ve looked at the DNS and it looks good, but this is weird…”
The team begins to leave behind a trail of how they eventually found the resolution. You can turn these insights into actionable steps in your response plan (runbook) or you can have discussions regarding the motivation behind these actions and in turn educate the team. Insights become a key factor in improving your processes since they will allow you to see new incident-response tactics.
When the clues lead nowhere, you can start by looking for patterns. Over time, you will begin to see multiple alerts with individual responses from the teams. Looking at the HipChat activity and the alerts in an overarching “super lens” across all of the incidents will give you a different view of how your teams are responding to the alerts.
You may see that alerts are handled better at certain times of day or when certain people are involved. You can use this pattern data to compare and contrast one set of incidents against another set with different teams involved. We hope you don’t have so many incidents that you get big sample sizes, but incidents happen. If a specific incident happens more than once, you want to capture what solutions teams try each time so you can look for patterns.
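One way to look for such patterns, sketched under the assumption that each incident has been reduced to a few fields (the hour the alert fired, the responding team, and the minutes to resolution, all invented here), is to group incidents and compare average resolution times:

```python
from collections import defaultdict
from statistics import mean

# Invented incident records: (hour the alert fired, responding team, minutes to resolve).
incidents = [
    (3, "team-a", 45),
    (4, "team-a", 55),
    (14, "team-a", 20),
    (15, "team-b", 30),
    (2, "team-b", 40),
]


def mean_resolution_by(incidents, key):
    """Average minutes-to-resolve for each group produced by key(incident)."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[key(incident)].append(incident[2])
    return {group: mean(minutes) for group, minutes in groups.items()}


def shift(incident):
    # Assumed split: 08:00-19:59 counts as "day", everything else as "night".
    return "day" if 8 <= incident[0] < 20 else "night"


by_shift = mean_resolution_by(incidents, shift)
by_team = mean_resolution_by(incidents, lambda incident: incident[1])
print(by_shift)
print(by_team)
```

With only a handful of incidents, differences like these are hints to discuss in the postmortem rather than conclusions.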
So there it is. You want to have your team members share their tribal knowledge by chatting their actions. Those chats should be recorded alongside the alerts in timeline order. These timelines should be reviewed to give you feedback, insights and patterns during your incident postmortem. That information should guide your discussions on how you’ll lower downtime by getting to the resolution faster.