Ernest Mueller thinks a lot about velocity. As Director of Engineering Operations at AlienVault, he facilitates continuous delivery, speeds incident response, and builds systems that relieve developers of the burdens of infrastructure so they can focus on deploying applications as fast as possible.
The Challenge: Building an On-Call Process from Scratch
When AlienVault decided to build and support SaaS products, Ernest brought his expertise to the company to establish organization-wide practices for software velocity, incident response, and on-call support.
First, Ernest needed to get organizational buy-in for the on-call process. The AlienVault support team was well acquainted with the customer support role, but Ernest needed developers and other stakeholders to join the on-call rotation. He explains, “If the problem is not something routine that a support team can handle, then somebody close to the code should be available. That person is in a better position to fix the problem long term.”
When looking for a platform to handle incident management, Ernest decided to explore PagerDuty (which he had used in the past) and VictorOps. He says, “I was specifically impressed by VictorOps’ functionality around the entire incident management process. Both VictorOps and PagerDuty route alerts, but that’s table stakes. Being able to manage the rest of the process—people working an incident, helping them collaborate, holding constructive postmortems—that’s what I was looking for. VictorOps had more functionality in those areas.”
On-Call Rotations Are Complex and Multi-Tiered
AlienVault uses VictorOps to set up complex calendars and schedules for on-call engineers in the US and in Madrid, where working hours vary by season. “Being able to route alerts based on a complex set of rules is important for us,” says Ernest. “If something goes wrong, we want to route the alert to somebody who is already awake.”
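The idea of routing an alert to "somebody who is already awake" can be sketched in a few lines of Python. This is only an illustration of the concept, not how VictorOps implements scheduling: the `Engineer` class, the `route_alert` function, the 08:00 to 18:00 working hours, and the `America/Chicago` time zone for the US team are all assumptions made for the example. It relies on the standard-library `zoneinfo` module (Python 3.9+), which needs time zone data available on the system.

```python
from dataclasses import dataclass
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


@dataclass(frozen=True)
class Engineer:
    """Hypothetical on-call engineer with local working hours."""
    name: str
    tz: str          # IANA time zone name, e.g. "Europe/Madrid"
    day_start: time  # start of local working hours (assumed)
    day_end: time    # end of local working hours (assumed)


def is_awake(engineer: Engineer, now_utc: datetime) -> bool:
    """Return True if now_utc falls inside the engineer's local working hours."""
    local_time = now_utc.astimezone(ZoneInfo(engineer.tz)).time()
    return engineer.day_start <= local_time < engineer.day_end


def route_alert(rotation: list[Engineer], now_utc: datetime) -> Engineer:
    """Pick the first engineer in the rotation who is within working hours.

    If nobody is, fall back to the first engineer so the alert is never dropped.
    """
    for engineer in rotation:
        if is_awake(engineer, now_utc):
            return engineer
    return rotation[0]


# Illustrative rotation; names, zones, and hours are made up.
us = Engineer("us-oncall", "America/Chicago", time(8, 0), time(18, 0))
madrid = Engineer("madrid-oncall", "Europe/Madrid", time(8, 0), time(18, 0))
rotation = [us, madrid]

morning_in_madrid = datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)
print(route_alert(rotation, morning_in_madrid).name)
```

Seasonal working hours, such as those mentioned for Madrid, could be modeled in this sketch by swapping in a different `day_start` and `day_end` for part of the year.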
Each product team is part of a first-tier on-call rotation. Second-tier escalation policies include developers, operations, product, management, and support team members.
“Having this complex scheduling and escalation functionality is important,” says Ernest. “Every monitoring tool will send alerts, but the ability to define the means of communication and escalations is important for ensuring we get the experts we need to maintain high-quality service while also maintaining a high quality of life for our engineers.”
VictorOps Provides Context to Upper Management
Through VictorOps team functionality, AlienVault product managers and upper management can receive notifications even though they are not officially on call, which keeps them informed when something needs their attention. To notify customers and stakeholders, AlienVault uses the VictorOps/StatusPage integration to quickly let them know what’s happening during an incident.
Supporting ChatOps and Retrospectives
AlienVault uses Atlassian’s HipChat as its primary collaboration platform and the VictorOps/HipChat integration for incident response. Ernest says, “The ability to gateway the messaging from VictorOps and back is very valuable. Short of sending drones to watch people, chat provides the closest reflection of actual timeframes. Since the information from HipChat shows up in the VictorOps retrospective report, it’s a lot easier for us to get a meaningful timeline of what happened.”
Appreciating the Power of the Rules Engine
Ernest admits that every environment has its quirks, and he appreciates having the VictorOps rules engine (a.k.a. Transmogrifier) to apply custom logic to alerts with specific requirements. “Our processes are 90 percent predictable, but 10 percent of the time, they vary. The VictorOps Transmogrifier and webhooks give us that release valve, so we can build the customizations we need, as opposed to asking VictorOps to make any changes.”
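To make the idea of an alert rules engine concrete, here is a minimal Python sketch of the general pattern: match a field on an incoming alert against a wildcard pattern and, on a match, set or overwrite other fields. It is a generic illustration rather than the Transmogrifier's actual rule syntax or API; the `apply_rules` function, the rule dictionary layout, and field names such as `host` and `routing_key` are assumptions chosen for the example.

```python
from fnmatch import fnmatchcase


def apply_rules(alert: dict, rules: list[dict]) -> dict:
    """Return a copy of the alert with every matching rule applied in order.

    Each hypothetical rule has the shape:
      {"field": <alert field to test>,
       "pattern": <shell-style wildcard>,
       "set": {<field>: <new value>, ...}}
    Later rules see the changes made by earlier ones.
    """
    result = dict(alert)  # leave the caller's alert untouched
    for rule in rules:
        value = str(result.get(rule["field"], ""))
        if fnmatchcase(value, rule["pattern"]):
            result.update(rule["set"])
    return result


# Illustrative rules; the field names and values are made up.
rules = [
    {"field": "host", "pattern": "db-*", "set": {"routing_key": "database-team"}},
    {"field": "routing_key", "pattern": "database-team", "set": {"severity": "critical"}},
]

incoming = {"host": "db-07", "message": "disk full", "severity": "warning"}
print(apply_rules(incoming, rules))
```

Keeping rules as plain data means the unpredictable cases can be handled by adding a rule instead of changing the alerting code.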
It’s About Clear Communication
Ultimately, on-call is about people collaborating and staying in context. “One of the most dangerous things in incident response is splitting your communication channels,” says Ernest. “Anytime somebody says, ‘We’ll start an email thread on it,’ and it’s not tied into the originating channel, you’re asking for chaos. We’re big believers in a strict single communication channel, and we use VictorOps, integrated with HipChat, for that. You can initiate a call, but you would do it out of VictorOps to keep the communication thread intact.”