Bethany Ross Abbott was hired as the first Technical Operations Manager at NS1 to build out a Technical Operations Team and create a systematic incident management protocol. Before Bethany came on board, a small team of four DevOps engineers was responsible for keeping NS1’s platform up and running 24/7. Because the team was small and the company was a startup, each engineer took week-long on-call shifts on top of their normal DevOps work. During those shifts they received numerous pages that often kept them up all night.
“People really didn’t enjoy their on-call week. I’d hear people talk about the crazy hours they kept and how they didn’t feel like they had lives outside of work. I’ve witnessed burnout happen, and I didn’t want that to be the case at NS1,” said Bethany.
On top of the less-than-ideal on-call setup, Bethany said that most institutional knowledge of NS1’s infrastructure was stored inside people’s heads. As a result, the same person was often in charge of resolving the same incident over and over again, and onboarding new team members was a major undertaking.
“I walked into a DevOps team of four people who had everything in their head but had none of it documented. It was very unsustainable,” said Bethany.
Making On-Call Suck Less With VictorOps
To ensure NS1’s technology was reliable 24/7 and remove some burden from the DevOps team’s shoulders, Bethany set out to build an international Technical Operations team and, most importantly, find a way to make on-call suck less.
Bethany decided to use features such as the VictorOps on-call rotations, escalation policies, runbook links and integrations to create a more humane on-call experience for her team. “VictorOps is what notifies my team if there’s an issue. It’s what all of our pages come through, so everything feeds through VictorOps to let whoever’s on-call know what’s going on,” said Bethany.
Bethany’s first hires were two additional Ops professionals based in Vietnam, which has a 12-hour time difference from NS1’s New York headquarters. NS1 uses the “Follow the Sun” rotation model in VictorOps, which means no one needs to wake up in the middle of the night to answer a page. NS1’s TechOps team has grown to six people spread across Vietnam, New Hampshire, New York and Utah.
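The idea behind a “Follow the Sun” rotation can be illustrated with a minimal sketch. This is not VictorOps configuration or NS1’s actual schedule. The two sites, the fixed UTC offsets (which ignore daylight saving time) and the 08:00 to 20:00 local shift window are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed UTC offsets (a simplification that ignores daylight saving time).
SITES = {
    "Vietnam": timezone(timedelta(hours=7)),
    "New York": timezone(timedelta(hours=-5)),
}

# Assumed local shift window: 08:00 (inclusive) to 20:00 (exclusive).
SHIFT_START_HOUR = 8
SHIFT_END_HOUR = 20


def on_call_site(now_utc: datetime) -> str:
    """Return the site whose local time falls inside the daytime shift window.

    `now_utc` must be a timezone-aware datetime.
    """
    for site, tz in SITES.items():
        local_hour = now_utc.astimezone(tz).hour
        if SHIFT_START_HOUR <= local_hour < SHIFT_END_HOUR:
            return site
    raise ValueError("no site covers this hour; the shift windows leave a gap")
```

With a 12-hour difference between the two sites, the two 12-hour daytime windows cover the full day between them, so a page at any hour lands on someone who is awake.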
Next, Bethany sat down with all of the DevOps team members to get critical system knowledge out of their heads and into runbooks, which are stored in Confluence. Now, when a VictorOps alert comes in, the incident commander can quickly click into the incident detail pane, find the runbook link and either take steps to resolve the issue or escalate it to the appropriate team.
Bethany also took advantage of the VictorOps Slack integration, starting a dedicated channel for VictorOps alerts. When a VictorOps alert is triggered, the on-call team member can acknowledge, re-route or snooze it directly in Slack. Bethany said this has made it much easier to collaborate with customer support team members on incidents with a customer impact. And when a new shift starts, team members can easily get up to speed by reading through the Slack conversation, which is also captured in the VictorOps Timeline.
Bethany implemented a requirement that every on-call team member select at least two different notification methods through the VictorOps Personal Paging Policy to ensure accountability. Bethany said most people, herself included, choose three different notification methods to ensure they don’t miss any alerts.
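A rule like this is easy to audit. The sketch below is a hypothetical check, not part of VictorOps. The user names, the method labels and the shape of the `policies` mapping are assumptions for illustration.

```python
# Minimum number of distinct notification methods each on-call member must configure.
MIN_METHODS = 2


def members_below_minimum(policies: dict[str, list[str]]) -> list[str]:
    """Return, in sorted order, the members with fewer than MIN_METHODS distinct methods."""
    return sorted(
        member
        for member, methods in policies.items()
        if len(set(methods)) < MIN_METHODS  # duplicates count only once
    )


# Hypothetical example data: names and method labels are made up.
example_policies = {
    "alice": ["push", "sms", "phone"],
    "bob": ["push", "push"],
    "carol": ["sms", "email"],
}
```

Calling `members_below_minimum(example_policies)` flags only `bob`, whose two entries are the same method.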
“The Most Sustainable On-Call Schedule I’ve Ever Experienced”
Bethany’s work hiring team members in strategic time zones and utilizing VictorOps to create a humane incident management process paid off tremendously. “I’ve gotten a lot of feedback that we have one of the most sustainable on-call schedules people have ever experienced,” said Bethany.
She said that VictorOps has made it easy to disseminate system knowledge. Uptime no longer depends on the information held by one person — all on-call team members are empowered to resolve an incident thanks to context-rich alerts.
With features designed to account for the unpredictability of everyday life, VictorOps helps Bethany’s team go with the flow. For example, if something comes up during an on-call shift, team members can use the “Take On-Call” feature right from their phone to temporarily pass off their shift to someone else without needing to schedule an override.
After experiencing a series of devastating DDoS attacks at a previous DNS company, Bethany even decided to set the entire NS1 leadership team up with VictorOps licenses. Now, if a major incident occurs, everyone will be notified instantly.
Bethany says she relies heavily on VictorOps to keep NS1 up and running. “Thousands of customers rely on us to service their customers, so we need to know what’s going on in our infrastructure 24/7. Plus, whenever I’ve had questions or issues, the VictorOps support team has always replied to us very quickly. I feel like VictorOps appreciates us as a customer and listens to our needs.”
Enhance visibility and collaboration for on-call incident management as your team scales, while simultaneously reducing alert fatigue. Sign up for a 14-day free trial or request a personalized demo to see how VictorOps can make on-call suck less for DevOps and SRE teams.