How Workiva Built a Culture of DevOps and SRE

Dan Holloran August 03, 2018

Creating a DevOps environment of collaboration, code ownership, and accountability inherently helps teams build on SRE efforts. We spoke with Mike, an SRE Manager at Workiva, about how their culture has evolved over time to promote more of a DevOps-oriented SRE approach.

The History of SRE at Workiva

Four years ago, Mike started at Workiva in the Production Systems Report Group, which eventually became the SRE group. The SRE team at Workiva is responsible for “reliability, quality, and efficiency of production systems internally and externally.” When Mike started, Workiva was using a mishmash of homegrown solutions and Nagios for monitoring, alerting, and overall incident management.

But Mike felt there could be a more comprehensive way to receive alerts and collaborate on resolving incidents.

The Culture Shift to DevOps

Workiva recognized that it had a structure in which developers would build something and simply throw it over the wall to Ops teams. The team wanted a better way to record feedback for engineering operations as a whole and to become more collaborative.

So, Workiva merged infrastructure and Ops teams, and broke those teams into squads. These squads were then responsible for splitting their time between both software and operational engineering. This way, everyone gained exposure to both development and maintenance in order to better understand the way their system worked. This worked as a solid interim system while Mike and others built out the SRE team.

After nearly three months, the Workiva team has 13 people on the SRE team. The new collaborative team structure, in combination with a solid incident management plan, is meant to increase the reliability of the infrastructure and the quality of the products running on it. Now, DevOps professionals on the team can easily “hand the pager” back and forth and allow people to fix issues for their own products, speeding up incident resolution.

With a successful SRE team, it became easier to spread the new culture of shared responsibility and code ownership across disparate product teams as well.

DevOps, SRE, and Collaboration

As the IT and software development landscape changed, incident management, SRE, and deployment processes also changed. But through all of this, the importance of system reliability for customers didn’t change. If anything, customers expect service reliability more than ever. Continuous integration, both with and without third-party applications, combined with more consistent deployments, makes maintaining service availability more difficult.

Homegrown monitoring and alerting solutions take time away from development and have a harder time keeping up with interconnected applications and services than purpose-built incident management solutions.

Mike knew there had to be a better way to get actionable alerts in a more centralized location. So, Mike went out and began to assess out-of-the-box incident management and collaboration solutions. After evaluation, Mike ended up choosing VictorOps.

Naturally, we wanted to know why.

Incident Management Across Disparate Teams with VictorOps

As Workiva bought into the culture of DevOps and SRE, they needed to move quickly and improve visibility into development and operations. At first, only Mike’s team was using VictorOps, but over time, more product teams adopted it. This way, multiple product teams could remediate incidents, collaborate cross-functionally, and improve overall visibility.

At Workiva, different teams are using different features of VictorOps. But, every team is using the Statuspage integration and on-call scheduling functionality in some capacity. The IT department, infrastructure and reliability team, SRE team, R&D team, and software support engineering team are all using VictorOps.

Making On-Call Suck Less

Workiva’s SRE team has personnel in the US, Canada, and Holland. By using VictorOps for on-call scheduling, engineers get a break thanks to on-call handoffs based on time zone. On-call teams improve flexibility and incident response time with Transmogrifier, Manual Take On-Call, alert routing, and customizable escalation policies.

All in all, approximately 40-50 people from cross-functional teams within Workiva are using VictorOps to collaborate and remediate issues. During our conversation with Mike, we were incredibly happy to hear how VictorOps helped Workiva improve internal workflows and overall “quality of life.”

Sign up today for a 14-day free trial to improve your own quality of life and make on-call suck less.
