Building resilient systems takes a mindset dedicated to continuous improvement. SRE takes many different forms, but should always be focused on improving reliability and cross-functional collaboration.
In a previous post, we discussed the evolution of our SRE council, how it works for our team, and how it helps us build more reliable systems. The council also serves to improve collaboration and visibility into SRE across the entire organization. So, I spoke with our own Jonathan Schwietert, SRE Team Lead, for more information on how to hold SRE council meetings and the council’s overall importance to our entire team.
The SRE Council’s Mission: “Provide an avenue to direct the hunger that VictorOps has for reliability.”
The SRE council is a group of individuals from different teams (QA, mobile, web-client, middle tier, platform, etc.) who surface topics regarding overall system reliability pertinent to their specific engineering disciplines. The SRE council works to provide a unified vision for product reliability across every team at VictorOps.
By surfacing concerns across functions and discussing areas for improvement, SRE council members can work with their respective teams to build reliability into the corresponding areas, which benefits system-wide reliability. The SRE council’s cross-team collaboration reinforces our DevOps culture while building workflows to address reliability in product development.
Jonathan mentioned that the primary goal of SRE council meetings is simply to maintain the forward momentum for SRE. Because we don’t have a dedicated SRE team at VictorOps, the meetings are a necessary way to check in and make sure the organization continues to make SRE progress. Through these meetings and the action items taken from them, the SRE council bakes reliability into the development process over time.
Another important goal of the SRE council meetings is to ensure SRE input for post-incident reviews. Through these reviews, the SRE council can ensure we’re not missing anything in our monitoring and alerting efforts. The SRE council meetings can expose areas for improvement and help us walk away with action items for improving reliability in the system.
The council started off with set members. But we’ve found that rotating members in and out of the SRE council exposes more people to SRE within our system, and more people continue to buy in to SRE efforts. Rotating members keeps SRE values fresh, adds new perspectives, and builds momentum toward company-wide buy-in to SRE.
Right now, the SRE council meetings are 45 minutes each week. The agenda is set around reviewing any recent post-incident reviews, current SRE-focused PI objectives, and any other monitoring, alerting, or general reliability concerns. However, we’re constantly reassessing the efficacy of meeting agendas and cadences to optimize our time and efforts.
The idea of the SRE council is fairly unique, based on the reading Jonathan did and the conversations he had. After learning from SREs at Netflix, Twitter, GitHub, and Craftsy, we were able to take some of their principles and apply them to our team.
After all of Jonathan’s research, our approach ended up taking the form of the SRE council. Our SRE efforts are structured, but we don’t have a fully dedicated SRE team. The SRE council members support reliability in our product through their specific teams without slowing down the speed of development.
At first, the SRE council meetings were very open-ended and collaborative. Discussions between everyone on the council were necessary for determining priorities and initiatives. For instance, conversations needed to take place about whether SRE should handle certain issues or whether they fell in the realm of QA. But after a while, the council meetings needed more structure and direction. So, Jonathan started to focus the agenda and prioritize topics in order to make SRE council meetings more worthwhile.
Jonathan remains highly dedicated to building the momentum toward integrated SRE. As the company grows, our SRE efforts grow with it. Spending more time collecting data, rotating council members, having individual conversations around SRE, and continuously reassessing the agenda all continue to empower SRE at VictorOps.
A passion to continuously experiment, learn, and improve is core to any SRE effort. For us, the SRE council meetings improve collaboration, give visibility to cross-functional teams, and improve reliability without slowing down development. An SRE structure such as ours may not be best for your specific organization, but we hope you can take some tips from this article when working to organize your own SRE efforts.
Download our free eBook, Build the Resilient Future Faster: Creating a Culture of Reliability, to read the full story of how we foster reliability and collaboration through a culture of DevOps and SRE.