Tammy Butow Chats SRE, Chaos Engineering and How to Train On-Call Teams
In episode 6 of Ship Happens, Tammy Butow joins Benton to talk all about SRE, availability and training on-call teams. Tammy is a Principal Site Reliability Engineer and executive team member at Gremlin, a platform dedicated entirely to chaos engineering. Tammy is an industry leader in SRE and chaos engineering and brings loads of valuable experience from the National Australia Bank, DigitalOcean, Dropbox and Gremlin. Hear about Tammy’s history with skateboarding and how she applies the lessons it taught her to SRE and software development.
Full show notes:
Maintaining availability and velocity in a highly-regulated environment
Tammy talks about what it takes to maintain available, secure services in a highly-regulated environment. See how teams think about their delivery pipelines and services when applications and infrastructure need to adhere to strict Australian governmental regulations.
Tammy’s path to SRE and the organizational value of SRE
Through personal experiences, Tammy discovers the value of SRE in a very real way. Tammy talks about why this piqued her interest in site reliability engineering and how she made the move from a full-stack engineer to an SRE. She then elaborates on her journey into SRE and talks about how managers and engineers can get organizational buy-in for SRE and show the value of it over time.
How to train on-call teams for incident response
Incident management, real-time response and on-call efficiency are important to Tammy, and should be for SREs everywhere. Tammy covers actionable tips for training on-call teams and giving on-call responders the tools and resources they need to make on-call suck less. Tammy also dives into her skateboarding experience and how some of the things she learned while skateboarding have made her and her teams better.
Chaos engineering and real applications of it
Tammy discusses chaos engineering: the intentional injection of failure into your systems so you can learn from it and make your systems more resilient over time. Through tabletop exercises, gamedays, on-call training and post-incident reviews, Tammy shows how teams can improve both people operations and technical operations around incident response.