Software at Scale 28 - Tammy Butow: Principal SRE, Gremlin

Software at Scale


3y ago 58:17



Tammy Butow is a Principal SRE at Gremlin, an enterprise Chaos Engineering platform that makes it easy to build more reliable applications in order to prevent outages, innovate faster, and earn customer trust. She’s also the co-founder of Girl Geek Academy, an organization that encourages women to learn technology skills. She previously held IC and management roles in SRE at Dropbox and DigitalOcean.


In this episode, we talk about reliability engineering and Chaos Engineering. We discuss the growing trend of outages across the internet and their underlying causes. We explore common themes in outages, like marketing events and a lack of budgets or planning, the impact of such outages on businesses like online retailers, and how tools and methodologies from Chaos Engineering and SRE can help.

Highlights

01:00 - Starting as the seventh employee at Gremlin

04:00 - An analysis of recent outages and their root causes

09:00 - A mindset shift on software reliability

14:00 - If you’re suddenly in charge of the reliability of thousands of MySQL databases, what do you do? How do you measure your own success?

25:00 - Why is it important to know exactly how many nodes your service requires to run reliably?

30:00 - What attracts customers to Chaos Engineering? Do prospects get concerned when they hear “chaos” or “failure as a service”?

43:00 - Regression testing failure in CI/CD

51:00 - Trends of interest in Chaos Engineering over time

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
