Best Stephen Townshend Podcasts (2024)

Slight Reliability Episode 89 - Blameless Post-mortems with Karanveer Anand 26:06

3h ago

This week I'm joined by Karanveer Anand, SRE Technical Program Manager at Google to discuss blameless post-mortems. We cover: 🦅 The recent Crowdstrike outage and their public post-mortem 🚑 When do we do a blameless post-mortem? 😕 How do we do a blameless post-mortem? ✅ How do we make sure action items are followed through? 📰 The power of learning f…

Slight Reliability Episode 88 - OpenTelemetry Revisited with Zach Michel 26:51

10d ago

This week Zach Michel from https://middleware.io/ and I discuss the state of OpenTelemetry and what it means to adopt it. We cover: 🌩️ Achieving observability in a SaaS world 🥫 Context propagation - the magic sauce of OTEL 🚪 The telemetry gateway concept and leveraging the OTEL collector 🪵 The state of OpenTelemetry logging 🫂 Making use of the Open…

Slight Reliability Episode 87 - Measuring the value of SRE with Artem Yakimenko 35:33

1M ago

In Episode 80 Niall Murphy talked about the need for SREs to be better at articulating the value of our work. In this episode I'm joined by ex-Googler and Engineering Director (SRE) at Culture Amp Artem Yakimenko to explore how we might achieve this. We discuss both quantifiable and qualitative approaches including leveraging the untapped data in suppor…

Slight Reliability Episode 86 - Evolving SLOs with Dom Finn 25:57

3M ago

In the world of SRE we constantly talk about defining SLOs, but what about evolving them over time? This week I chat with SRE Tech Lead Dom Finn about just that. We cover the relationship between reliability and user analytics, latency classes as a way to speak SLOs with business stakeholders, the role of NFRs and how the thresholds differ from SLO…

Slight Reliability Episode 85 - Feeling SaaSsy 11:08

4M ago

This week I talk about the impact of SaaS-first technology strategies on the work of an SRE. I pose questions about observability, ownership, on-call, and how much control we have over reliability. You can find the Bleeding Tech blog on Medium: https://medium.com/@stownshend You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentown…

Slight Reliability Episode 84 - Clinical Troubleshooting with Dan Slimmon 27:40

5M ago

This week I chat with Dan Slimmon about applying the approach doctors use to treat patient symptoms during incident response. You can find Dan's blog at https://blog.danslimmon.com/ or connect with him on LinkedIn here: https://www.linkedin.com/in/danslimmon/ You can find the official Slight Reliability podcast website at: https://slightreliability…

Slight Reliability Episode 83 - An Unfulfilled Promise with Itiel Shwartz 30:32

6M ago

This week I hear about all things Kubernetes from Komodor CTO and co-founder Itiel Shwartz. We chat about the promise that was made when Kubernetes first entered the industry, the challenge of getting developers engaged and capable of working in Kubernetes, my hate/hate relationship with Helm but its important contribution to the Kubernetes project…

Slight Reliability Episode 82 - CI/CD with Amin Astaneh 25:47

7M ago

This week I sit down and have a discussion with Amin Astaneh (from Certo Modo) about CI/CD. We cover the power of the standard change as a way to navigate ITIL while still implementing DevOps practices, what to monitor to make your CI/CD observable, single piece flow, testing in production, and so much more. You can find Amin on his company website…

Slight Reliability Episode 81 - Incident Management in Non-Prod Environments 10:09

7M ago

"Environment issues are just incidents that happened to occur in a non-production environment"... so why do we treat them so differently? In this first episode of the 2024 season I reflect on how we handle incidents in non-prod environments. (Note: Had a few issues with noise suppression in OBS Studio cutting off the start of some words, will sort …

Slight Reliability Episode 80 - What's Been Bugging Niall Murphy 36:45

10M ago

This week I speak with co-author of the original SRE book + the SRE workbook, and renowned speaker Niall Murphy. We chat about the state of SRE in the current macro-economic climate and how we're not yet doing a very good job at articulating the value of SRE to leaders, the relationship that velocity and reliability have, the value of new features …

Slight Reliability Episode 76 - Sampling Distributed Traces with Paige Cruz 45:27

10M ago

Paige Cruz (from Chronosphere) is back. This week we discuss sampling. What is sampling? Why do it? What kinds of sampling are there? You can check out Chronosphere's cloud native observability platform here: https://chronosphere.io/ You can find Paige on: LinkedIn: https://www.linkedin.com/in/paigerduty/ X: https://twitter.com/paigerduty You can f…

Slight Reliability Episode 79 - Incident Story Time with Valeska Victoria 37:51

10M ago

This week Valeska Victoria returns to share some of her experiences working as an SRE at eBay. We look at the cascading effect of production issues in complex integrated environments (how there's often no single root cause), developer literacy of how infrastructure works, the importance of ownership and accountability of reliability, and much more.…

Slight Reliability Episode 78 - Developer Experience with Ankit Jain 32:21

10M ago

This week I chat with Ankit Jain from aviator.co about developer experience. We define developer experience and developer productivity, and how this applies to SRE. We discuss the growing expectation on developers and how this leads to frustration and burnout. We also explore how to measure developer experience and how to start working to make impr…

December 2023 Update 5:07

10M ago

A brief mid-week update on my changing circumstances and the future of the podcast. By Stephen Townshend

Slight Reliability Episode 77 - SRE to DevRel with Liz Fong-Jones 31:53

10M ago

This week I had the privilege of interviewing Liz Fong-Jones from honeycomb.io about DevRel, Developer Advocacy, and how that applies to SRE. We discuss the difference between Developer Relations (DevRel) and Developer Advocacy, how Liz got into advocacy, how DevRel helps companies and the community, and some tips on how to get traction with SRE pr…

Slight Reliability Episode 75 - Enterprise SRE with Steve McGhee 39:00

10M ago

This week I had the honour of chatting with Steve McGhee (former Google SRE, current Google Reliability Advocate, and co-author of Enterprise Roadmap to SRE). We discuss the evolution of SRE from where it began at Google and how it is being adopted by enterprises around the world now (and why this is happening). We talk about getting leadership sup…

Slight Reliability Episode 74 - The Hidden Side of Vendor Lock-In 8:55

10M ago

This week on Slight Reliability Stephen discusses observability vendor lock-in. What is it? What does OpenTelemetry do to help? What areas are yet to be solved? You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: htt…

Slight Reliability Episode 73 - Enterprise SLOs with Brian Singer 32:18

11M ago

This week we sit down and talk about SLOs with CPO and co-founder of Nobl9 Brian Singer. We talk about the importance of reviewing operational effectiveness, getting buy in from leadership, using SLOs to reduce noise, how to implement SLOs within different cultures and structures, the parallels between security and reliability... and much more. You…

Slight Reliability Episode 72 - Rapid Incident Response with Valeska Victoria 42:19

11M ago

This week Stephen chats with Valeska Victoria about her time working as an SRE at eBay. Valeska shares her data driven approach to SRE, having a voice as a less experienced engineer, handling incidents under high pressure, leveraging large language models to rapidly find the information you need during an incident, and much more. You can check out …

Slight Reliability Episode 71 - Implementing SRE with Dr. Vlad Ukis 29:25

11M ago

This week Stephen chats with Dr. Vlad Ukis about his journey discovering, and then implementing SRE practices at Siemens Healthineers (which led to him writing a book). They discuss how the evolution of infrastructure necessitates a shift in how we operate, the power of selling SRE practices, the SRE infrastructure used to build SLOs and reliabilit…

Slight Reliability Episode 70 - Meta SRE with Amin Astaneh 42:24

11M ago

Amin Astaneh (from Certo Modo) is back to discuss his experience working as a production engineer (SRE equivalent) at Meta. Stephen and Amin discuss what it's like interviewing for big tech, "you build it, you own it", different SRE engagement models, SRE at different sizes of organisation, socialising your SRE success as a way to get traction, and…

Slight Reliability Episode 69 - Developer to SRE with Praveen Kasam 30:10

12M ago

This week Stephen talks to Praveen Kasam from Diconium Digital Solutions about how he led SRE transformations. Praveen shares his experience transitioning from development to SRE and how leveraging automation and bringing application knowledge to the ops team provided quick wins. He also covers how he later applied SRE concepts to uplift the wider …

Slight Reliability Episode 68 - Dashboards and Modern Observability with Eric Schabell 32:31

12M ago

This week Stephen asks Eric Schabell (Director of Technical Marketing & Evangelism @ Chronosphere) about how dashboards fit into modern observability. They discuss how untamed observability can lead to unexpectedly high cloud bills, the similarities between dashboards and documentation, the "know > triage > understand" workflow, and much more. You …

Slight Reliability Episode 67 - Single Pane of Glass with Jamie Allen and Adam Kinniburgh 34:36

12M ago

This week Stephen chats with Jamie Allen (Chief Technologist AWS & SRE @ EPAM Systems) and Adam Kinniburgh (VP Innovation @ SquaredUp) about the concept of a single pane of glass (SPOG) for SRE. Is it performance art or something actionable? Can alerting replace the need for dashboards? And are metrics drowning in the wake of distributed tracing? Y…

Slight Reliability Episode 66 - Building Digital Assistants for SRE with Kyle Forster 29:51

1y ago

This week Stephen brings back Kyle Forster from RunWhen to talk about the purple elephant in the room… “AI”. What makes it GenAI, LLM, Advanced Statistics, or ML? Kyle shares his experience surrounding building AI powered search engines for SRE troubleshooting commands and how to incorporate a (paid) open source community of experts rather than tru…

Slight Reliability Episode 65 - The Truth About Incidents with Courtney Nash 41:04

1y ago

This week Stephen chats with the internet incident librarian herself, Courtney Nash. They explore what Courtney has learned through meta-analysis of the over ten thousand incidents in the Verica Open Incident Database (VOID). They cover why MTTR needs to go in the garbage, joint cognitive systems, the value of looking at near misses and *much* mor…

Slight Reliability Episode 64 - Observability During Development with Martin Thwaites 36:18

1y ago

This week Stephen chats with Martin Thwaites from Honeycomb about how developers can leverage observability to understand what they're building better, solve bugs quicker, and have more time for coding. They also discuss OpenTelemetry (the protocol and semantic conventions), manual versus automatic instrumentation, and how keeping every span of tra…

Slight Reliability Episode 63 - The Power of Summary 9:20

1y ago

Observability is a necessary adaptation to make sense of software systems in the Digital Age, but how can we unlock its power for non-engineer stakeholders (such as executives, product owners, etc)? Perhaps we need a layer of abstraction sitting on top of our detailed observability to get the most out of it. You can find the official Slight Reliabi…

Slight Reliability Episode 62 - On-Call with Matt Brown 36:57

1y ago

This week Stephen chats with former-Google SRE Matt Brown about being on-call. They cover how to up-lift junior engineers so they can be on-call, what a fair on-call schedule looks like, run-books, and much more. As you heard, Matt believes flexibility is key to a healthy on-call rotation. Matt is exploring ideas for improvements to existing toolin…

Slight Reliability Episode 61 - SRE VS DevOps VS Platform Eng... (Yawn) 6:07

1y ago

The internet is full of people who want to tell you about SRE, DevOps, and Platform Engineering and how different and similar they are... and will give you the impression that these things compete with each other. But do they? And is it a helpful question to ask in the first place? You can find the official Slight Reliability podcast website at: ht…

Slight Reliability Episode 60 - From Zero to SRE with Amin Astaneh 42:46

1y ago

In this episode Amin Astaneh from Certo Modo discusses his experience undertaking an SRE transformation over several years. Stephen and Amin cover a lot of ground including making ops work visible, measuring toil, the power of calculating the $ value of work, getting developers on-call, the embedded model for SRE, SLOs, culture change, and a whole …

Slight Reliability Episode 59 - Bad API Observability with Sonja Chevre 40:23

1y ago

In this episode Stephen Townshend and Sonja Chevre from Tyk discuss making APIs observable, and some anti-patterns to avoid. They cover GraphQL, OpenTelemetry and semantic conventions, correlation IDs, observability pipelines, and much more. You can find Sonja on LinkedIn: https://www.linkedin.com/in/sonjachevre/ and Twitter: https://twitter.com/So…

Slight Reliability Episode 58 - Tackling Cloud Cost with Harinder Seera 36:54

1y ago

In this episode Stephen Townshend and Harinder Seera explore how to monitor and manage the cost of cloud. They discuss FinOps as a cultural practice, anti-patterns for implementing in the cloud, keeping cost down through resources, pricing, and architecture... and much more. You can find Harinder on LinkedIn: https://www.linkedin.com/in/harindersee…

Slight Reliability Episode 57 - A Tale of Three Conferences 16:10

1y ago

In this episode Stephen shares his experiences traveling overseas to the UK and Singapore AWS Summit, SREcon APAC, and the internal SquaredUp conference "SqUpCon". You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Slight Reliability artwork on Instagram: https://www.instagram.com/slight_rel…

June 9th 2023 Update 1:41

1y ago

A quick update on Stephen's whereabouts and when the next episode will be released. By Stephen Townshend

Slight Reliability Episode 56 - Dashbored 14:06

1+ y ago

In this episode Stephen discusses the role of dashboards within the context of the Digital Era. What are they *not* appropriate for? What can they help with? What kinds of things are suitable to present? If you want to get involved in the SquaredUp dashboard competition head along to: https://squaredup.com/blog/dashboard-competition/ (everyone who …

Slight Reliability Episode 55 - Reflections on KubeCon with Bruce Cullen 40:21

1+ y ago

This week Bruce Cullen is back to share his experiences from KubeCon + CloudNativeCon 2023 Europe. We chat about OpenTelemetry, green engineering, securing your CI/CD pipeline and much more. Bruce is the Director of Engineering at SquaredUp. You can find him on LinkedIn: https://www.linkedin.com/in/bruce-cullen/ You can find the official Slight Rel…

Slight Reliability Episode 54 - Trends in Incident Management with Andy Thurai 32:26

1+ y ago

In this episode Stephen Townshend chats to Andy Thurai (VP and Principal Analyst at Constellation Research) about Andy's latest report titled "Trends in Incident Management 2023". They chat about "mean time to innocence", status pages, they debate whether AI or ML has real value for incident management, and ponder why anyone would willingly decide …

Slight Reliability Episode 53 - DORA Metrics with Tim Wheeler 28:03

1+ y ago

In this episode Stephen Townshend chats to Tim Wheeler (Director of Engineering Services at SquaredUp) about his work implementing and continually monitoring DORA metrics. They chat about customising each metric to your own unique context, avoiding the weaponisation of metrics, the "tools will solve this for me" trap, and much more. The books mentione…

Slight Reliability Episode 52 - Double, Double, Toil and Trouble! 9:12

1+ y ago

In this episode Stephen explores the SRE concept of "toil". What is it? How can we measure it? How do we reduce it? Also in this episode: Can we make non-technology systems observable? (like we do technology ones), and the ineffectiveness of change advisory boards (CAB). Also, Stephen's upcoming attendance at SREcon, AWS Summit, and SLOconf. Shout …

Slight Reliability Episode 51 - The reliability.org Community with Anurag Gupta 30:02

1+ y ago

In this episode Stephen Townshend and Anurag Gupta discuss the new reliability.org community for SREs or reliability engineers to share experiences, ask questions, and find community. They discuss the value of community and sharing your thoughts, collaboration between organisations, vicious versus virtuous cycles for reliability, and much more. You…

Slight Reliability Episode 50 - The 50th Episode Special with Bruce Cullen 39:10

1+ y ago

In this episode Bruce Cullen interviews Stephen Townshend about the past, present, and future of the Slight Reliability podcast. They discuss their shared backgrounds in software testing, the different career paths that testing has opened up, and much more! Bruce is the Director of Engineering at SquaredUp. You can find him on LinkedIn: https://www…

Slight Reliability Episode 49 - Implementing Observability in the Real World with Ivan Merrill 38:34

1+ y ago

In this episode Ivan Merrill from Fiberplane shares his experiences implementing observability within some of the large complex organisations he's worked for in the past. You can find Ivan on LinkedIn: https://www.linkedin.com/in/ivan-merrill-1a05223/ You can find out more about Fiberplane here: https://fiberplane.com/ You can find the official Sli…

Slight Reliability Episode 48 - Blind Insight 8:01

1+ y ago

In this episode I discuss the word "insight" within the context of observability. Is insight something tools can provide? Is it something you can reproduce? You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https:/…

Slight Reliability Episode 47 - Cloud Dependency Reliability with Jeff Martens and Ryan Duffield 32:45

1+ y ago

In this episode Stephen Townshend discusses our increased dependency on third party cloud services and what this means for reliability with Jeff Martens and Ryan Duffield from https://metrist.io/. You can find Jeff... On LinkedIn: https://www.linkedin.com/in/jmartens/ On Twitter: https://twitter.com/Jmartens You can find Ryan... On StackOverflow: h…

Slight Reliability Episode 46 - Raw Telemetry 10:03

1+ y ago

In this episode I propose the use of scatterplots of raw data to better understand how our systems are behaving and what our customers are experiencing. The ideas from this episode come from my time as a performance engineer and working with legends in that space Richard Leeke (https://www.linkedin.com/in/richard-leeke-450448/) and Neil Davies (ht…

Slight Reliability Episode 45 - Telemetry Fluency with Paige Cruz 48:37

1+ y ago

In this episode we discuss uplifting telemetry knowledge within engineering teams to enrich their work (and their lives) with Paige Cruz from Chronosphere. We cover why not to take a chainsaw to your observability in order to cut costs, the dark side of auto-instrumentation, story telling with live data, and much more. The book that Paige recommend…

Slight Reliability Episode 44 - Cognitive Overload with Paige Cruz 38:50

1+ y ago

In this episode we discuss cognitive overload in SRE with Paige Cruz from Chronosphere. We cover what cognitive load is, what causes it, and some potential antidotes and preventative measures. You can check out Chronosphere here: https://chronosphere.io/ You can find Paige on LinkedIn: https://www.linkedin.com/in/paigerduty/ You can fin…

Slight Reliability Episode 43 - Beyond Observability 10:14

1+ y ago

In this episode I discuss my "bigger picture" perspective of what observability needs to be, and why it's important we include business and customer into what we monitor in the Digital Era. The books I highlight in this episode are... Observability Engineering https://www.oreilly.com/library/view/observability-engineering/9781492076438/ Sooner, Saf…

Slight Reliability Episode 42 - Reliability Insights with José Velez 36:34

1+ y ago

In this episode we speak to José Velez from Rely about reliability at scale, a top down approach to SLOs, the potential and limitations of AI and ML in operations, the question of service ownership, utilising the business criticality of services in how we monitor the underlying infrastructure, and much more. You can check out Rely at https://www.re…
