The $100M Problem: How Lyft's Data Platform Prevents ML Failures with Ritesh Varyani at Lyft – The Data Engineering Show podcast

Content provided by The Data Bros and The Firebolt Data Bros. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Data Bros and The Firebolt Data Bros or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.


Episode length: 25:46


In this episode of the Data Engineering Show, host Benjamin Wagner sits down with Ritesh Varyani, Staff Software Engineer at Lyft, to explore how the company manages a sophisticated multi-engine data stack serving thousands of engineers, while simultaneously integrating AI across infrastructure and user-facing analytics.

What You'll Learn:

How to architect a polyglot data platform that serves fundamentally different workloads (Spark for ML training and massively parallel processing, Trino for dashboarding and medium-scale ETL, and ClickHouse for sub-second OLAP queries) without creating operational chaos
Why unification matters more than expansion: Lyft's 2026 strategy prioritizes consolidating and simplifying the data stack rather than adding new tools, reducing maintenance burden and improving reliability for end users
The dual-layer AI strategy that enhances user analytics (semantic layer v2 with AI-native support) while automating platform operations (intelligent job failure diagnosis, adaptive resource allocation, and agentic workflow optimization)
How to fund innovation from the bottom up: Lyft's model encourages individual engineers to experiment with AI on their own time, prove business value through proofs of concept (POCs), and secure leadership buy-in by demonstrating alignment with company strategy
Why vendor selection now includes AI explainability and debuggability as standard RFP requirements, even when AI isn't the primary driver of a purchasing decision
The framework for choosing between open-source investment and managed services: prioritize business-critical goals first, then determine whether in-house ownership or a vendor solution accelerates that mission; AI becomes the accelerant, not the decision driver

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts.

About the Guest(s)

Ritesh is a Staff Software Engineer at Lyft, where he has spent six years architecting and scaling the company's data platform. Before Lyft, he worked on Microsoft's data and cloud infrastructure, including Hadoop, Azure, and SaaS products. At Lyft, Ritesh leads critical data systems including Trino, Spark, and ClickHouse. In this episode, he shares insights on building scalable, AI-native data platforms that serve diverse organizational needs, from batch processing and analytics to real-time marketplace operations. His strategic approach to unifying complex data stacks while integrating AI-driven reliability and user experience improvements offers actionable guidance for data engineers and platform leaders navigating infrastructure modernization at scale.

Quotes

"The goal of our platform is to give our users access to the data as fast as possible so that they can drive the meaning from the data that they are getting and take better data driven decisions." - Ritesh

"We are a Hive format shop. We are going to be moving to other open table formats in the future, but at this point, we are a Hive table format." - Ritesh

"Our main goal at this point is primarily understanding how we see the data platform running five years from now, three years from now, and how we are able to future proof it." - Ritesh

"In this world of AI, we should not be falling behind in any way, and bringing AI in the right places within our platform." - Ritesh

"We want to make our semantic layer ready for the AI native side of things so that our teams are able to drive the best meaning possible from the data that they see." - Ritesh

"Big data systems are distributed systems by nature, and where AI can help you is very clearly understand how the patterns are changing and what is a good action to take." - Ritesh

"Rather than thinking of this as an AI versus an open source thing, it's about a question of what work is the most business critical and how do you go 100% behind it." - Ritesh

"Not everybody is working on AI initiatives at this point, but where it makes sense according to our business strategy, if it aligns with it, then obviously we go and invest." - Ritesh

"If you are the one who's going to take on the initiative, probably spend a few hours outside of what you're already working on, and that is how you will discover AI and the tooling for it." - Ritesh

"We are trying to consolidate into a single direction of providing different kinds of models so that you are easily able to integrate and focus on the value you want to provide to your customers." - Ritesh

Resources

Websites:

Lyft

Tools & Platforms:

Apache Spark – Batch processing engine for ML training jobs, large-scale data processing, and GDPR operations
Trino – Query engine for BI dashboarding, ETL workflows, and SQL-based data access
ClickHouse – Columnar database for sub-second query latency and real-time analytics
Amazon S3 – Data lake storage for Parquet tables and offline data processing
AWS EKS (Elastic Kubernetes Service) – Kubernetes infrastructure for hosting Spark and Trino
ClickHouse Cloud – Managed ClickHouse offering used by Lyft
Hive Table Format – Current table format for organizing Parquet files in S3
Kubernetes Operators – Infrastructure for managing ClickHouse deployments

The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at fame.so.
Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of Fundamentals of Data Engineering, Zach Wilson of EcZachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb, and Xiaoxu Gao of Adyen.

