Next-Gen Data Modeling, Integrity, and Governance with YODA
In this episode, Kris interviews Doron Porat, Director of Infrastructure at Yotpo, and Liran Yogev, Director of Engineering at ZipRecruiter (formerly at Yotpo), about their experiences and strategies in dealing with data modeling at scale.
Yotpo has a vast and active data lake, comprising thousands of datasets that are processed by different engines, primarily Apache Spark™. They wanted to provide users with self-service tools for generating and utilizing data with maximum flexibility, but encountered difficulties, including poor standardization, low data reusability, limited data lineage, and unreliable datasets.
The team realized that Yotpo's modeling layer, which defines the structure and relationships of the data, needed to be separated from the execution layer, which defines and processes operations on the data.
This separation would give programmers better visibility into data pipelines across all execution engines, storage methods, and formats, as well as more governance control for exploration and automation.
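The separation described above can be sketched in a few lines of Python. This is an illustration only, not YODA's actual code or API: the names `Column`, `DatasetModel`, `lineage`, and `to_create_table_sql`, and the sample datasets, are all hypothetical. The point it tries to show is that once structure and relationships live in their own layer, lineage can be answered without touching any execution engine, and each engine only needs a thin adapter over the same model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str  # engine-neutral type name, e.g. "string" or "bigint"


@dataclass(frozen=True)
class DatasetModel:
    """Modeling layer (hypothetical): structure and relationships, no execution logic."""
    name: str
    columns: tuple
    upstream: tuple = ()  # names of the datasets this one is derived from


def lineage(models, name):
    """Collect all upstream dataset names using only the models."""
    by_name = {m.name: m for m in models}
    seen, stack = [], list(by_name[name].upstream)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            if current in by_name:
                stack.extend(by_name[current].upstream)
    return sorted(seen)


def to_create_table_sql(model):
    """Execution-side adapter (hypothetical): render a model as a generic CREATE TABLE statement."""
    cols = ", ".join(f"{c.name} {c.dtype.upper()}" for c in model.columns)
    return f"CREATE TABLE {model.name} ({cols})"


# Sample datasets, invented for this sketch.
orders = DatasetModel(
    "orders",
    (Column("order_id", "bigint"), Column("store_id", "bigint")),
)
reviews = DatasetModel(
    "reviews",
    (Column("review_id", "bigint"), Column("order_id", "bigint")),
    upstream=("orders",),
)
store_stats = DatasetModel(
    "store_stats",
    (Column("store_id", "bigint"), Column("review_count", "bigint")),
    upstream=("orders", "reviews"),
)
MODELS = [orders, reviews, store_stats]

print(lineage(MODELS, "store_stats"))
print(to_create_table_sql(orders))
```

A second adapter for another engine or storage format would read the same `DatasetModel` objects, which is the kind of cross-engine visibility the episode description refers to.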
To address these issues, they developed YODA, an internal tool that combines an excellent developer experience, dbt, Databricks, Airflow, Looker, and more with a strong CI/CD and orchestration layer.
Yotpo is a B2B SaaS e-commerce marketing platform that provides businesses with the tools they need for accurate customer analytics, remarketing, support messaging, and more.
ZipRecruiter is a job site that utilizes AI matching to help businesses find the right candidates for their open roles.
EPISODE LINKS
- Current 2022 Talk: Next Gen Data Modeling in the Open Data Platform
- Data Mesh 101
- Data Mesh Architecture: A Modern Distributed Data Model
- Watch the video version of this podcast
- Kris Jenkins’ Twitter
- Streaming Audio Playlist
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
- Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
Chapters
1. Intro (00:00:00)
2. What is Yotpo? (00:02:29)
3. Building an ETL framework based on Spark (00:05:25)
4. What is Apache Spark? (00:10:18)
5. Decoupling the data model (00:15:40)
6. Using data mesh principles (00:18:51)
7. How to address different data personas (00:22:24)
8. What is the "shift left" movement? (00:26:35)
9. How can organizations change the way they treat their data? (00:28:47)
10. Use-cases for tooling and documenting data sets (00:31:01)
11. Schema vs. schema-less (00:32:07)
12. What is YODA? (00:40:07)
13. Takeaways from the conversation with Doron and Liran (00:48:35)
14. It's a wrap! (00:52:45)