This show goes behind the scenes to explore the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics covered here.
The podcast about Python and the people who make it great
Barking Up The Wrong GPTree: Building Better AI With A Cognitive Approach
54:16
Summary Artificial intelligence has dominated the headlines for several months due to the successes of large language models. This has prompted numerous debates about the possibility of, and timeline for, artificial general intelligence (AGI). Peter Voss has dedicated decades of his life to the pursuit of truly intelligent software through the appr…
Build Your Second Brain One Piece At A Time
50:10
Summary Generative AI promises to accelerate the productivity of human collaborators. Currently the primary way of working with these tools is through a conversational prompt, which is often cumbersome and unwieldy. In order to simplify the integration of AI capabilities into developer workflows Tsavo Knott helped create Pieces, a powerful collecti…
Summary Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his …
Designing A Non-Relational Database Engine
1:16:01
Summary Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing …
Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer
56:23
Summary Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological sol…
Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary
50:44
Summary Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technolo…
Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+
55:39
Summary A core differentiator of Dagster in the ecosystem of data orchestration is their focus on software defined assets as a means of building declarative workflows. With their launch of Dagster+ as the redesigned commercial companion to the open source project they are investing in that capability with a suite of new features. In this episode Pe…
Reconciling The Data In Your Databases With Datafold
58:14
Summary A significant portion of data workflows involve storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold,…
Version Your Data Lakehouse Like Your Software With Nessie
40:55
Summary Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond…
Summary Artificial intelligence technologies promise to revolutionize business and produce new sources of value. In order to make those promises a reality there is a substantial amount of strategy and investment required. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about h…
Find Out About The Technology Behind The Latest PFAD In Analytical Database Development
56:00
Summary Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow …
Using Trino And Iceberg As The Foundation Of Your Data Lakehouse
58:46
Summary A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combinatio…
Data Sharing Across Business And Platform Boundaries
59:55
Summary Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains …
Tackling Real Time Streaming Data With SQL Using RisingWave
56:55
Summary Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continu…
Build A Data Lake For Your Security Logs With Scanner
1:02:38
Summary Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying…
Modern Customer Data Platform Principles
1:01:33
Summary Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how…
Pushing The Limits Of Scalability And User Experience For Data Processing With Jignesh Patel
50:26
Summary Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user ex…
Designing Data Platforms For Fintech Companies
47:56
Summary Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector. Announcements Hello and welcome to the Data Engineer…
Troubleshooting Kafka In Production
1:14:43
Summary Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: Troubleshooting in Production". In this episode he hi…
Adding An Easy Mode For The Modern Data Stack With 5X
56:12
Summary The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X unders…
Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack
51:17
Summary If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. And…
Designing Data Transfer Systems That Scale
1:03:57
Summary The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in …
Addressing The Challenges Of Component Integration In Data Platform Architectures
29:42
Summary Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is faci…
Unlocking Your dbt Projects With Practical Advice For Practitioners
1:16:04
Summary The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects…
Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine
1:07:52
Summary Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yaha…
Shining Some Light In The Black Box Of PostgreSQL Performance
54:51
Summary Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solution of many performance bottlenecks and the work that he is doi…
Surveying The Market Of Database Products
47:12
Summary Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she…
Defining A Strategy For Your Data Products
1:03:50
Summary The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the develop…
Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable
1:08:28
Summary Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems …
Using Data To Illuminate The Intentionally Opaque Insurance Industry
51:58
Summary The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry. Announcements He…
Building ETL Pipelines With Generative AI
51:36
Summary Artificial intelligence applications require substantial high quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with…
Powering Vector Search With Real Time And Incremental Vector Indexes
59:16
Summary The rapid growth of machine learning, especially large language models, has led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data. Annou…
Building Linked Data Products With JSON-LD
1:01:30
Summary A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for buildi…
An Overview Of The State Of Data Orchestration In An Increasingly Complex Data Ecosystem
1:01:25
Summary Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this ep…
Eliminate The Overhead In Your Data Integration With The Open Source dlt Library
42:12
Summary Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to elim…
Building An Internal Database As A Service Platform At Cloudflare
1:01:09
Summary Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low lat…
Harnessing Generative AI For Creating Educational Content With Illumidesk
54:52
Summary Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating e…
Unpacking The Seven Principles Of Modern Data Pipelines
47:02
Summary Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your dat…
Quantifying The Return On Investment For Your Data Team
1:01:52
Summary As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your compa…
Strategies For A Successful Data Platform Migration
1:09:52
Summary All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that …
Build Real Time Applications With Operational Simplicity Using Dozer
40:42
Summary Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data in reach of application engineers Matteo Pelati helped to create Dozer. In this episode he explains how investing in high performance…
Datapreneurs - How Today's Business Leaders Are Using Data To Define The Future
54:45
Summary Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book "Datapreneurs" he reflects on the people and businesses that he has known and worked with and how they relied on d…
Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling
1:12:54
Summary For business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow…
How Data Engineering Teams Power Machine Learning With Feature Platforms
1:03:29
Summary Feature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems th…
Seamless SQL And Python Transformations For Data Engineers And Analysts With SQLMesh
50:19
Summary Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful en…
How Column-Aware Development Tooling Yields Better Data Models
46:19
Summary Architectural decisions are all based on certain constraints and a desire to optimize for different outcomes. In data systems one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. By incorporating column-level lineage in the data modeling process…
Build Better Tests For Your dbt Projects With Datafold And data-diff
48:21
Summary Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt pr…
Reduce The Overhead In Your Pipelines With Agile Data Engine's DataOps Service
54:05
Summary A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, …
A Roadmap To Bootstrapping The Data Team At Your Startup
42:31
Summary Building a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his h…
Keep Your Data Lake Fresh With Real Time Streams Using Estuary
55:50
Summary Batch vs. streaming is a long running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streamin…