Authors: Kelly O'Malley
2022-06-22

By 2025, we’re estimated to generate 463 exabytes of data every day (weforum). With the advent of big data technologies over the last few years, we’re in a better place than ever to make use of this data: build models, create dashboards. Still, 463 exabytes has a lot of zeros, and fast compute engines can only get us so far if we can’t get to the data in the first place. Data lakes have been a step in the right direction; however, data lakes love to turn into data swamps. Today’s talk will discuss a solution: Delta Lake. Delta provides an open-source storage framework on top of a data lake that enables massive scalability while preventing garbage data from breaking downstream systems. We’ll start with the construction of Delta Lake: how it builds on Parquet, and how compute engines like Spark, Trino, and Flink interact with its transaction log to process massive amounts of metadata. We’ll also discuss how that transaction log can be used to “travel through time” while maintaining ACID guarantees on tables backed by Delta. Concerned about bad writes? Delta’s schema enforcement (and evolution) capabilities can handle that. Finally, we’ll wrap up with what’s coming to Delta Lake in the world of data skipping (after all, the fastest way to process data is not to touch it at all).
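To make these ideas concrete, here is a minimal PySpark sketch, not part of the talk itself: the table path and column names are hypothetical, and it assumes the delta-spark package is installed and registered with the Spark session. It shows the Parquet-plus-transaction-log layout, time travel with `versionAsOf`, and schema enforcement versus explicit schema evolution on append.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Assumes delta-spark is installed (e.g. `pip install delta-spark`) and the
# Delta extensions are configured on the session.
spark = (
    SparkSession.builder.appName("delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/events_delta"  # hypothetical table location

# Writing in the "delta" format stores Parquet data files plus a _delta_log/
# directory of JSON commits: the transaction log described in the abstract.
spark.range(0, 100).withColumn("source", lit("batch_1")) \
    .write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as it existed at an earlier committed version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Schema enforcement: appending a frame whose schema doesn't match the table
# is rejected instead of silently corrupting downstream consumers.
bad = spark.range(0, 10).withColumn("unexpected_col", lit(True))
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as err:
    print(f"Rejected by schema enforcement: {err}")

# Schema evolution: explicitly opt in to adding the new column.
bad.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)
```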