Authors: Kelly O'Malley
2022-06-22

By 2025 we’re estimated to generate 463 exabytes of data every day (World Economic Forum). With the advent of big data technologies over the last few years, we’re in a better place than ever to make use of this data: building models, creating dashboards. Still, 463 exabytes has a lot of zeros; fast compute engines can only get us so far if we can’t get to that data to begin with. Data lakes have been a step in the right direction; however, data lakes love to turn into data swamps. Today’s talk will discuss a solution: Delta Lake. Delta provides an open-source framework on top of a data lake that enables massive scalability while preventing garbage data from breaking downstream systems. We’ll start with the construction of Delta Lake: how it builds on Parquet, and how compute engines like Spark, Trino, and Flink can interact with its transaction log to process massive amounts of metadata. We’ll also discuss how that transaction log can be used to “travel through time” while maintaining ACID guarantees on tables backed by Delta. Concerned about bad writes? Delta’s schema enforcement (and evolution) capabilities can handle that. Finally, we’ll wrap up with what’s coming to Delta Lake in the world of data skipping (after all, the fastest way to process data is not to touch it in the first place).
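
To make the schema enforcement and time travel ideas concrete, here is a minimal PySpark sketch. It assumes a Spark session with the delta-spark package configured; the table path and column names are hypothetical, not from the talk.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/events"  # hypothetical table location

# Version 0: write an initial table with a single event_id column.
spark.range(100).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending a DataFrame whose schema doesn't match the
# table's fails loudly instead of silently breaking downstream readers.
renamed = spark.range(10).withColumnRenamed("id", "evnt_id")  # typo'd column
try:
    renamed.write.format("delta").mode("append").save(path)
except Exception as err:
    print(f"write rejected by schema enforcement: {err}")

# Schema evolution: opt in explicitly when the schema change is intentional.
renamed.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)

# Time travel: the transaction log lets us read the table as of version 0,
# before the evolved append.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 100
```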
Authors: Ismaël Mejía
2022-06-22

Table formats like Delta Lake and Apache Iceberg are recent storage specifications for handling slow-changing collections of files in distributed systems. They are rapidly gaining adoption by bringing new superpowers to the data engineering toolkit. In this talk, Ismaël will introduce table formats and explain how they work, and how features like versioning, schema evolution, time travel, and scalable metadata have positive consequences for many systems in the Data+AI ecosystem: from scalable metadata handling to incremental, faster data updates, as well as reproducible data for AI training and inference.
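
As a concrete illustration of how a table format's metadata layer works, the minimal sketch below inspects Delta Lake's transaction log directly (Iceberg keeps analogous metadata in manifest and snapshot files). It assumes a Delta table already exists at the hypothetical path shown; the path is an assumption, not from the talk.

```python
import json
import os

# Delta writes one newline-delimited JSON file per commit under _delta_log;
# the sorted file names (00000000000000000000.json, ...) are the versions
# that versioning and time travel are built on.
log_dir = "/tmp/delta/events/_delta_log"

for name in sorted(os.listdir(log_dir)):
    if not name.endswith(".json"):
        continue  # skip .crc files and Parquet checkpoints
    with open(os.path.join(log_dir, name)) as commit:
        for line in commit:
            action = json.loads(line)
            # Each action is one of: protocol, metaData (the schema),
            # add/remove (data files plus stats), or commitInfo
            # (operation, timestamp) -- the raw material for schema
            # evolution, time travel, and scalable metadata handling.
            print(name, list(action.keys()))
```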