Authors: Kelly O'Malley
2022-06-22

By 2025 we’re estimated to generate 463 exabytes of data every day (weforum). With the advent of big data technologies over the last few years, we’re in a better place than ever to make use of this data: building models, creating dashboards. Still, 463 exabytes has a lot of zeros, and fast compute engines can only get us so far if we can’t get to that data in the first place. Data lakes have been a step in the right direction; however, data lakes love to turn into data swamps. Today’s talk will discuss a solution: Delta Lake. Delta provides an open-source framework on top of a data lake that enables massive scalability while preventing garbage data from breaking downstream systems. We’ll start with the construction of Delta Lake: how it builds on Parquet, and how compute engines like Spark, Trino, and Flink can interact with its transaction log to process massive amounts of metadata. We’ll also discuss how that transaction log can be used to “travel through time” while maintaining ACID guarantees on tables backed by Delta. Concerned about bad writes? Delta’s schema enforcement (and evolution) capabilities can handle that. Finally, we’ll wrap up with what’s coming to Delta Lake in the world of data skipping (after all, the fastest way to process data is to not touch it to begin with).
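To make the transaction-log features concrete, here is a minimal PySpark sketch (not from the talk itself) of Delta's time travel and schema enforcement; the table path and column names are illustrative, and it assumes the delta-spark package is installed.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession, functions as F

# Build a SparkSession with the Delta Lake extensions enabled
# (assumes the delta-spark pip package is installed).
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/events"  # illustrative table location

# Version 0: every write becomes a commit in the table's transaction log.
spark.range(5).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("overwrite").save(path)

# Schema enforcement: an append whose schema doesn't match is rejected ...
bad = spark.range(5).withColumnRenamed("id", "event_id").withColumn("extra", F.lit("x"))
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as err:
    print("rejected by schema enforcement:", type(err).__name__)

# ... unless schema evolution is explicitly requested.
bad.write.format("delta").mode("append").option("mergeSchema", "true").save(path)

# Time travel: read the table as it existed at an earlier version.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```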
Authors: Bowen Li, Huichao Zhao
2022-05-18

tldr - powered by Generative AI

The presentation discusses the design principles and architecture of a cloud-native Spark on Kubernetes platform, highlighting the benefits of cloud and Kubernetes and the need for auto-scaling driven by cost savings and elasticity.
  • Cloud and Kubernetes can solve the problems of legacy infrastructure by providing on-demand, elastic, and scalable resources with strong resource isolation and cutting-edge security techniques.
  • Design principles include fully embracing the public cloud and a cloud-native way of thinking, containerization for elasticity and reproducibility, and decoupling compute and storage for independent scaling.
  • The architecture involves multiple Spark Kubernetes clusters, a Spark service gateway, and a multi-tenant platform with advanced features such as physical isolation and min/max capacity settings.
  • Auto-scaling is necessary for cost savings and elasticity; the presentation covers the design of reactive auto-scaling and its productionization (see the configuration sketch after this list).
  • The platform has been running in production for a year, supporting many business-critical workloads for Apple AML.
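The platform described in the talk is internal to Apple, so the sketch below is only a generic illustration of the building blocks the summary mentions: Spark executors scheduled as Kubernetes pods, reactive scaling via dynamic allocation, and storage decoupled into an object store. The API server URL, image, namespace, bucket, and scaling bounds are placeholders, not the platform's actual configuration.

```python
from pyspark.sql import SparkSession

# All names below (API server URL, image, namespace, bucket) are placeholders.
spark = (
    SparkSession.builder.appName("spark-on-k8s-demo")
    # Run executors as pods in a Kubernetes cluster.
    .master("k8s://https://kubernetes.example.com:6443")
    .config("spark.kubernetes.namespace", "analytics")
    .config("spark.kubernetes.container.image", "example.com/spark-py:3.3.0")
    # Reactive auto-scaling: grow and shrink executors with the task backlog.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    # Decoupled storage: data lives in object storage, not on the cluster.
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-bucket/events/")
df.groupBy("event_type").count().show()
```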
Authors: Kevin Kho, Han Wang
2021-10-15

tldr - powered by Generative AI

Fugue is an open-source abstraction layer that allows users to port native Python code to Spark or Dask with minimal code changes, making data science code framework-agnostic and scale-agnostic.
  • Data scientists often find themselves reimplementing the same code to transition from Pandas to Spark when data grows too large for Pandas to handle
  • Fugue solves this problem by providing an abstraction layer that allows users to port native Python code to Spark or Dask with minimal code changes
  • Fugue makes data science code framework-agnostic and scale-agnostic, allowing it to be ported to different execution environments
  • Fugue was demonstrated by showing how to scale data compute from a single machine to a Spark cluster set up on Kubernetes (see the minimal example after this list)
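As a rough sketch of the workflow summarized above (the data and tax-rate logic are invented for illustration, not taken from the talk), Fugue's transform() can run the same pandas function locally and then on Spark:

```python
import pandas as pd
from fugue import transform
from pyspark.sql import SparkSession

# Plain pandas logic, written once, with no Spark or Dask imports.
def add_price_with_tax(df: pd.DataFrame, rate: float) -> pd.DataFrame:
    df = df.copy()
    df["price_with_tax"] = df["price"] * (1 + rate)
    return df

df = pd.DataFrame({"item": ["a", "b", "c"], "price": [10.0, 20.0, 30.0]})

# Run locally on pandas (no engine keeps everything on one machine).
local_result = transform(
    df, add_price_with_tax,
    schema="*,price_with_tax:double",
    params={"rate": 0.08},
)

# The same function scales out unchanged by passing a SparkSession as the engine.
spark = SparkSession.builder.getOrCreate()
spark_result = transform(
    df, add_price_with_tax,
    schema="*,price_with_tax:double",
    params={"rate": 0.08},
    engine=spark,
)
spark_result.show()
```

The only thing that changes between the local and distributed runs is the engine argument, which is what makes the code framework-agnostic and scale-agnostic.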