Authors: Irvin Lim, Hailin Xiang
2023-04-19

tldr - powered by Generative AI

The presentation describes a colocation system for Kubernetes and YARN workloads, covering risk classification and release automation for rolling it out safely, comprehensive observability built on both low-level and high-level metrics, and the use of colocation cost as the measure of the project's effectiveness.
  • Implemented a colocation system for Kubernetes and YARN workloads
  • Risk classification and release automation for safe rollout
  • Comprehensive observability through low-level and high-level metrics
  • Colocation cost as the metric for evaluating the project's effectiveness
Authors: Kelly O'Malley
2022-06-22

By 2025 we’re estimated to generate 463 exabytes of data every day (weforum). With the advent of big data technologies over the last few years, we’re in a better place than ever to make use of this data: build models, create dashboards. Still, 463 exabytes has a lot of zeros - fast compute engines can only get us so far if we can’t get to that data to begin with. Data lakes have been a step in the right direction; however, data lakes love to turn into data swamps. Today’s talk will discuss a solution: Delta Lake. Delta provides an open-source framework on top of a data lake that enables massive scalability while preventing garbage data from breaking downstream systems. We’ll start with the construction of Delta Lake: how it builds on Parquet, and how compute engines like Spark, Trino, and Flink can interact with its transaction log to process massive amounts of metadata. We’ll also discuss how that transaction log can be used to “travel through time” while maintaining ACID guarantees on tables backed by Delta. Concerned about bad writes? Delta’s schema enforcement (and evolution) capabilities can handle that. Finally, we’ll wrap up with what’s coming to Delta Lake in the world of data skipping (after all, the fastest way to process data is to not touch it to begin with).
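The core idea behind the abstract's "time travel" claim is that every write appends a numbered commit to an ordered transaction log, so any past table state can be recovered by replaying the log up to a given version. The toy class below illustrates that mechanism in plain Python; it is a conceptual sketch only, not the real Delta Lake protocol or the delta-spark/delta-rs APIs (names like `TinyLog` are invented for illustration).

```python
# Conceptual sketch of a Delta-style transaction log: each write appends a
# numbered JSON commit file, and "time travel" replays commits up to a version.
# NOT the actual Delta Lake protocol -- an illustration of the idea only.
import json
import os
import tempfile


class TinyLog:
    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _commits(self):
        # Zero-padded filenames sort lexicographically in commit order.
        return sorted(os.listdir(self.log_dir))

    def commit(self, rows):
        """Append one commit (a list of added rows) as the next log entry."""
        version = len(self._commits())
        path = os.path.join(self.log_dir, f"{version:020d}.json")
        with open(path, "w") as f:
            json.dump({"add": rows}, f)
        return version

    def read(self, version=None):
        """Replay commits up to `version` (inclusive); latest state if None."""
        rows = []
        for i, name in enumerate(self._commits()):
            if version is not None and i > version:
                break
            with open(os.path.join(self.log_dir, name)) as f:
                rows.extend(json.load(f)["add"])
        return rows


log = TinyLog(tempfile.mkdtemp())
log.commit([{"id": 1}])            # creates version 0
log.commit([{"id": 2}])            # creates version 1
print(len(log.read()))             # 2 rows at the latest version
print(len(log.read(version=0)))   # 1 row when "travelling" back to version 0
```

Because commits are append-only and atomically numbered, readers always see a consistent snapshot, which is the same property that gives Delta tables their ACID guarantees.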
Authors: Yang Che, Yuandong Xie
2021-10-14

tldr - powered by Generative AI

Fluid is an open-source project that provides an efficient and convenient data abstraction for data-intensive tasks in the cloud-native field, addressing the problems created by separating storage from compute.
  • Data-intensive tasks suffer under a disaggregated storage/compute architecture: computing efficiency drops and the underlying storage system comes under heavy overhead pressure.
  • Fluid provides data affinity scheduling, distributed cache engine acceleration, and multi-source data integration across data lakes.
  • Fluid's data scheduling accelerates a large number of big data and AI workloads on Alibaba Cloud and Tencent Cloud.
  • Fluid's architecture comprises two custom resources, a dataset and a runtime, and two major components, a controller manager and a scheduler.
  • Fluid's dataset provides a unified interface for accessing data from IDC and the cloud, and can accelerate data access through a distributed cache.
  • Fluid's scheduler intelligently schedules jobs onto cache nodes and notifies the runtime to prefetch data to a specified node.
  • Fluid's demo shows how to use Fluid to accelerate a machine learning training job, and provides automatic expansion mechanisms for the distributed cache.
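The two custom resources described above are declared as Kubernetes manifests: a Dataset pointing at the underlying storage, paired with a Runtime that backs it with a distributed cache. The fragment below is a minimal sketch of that pairing using an Alluxio-backed runtime; the bucket path and sizing values are hypothetical placeholders, not from the talk.

```yaml
# Sketch of Fluid's two custom resources (values are illustrative only).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    - mountPoint: oss://my-bucket/training-data   # hypothetical data source
      name: training-data
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo            # same name binds the runtime to the Dataset
spec:
  replicas: 2           # number of cache worker nodes
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi      # cache capacity per worker
```

Once both objects are applied, Fluid exposes the dataset as a PersistentVolumeClaim that training jobs mount, and the scheduler can place those jobs on the nodes holding the cache.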