
Fluid - Build Data Orchestration in Kubernetes

2021-10-14

Authors:   Yang Che, Yuandong Xie


Summary

Fluid is an open-source project that provides an efficient and convenient data abstraction for data-intensive tasks in the cloud-native field, addressing the problems introduced by the separation of storage and compute.
  • Data-intensive tasks face problems in the separation of storage and computing architecture, leading to reduced computing efficiency and huge overhead pressure on the underlying storage system.
  • Fluid provides data affinity scheduling, distributed cache engine acceleration, and multi-source data integration for building a data lake.
  • Fluid's data scheduling accelerates a large number of big data and AI workloads in Alibaba Cloud and Tencent Cloud.
  • Fluid's architecture includes two custom resources, a Dataset and a Runtime, and two major components, a controller manager and a scheduler.
  • Fluid's Dataset provides a unified interface for accessing data from both IDC and cloud storage and can accelerate data access through a distributed cache.
  • Fluid's scheduler intelligently schedules jobs to cache nodes and notifies the runtime to prefetch data to a specified node.
  • Fluid's demo shows how to use Fluid to accelerate a machine learning training job and provides an automatic scaling mechanism for the distributed cache.
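The Dataset and Runtime pairing described above can be illustrated with a minimal manifest sketch. The dataset name, mount URL, and cache sizes are hypothetical placeholders, and Alluxio is just one of the cache runtimes Fluid supports:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset                 # hypothetical name
spec:
  mounts:
    - mountPoint: https://mirrors.example.com/data   # hypothetical source URL
      name: demo
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo-dataset                 # must match the Dataset name to bind
spec:
  replicas: 2                        # number of cache workers
  tieredstore:
    levels:
      - mediumtype: MEM              # cache tier backed by memory
        path: /dev/shm
        quota: 2Gi                   # hypothetical cache capacity
```

Once both objects are created, the controller manager binds them and exposes a PersistentVolumeClaim of the same name that Pods can mount, while the scheduler prefers nodes that already hold cached data.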
In a real customer AI training case, the training data was large but not sensitive, so it was placed on cloud object storage such as S3. The validation data, however, was sensitive and could not be placed on the cloud, so it had to stay on IDC storage. Fluid's CRDs provided a unified view of both sources and distributed cache acceleration, dynamically moving data from the IDC to GPU instances on the cloud at training time and speeding up data access. When training was not running, the cached data could be migrated to low-cost CPU nodes, avoiding idle GPUs and saving bandwidth on dedicated network lines.
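The hybrid setup in this case can be sketched as a single Dataset with two mounts, one for the cloud object store and one for the IDC storage. All names and endpoints below are hypothetical, and the exact mount options depend on the underlying storage systems:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hybrid-training-data               # hypothetical name
spec:
  mounts:
    # Cloud: large, non-sensitive training data on object storage
    - mountPoint: s3://training-bucket/data          # hypothetical bucket
      name: train
    # IDC: sensitive validation data that must stay on-premises
    - mountPoint: nfs://idc-nas.internal/validation  # hypothetical IDC endpoint
      name: validation
```

Training Pods then see both sources under one unified namespace (e.g. /train and /validation), while the cache runtime decides what to pull onto the cloud GPU nodes.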

Abstract

In the cloud-native field, data-intensive tasks such as big data and AI face many problems under the separation of storage and computing architecture. For example, network I/O bottlenecks reduce computing efficiency, and the underlying storage system comes under huge overhead pressure. On the other hand, managing multi-source data is very complicated, which is a challenge for algorithm scientists. In this talk, we introduce an efficient and convenient data abstraction that decouples data from storage and, through Fluid, provides data affinity scheduling, distributed cache engine acceleration, and multi-source data integration into a data lake. In Alibaba Cloud and Tencent Cloud, a large number of big data and AI workloads are accelerated through Fluid's data scheduling.
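Fluid's data scheduling pairs job placement with cache warm-up; a prefetch can also be requested declaratively. The sketch below assumes Fluid's v1alpha1 DataLoad resource, with hypothetical dataset and path names:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: warmup-demo              # hypothetical name
spec:
  dataset:
    name: demo-dataset           # hypothetical Dataset to warm up
    namespace: default
  target:
    - path: /train               # hypothetical path to prefetch
      replicas: 1                # number of cache replicas to load
```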

Materials: