Scaling Machine Learning Workflows to Big Data with Fugue

2021-10-15

Authors: Kevin Kho, Han Wang


Summary

Fugue is an open-source abstraction layer that lets users port native Python code to Spark or Dask with minimal code changes, making data science code framework-agnostic and scale-agnostic.
  • Data scientists often reimplement the same logic when moving from Pandas to Spark because the data has grown too large for Pandas to handle.
  • Because code written with Fugue is framework-agnostic and scale-agnostic, it can be ported to different execution environments.
  • The talk demonstrates this by scaling data compute from a single machine to a Spark cluster set up on Kubernetes.
The demo applies a piece of business logic to a small Pandas data frame and then brings it to Spark with minimal code changes, highlighting the differences between Pandas and Spark along the way.

Abstract

Data scientists often use Pandas for data that fits on a single machine, and Spark or Dask for larger datasets that need distributed computing power. What happens, though, when the data starts small and then grows beyond what Pandas can handle? Data scientists often find themselves reimplementing the same code to transition to Spark, so that even code with identical business logic needs two separate implementations. Fugue is an open-source abstraction layer that solves this. This talk shows how Fugue lets users port native Python code to Spark or Dask with minimal code changes. With Fugue, data science code is written in a framework-agnostic and scale-agnostic manner, which allows it to be ported to different execution environments. This is demonstrated by scaling data compute from a single machine to a Spark cluster set up on Kubernetes.
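The idea of writing business logic once and choosing the execution engine separately can be sketched in plain Python. This is a conceptual illustration only and does not use Fugue's API: the names `add_total`, `run_local`, and `run_partitioned` are hypothetical, and a thread pool over row chunks stands in for a distributed engine such as Spark or Dask.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Row = Dict[str, float]
Logic = Callable[[List[Row]], List[Row]]


def add_total(rows: List[Row]) -> List[Row]:
    """Business logic, written once, with no knowledge of the engine."""
    return [{**row, "total": row["price"] * row["qty"]} for row in rows]


def run_local(logic: Logic, rows: List[Row]) -> List[Row]:
    """Hypothetical single-machine engine: apply the logic to all rows at once."""
    return logic(rows)


def run_partitioned(logic: Logic, rows: List[Row], partitions: int = 2) -> List[Row]:
    """Hypothetical stand-in for a distributed engine.

    Splits the rows into contiguous chunks, applies the same logic to each
    chunk in a thread pool, and concatenates the results in order.
    """
    size = max(1, -(-len(rows) // partitions))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        results = list(pool.map(logic, chunks))
    return [row for chunk in results for row in chunk]


sample = [
    {"price": 2.0, "qty": 3.0},
    {"price": 1.5, "qty": 4.0},
    {"price": 10.0, "qty": 1.0},
]

local_result = run_local(add_total, sample)
partitioned_result = run_partitioned(add_total, sample, partitions=2)
```

Only the runner changes between the two calls; `add_total` is untouched. This works here because the logic is row-wise, so splitting the data into chunks does not change the result.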

