Deep learning is pushing the limits of what AI can do, from natural language processing to computer vision and autonomous vehicles. Scaling deep learning to multiple GPUs and multiple machines has become critical to reducing training time and solving ever bigger problems. Horovod is a popular open source framework for distributing and scaling the training of TensorFlow, PyTorch, and MXNet models. On the verge of Horovod's v1.0 release, we look back at Horovod's journey and the lessons learned putting deep learning training in production, from its open source debut in 2017 to its presence across the major deep learning ecosystems since joining the Linux Foundation. We will explain the motivations and key innovations that fueled the development of Horovod and set new records in deep learning performance benchmarks. Finally, we'll walk through practical examples to demonstrate how you can scale your models to train on hundreds of GPUs with Horovod, and explain how Horovod fits into production ML workflows running on diverse platforms such as Kubernetes, Spark, Ray, and Slurm.
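As a taste of what those practical examples look like, here is a minimal sketch of the canonical Horovod setup for a PyTorch training script. The model and learning rate below are placeholders, but the Horovod calls themselves (hvd.init, hvd.DistributedOptimizer, and the broadcast helpers) are the core of the API:

```python
# Minimal Horovod + PyTorch setup sketch; the model and hyperparameters
# are placeholders, not a recommendation.
import torch
import horovod.torch as hvd

hvd.init()                                # start Horovod: one process per GPU
torch.cuda.set_device(hvd.local_rank())   # pin this process to its local GPU

model = torch.nn.Linear(784, 10).cuda()   # placeholder model
# A common convention is to scale the learning rate by the number of workers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all workers via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Broadcast initial parameters and optimizer state from rank 0
# so every worker starts from an identical state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

A script like this is typically launched with the horovodrun CLI, e.g. `horovodrun -np 8 python train.py` on a single 8-GPU machine, or `horovodrun -np 8 -H host1:4,host2:4 python train.py` to spread the same eight processes across two hosts.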