
Kubernetes For GPU Powered Machine Learning Workloads In Academia - Camille Rodriguez, Canonical & John-Paul Robinson, University of Alabama at Birmingham

2022-10-27

Authors:   John-Paul Robinson, Camille Rodriguez


Summary

The presentation discusses the use of Kubernetes (K8s) in research computing, particularly in machine learning operations (MLOps) workflows. The speakers highlight the need for a K8s platform to handle the environment configuration and workflow integration that MLOps requires. The presentation also touches on the challenges of managing different CUDA versions and the need for generous resource provisioning to handle large models in containers.
  • Kubernetes is being used in research computing, particularly in MLOps workflows
  • A K8s platform is needed to handle the environment configuration and workflow integration required by MLOps
  • Managing different CUDA versions can be challenging
  • Generous resource provisioning is needed to handle large models in containers
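The CUDA-versioning and provisioning points above can be sketched as a single pod spec. The image tag and resource sizes below are illustrative assumptions, not values from the talk:

```yaml
# Illustrative GPU pod spec (name, image tag, and sizes are hypothetical).
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      # The CUDA toolkit version is pinned via the container image tag, so
      # different workloads can use different CUDA versions on the same nodes.
      image: nvcr.io/nvidia/pytorch:22.09-py3
      resources:
        limits:
          nvidia.com/gpu: 1   # scheduled onto a GPU node via the device plugin
          memory: 64Gi        # generous memory for large models in containers
        requests:
          cpu: "8"
          memory: 64Gi
```

The `nvidia.com/gpu` resource is exposed by the NVIDIA device plugin on GPU nodes; requesting it is what causes the scheduler to place the pod on GPU-capable hardware.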
The speakers mention that the researchers they support represent about 30% of the research revenue at the University of Alabama at Birmingham. They also note that their mascot is a dragon, not an elephant; the elephant is a reference to the state's popular football team. The speakers also discuss the use of high-performance computing clusters and the need to keep data close to the CPU over the fastest available links.

Abstract

This talk aims to inform architects and users of Kubernetes, as well as teams planning to transition to Kubernetes for research purposes, about how we designed a high-performing Kubernetes cluster geared specifically towards machine learning and AI workloads. On the architectural side, NVIDIA DGX A100 machines provide unprecedented compute density and performance for those workloads. Those nodes are integrated into the cluster with open-source software. We will also cover our challenges and successes in integrating with other components, such as external Ceph storage, the GitLab registry and runners, and SAML authentication. The University of Alabama at Birmingham team will cover how they leverage container-enabled GPUs for their research and development workloads. Research workloads increasingly demand access to ad hoc, GPU-enabled compute capacity with complex software environments to power cloud-native workflows. K8s helps address needs ranging from regular ML training runs to supporting software development via CI pipelines.
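As one example of the CI-pipeline use the abstract mentions, a GPU-backed GitLab job might look like the following. The runner tag, image, and script are illustrative assumptions, not details from the talk:

```yaml
# Hypothetical .gitlab-ci.yml fragment; tag, image, and commands are assumed.
train-smoke-test:
  stage: test
  tags:
    - k8s-gpu          # routes the job to a GitLab runner on the GPU cluster
  image: nvcr.io/nvidia/cuda:11.8.0-base-ubuntu22.04
  script:
    - nvidia-smi                    # verify the container actually sees a GPU
    - python train.py --epochs 1    # short training run as a CI sanity check
```

Runner tags are the usual mechanism for steering only GPU-dependent jobs onto scarce GPU capacity while ordinary jobs run on CPU-only runners.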

Materials:
