Building and Managing a Centralized ML Platform with Kubeflow at CERN

Conference: KubeCon + CloudNativeCon Europe 2021

Authors: Ricardo Rocha, Dejan Golubovic

Summary

The presentation describes how CERN uses Kubeflow for machine learning development and deployment, including the challenges encountered and the solutions adopted.

Kubeflow is a platform for managing the machine learning lifecycle, from data preparation through model training to serving
The presentation focuses on the use of Kubeflow for a 3DGAN model, which had previously required extensive training time
Kubeflow helped reduce the execution time from one hour to 30 seconds for one epoch and from 60 hours to around 30 minutes for the full training
Challenges faced included inconsistent releases and difficulty managing additional packages
The presentation also discusses the use of external clusters for bursting to public clouds, which can provide access to more GPUs and other accelerators

The 3DGAN model previously needed about 2.5 days (60 hours) to train fully; with Kubeflow, the training time was reduced to around 30 minutes.
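
As a quick consistency check on the figures quoted in this summary (one hour to 30 seconds per epoch, and 60 hours to roughly 30 minutes for the full run), the short Python sketch below works out the implied speedup. The helper name `speedup` is purely illustrative, and "around 30 minutes" is treated as exactly 30 minutes, so the result is approximate.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_seconds / after_seconds


HOUR = 3600  # seconds in one hour

# Per-epoch figures from the summary: one hour before, 30 seconds after.
epoch_speedup = speedup(1 * HOUR, 30)

# Full-training figures from the summary: 60 hours before, about 30 minutes after.
# Assumption: "around 30 minutes" is taken as exactly 30 minutes.
full_speedup = speedup(60 * HOUR, 30 * 60)

# 60 hours expressed in days, to compare with the "2.5 days" figure.
days_before = 60 / 24

print(f"per-epoch speedup:     {epoch_speedup:.0f}x")
print(f"full-training speedup: {full_speedup:.0f}x")
print(f"original training:     {days_before} days")
```

Both ratios come out at about 120x, and 60 hours equals 2.5 days, so the per-epoch and full-training numbers in the summary agree with each other.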

Abstract

CERN’s main mission is to expand human knowledge by trying to understand the nature of the universe, and machine learning has been growing as a solution for challenges in different areas of development and operations. Areas where ML is being looked at include particle classification using graph neural networks during reconstruction, 3DGANs for much faster generation of simulation data, and reinforcement learning for beam calibration. This session presents a recently introduced centralized service covering most use cases, handling data preparation, model training and serving. It explains how the service tries to improve resource usage (especially important when handling scarce resources such as accelerators) by offering different resource types (GPU, vGPU, TPU) for each use case. The session will also describe our journey with Kubeflow, the machine learning platform running on top of Kubernetes, how we integrated on-premises resources, and the different possibilities being looked at to extend to public clouds.
