The Challenges Managing a Kubernetes-Based Machine Learning Infrastructure

2022-10-27

Authors: Keith, Keshi Dai, Yuzhui Liu, Ed Shee


Abstract

Managing a machine learning infrastructure is a major challenge because its scope covers both common infrastructure tasks – such as cluster management, networking, security, container management, and observability – and ML-focused tasks – such as GPU compute, data exploration, distributed training, and model serving. Kubernetes and its thriving open source ecosystem provide strong infrastructure tools (e.g., Knative, Cloud Native Buildpacks, Argo, and Envoy), as well as ML-focused projects (e.g., Kubeflow, KServe, Seldon Core, and KubeRay), that enable infrastructure engineers to build a modern machine learning infrastructure. In this panel, you’ll hear from engineers at Bloomberg, Seldon, and Spotify about how they’re using the Kubernetes ecosystem to provide machine learning infrastructure and about the challenges they currently face. Panelists represent a variety of use cases, including end users and infrastructure providers, as well as both on-prem and cloud-based infrastructures.

Materials: