
Accelerating High-Performance Machine Learning at Scale in Kubernetes

2022-05-18

Authors: Alejandro Saucedo, Elena Neroslavskaya


Summary

The presentation covers machine learning acceleration at scale, model optimization, deployment to Kubernetes, and an introduction to production cloud native tooling.
  • Running the ML server locally is important to verify that everything works and to debug any issues before deploying to production.
  • Pointers to further resources on CI/CD for production machine learning at scale, production machine learning monitoring, machine learning security, and the machine learning ecosystem and operations.
  • Collaboration with the Hugging Face team to access a pre-trained GPT-2 model through their Transformers library.
  • Optimization of the model using the ONNX serialization format.
  • Deployment to a Kubernetes cluster after local testing confirms the model works.
  • An anecdote about a computationally intensive dungeon crawler game that uses an AI model for personalization.
The presentation mentions a "choose your own adventure" dungeon crawler game in which users interact with an AI model; the game is computationally intensive and showcases the need for model optimization and acceleration.

Abstract

Identifying the right tools for high-performance production machine learning can be overwhelming as the ecosystem continues to grow at breakneck speed. In this industry collaboration we aim to provide a hands-on guide on how practitioners can productionize optimized machine learning models in cloud native ecosystems using production-ready open source frameworks. We will dive into a practical use case: deploying the renowned GPT-2 NLP machine learning model to Kubernetes, leveraging the ONNX Runtime through the Seldon Core Triton server. This will provide us with a scalable production NLP microservice serving the ML model, which can power intelligent text generation applications. We will also present some of the key challenges currently being faced in the MLOps space, as well as how each of the tools in the stack interoperates throughout the production machine learning lifecycle.
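As one concrete illustration of the deployment described in the abstract, a Seldon Core `SeldonDeployment` manifest pointing the Triton server at an ONNX model might look like the following sketch. The resource name and storage URI are placeholders, not the presenters' actual configuration:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gpt2
spec:
  protocol: v2                   # the V2 inference protocol spoken by Triton
  predictors:
    - name: default
      graph:
        name: gpt2               # must match the model name in the repository
        implementation: TRITON_SERVER
        modelUri: gs://my-models/gpt2-onnx   # placeholder bucket holding the ONNX model
      replicas: 1
```

Applied with `kubectl`, this asks Seldon Core to launch a Triton server that pulls the serialized model from the configured storage bucket and exposes it as a scalable microservice behind the V2 inference protocol.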

Materials: