
Serving Machine Learning Models at Scale Using KServe

2021-10-14

Authors: Animesh Singh


Summary

KServe is a highly scalable, standards-based model inference platform on Kubernetes for trusted AI. It addresses the challenges of deploying machine learning models in production systems.
  • Deploying machine learning models in production systems is difficult and requires weighing deployment cost, monitoring, security, and scalability.
  • KServe addresses these challenges by providing a highly scalable, standards-based model inference platform on Kubernetes for trusted AI.
  • KServe integrates with multiple popular model servers in the industry and supports various machine learning frameworks.
  • KServe defines a standard inference protocol to provide a unified user experience and to integrate easily with multiple model servers (see the sketch after this list).
  • KServe addresses scalability limits by reducing resource overhead and deploying multiple models in one inference service.
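As a rough illustration of the standard (V2) inference protocol mentioned above, the sketch below sends a prediction request to a deployed model over HTTP. The hostname, model name, and tensor values are placeholder assumptions for illustration, not details from the talk.

    # Minimal sketch of a KServe V2 inference protocol request.
    # HOST and MODEL_NAME are hypothetical; in a real cluster they come
    # from the InferenceService status and your ingress configuration.
    import requests

    HOST = "http://sklearn-iris.default.example.com"  # assumed ingress hostname
    MODEL_NAME = "sklearn-iris"                        # assumed model name

    # V2 protocol: inputs are named tensors with a shape, datatype, and flat data array.
    payload = {
        "inputs": [
            {
                "name": "input-0",
                "shape": [1, 4],
                "datatype": "FP32",
                "data": [6.8, 2.8, 4.8, 1.4],
            }
        ]
    }

    resp = requests.post(f"{HOST}/v2/models/{MODEL_NAME}/infer", json=payload)
    resp.raise_for_status()
    print(resp.json()["outputs"])  # list of named output tensors

Because every model server that implements this protocol exposes the same request and response shape, the same client code works regardless of which framework served the model.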

Abstract

KServe (previously known as KFServing) is a serverless, open source solution for serving machine learning models. With machine learning becoming more widely adopted in organizations, the trend is to deploy larger numbers of models. In addition, there is an increasing need to serve models using GPUs. Because GPUs are expensive, engineers are seeking ways to serve multiple models with one GPU. The KServe community designed a Multi-Model Serving solution to scale the number of models that can be served in a Kubernetes cluster. By sharing a serving container that can host multiple models, Multi-Model Serving addresses four limitations that the current ‘one model, one service’ paradigm encounters: 1) compute resources (including the cost on public clouds), 2) the maximum number of pods, 3) the maximum number of IP addresses, and 4) the maximum number of services. This talk will present the design of Multi-Model Serving, describe how to use it to serve models for different frameworks, and share benchmark stats that demonstrate its scalability.
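To make the Multi-Model Serving idea more concrete, here is a hedged sketch of registering an additional model against an existing, shared inference service using KServe's TrainedModel custom resource. The resource names, storage URI, and memory value are illustrative assumptions, not details from the talk.

    # Sketch: attach another model to a shared InferenceService via a
    # TrainedModel custom resource (serving.kserve.io/v1alpha1).
    # All names, URIs, and sizes below are assumptions for illustration.
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside a pod

    trained_model = {
        "apiVersion": "serving.kserve.io/v1alpha1",
        "kind": "TrainedModel",
        "metadata": {"name": "example-model-2", "namespace": "default"},
        "spec": {
            # Parent InferenceService whose serving container hosts many models.
            "inferenceService": "shared-sklearn",
            "model": {
                "framework": "sklearn",
                "storageUri": "gs://example-bucket/models/model-2",  # placeholder
                "memory": "256Mi",  # assumed per-model memory budget
            },
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.kserve.io",
        version="v1alpha1",
        namespace="default",
        plural="trainedmodels",
        body=trained_model,
    )

In this setup, each additional model shares the pods of the parent inference service instead of creating its own pods, services, and IP addresses, which is how the approach sidesteps the four limitations listed above.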

Materials: