The State and Future of Cloud-Native Model Serving


Authors: Dan Sun, Theofilos Papapanagiotou


KServe is a tool for deploying machine learning models, including large language models with billions of parameters. It simplifies model deployment and management, and it lets you observe and analyze model performance.
  • KServe simplifies deployment and management of machine learning models
  • It can serve large language models with billions of parameters
  • Model performance can be observed and analyzed with KServe
  • The roadmap for KServe includes support for even larger language models
KServe was used to serve a language model of almost 400 gigabytes with 176 billion parameters. Using tensor parallelism and pipeline parallelism, the model was split into smaller pieces and the workload was distributed across multiple GPUs and nodes. Challenges such as transfer cost and latency still need to be addressed.
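
The two splitting techniques named above can be illustrated with a minimal, pure-Python sketch. It is not KServe code and not how any particular model server implements parallelism; the helper names (`split_columns`, `tensor_parallel_matmul`, `split_stages`, `pipeline_forward`) are hypothetical, and each "device" is simulated by an ordinary function call. Tensor parallelism is shown as a column-wise split of one weight matrix, and pipeline parallelism as a split of consecutive layers into stages.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, num_shards: int) -> List[Matrix]:
    """Tensor parallelism: each shard holds a contiguous slice of w's columns."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    width = cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, num_shards: int) -> Matrix:
    """Each simulated device multiplies x by its own shard; results are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def split_stages(layers: List[Matrix], num_stages: int) -> List[List[Matrix]]:
    """Pipeline parallelism: group consecutive layers into at most num_stages stages."""
    per_stage = -(-len(layers) // num_stages)  # ceiling division
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]


def pipeline_forward(x: Matrix, stages: List[List[Matrix]]) -> Matrix:
    """Run the input through each stage in order; the hand-off between stages
    is where activations would be transferred between devices or nodes."""
    for stage in stages:
        for w in stage:
            x = matmul(x, w)
    return x


if __name__ == "__main__":
    x = [[1, 2]]
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    # The sharded product matches the unsharded one.
    print(tensor_parallel_matmul(x, w, num_shards=2) == matmul(x, w))
```

In this sketch, every hand-off between shards or stages stands in for a network or interconnect transfer, which is where the transfer cost and latency mentioned above would arise.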


KServe is a cloud-native open source project for serving production ML models built on CNCF projects like Knative and Istio. In this talk, we’ll update you on KServe’s progress towards 1.0, the latest developments, such as ModelMesh and InferenceGraph, and its future roadmap. We’ll discuss the Kubernetes design patterns used in KServe to achieve the core ML inference capability, as well as the design philosophy behind KServe and how it integrates the CNCF ecosystem so you can walk up and down the stack to use features to meet your production model deployment requirements. The well-designed InferenceService interface encapsulates the complexity of networking, lifecycle, server configurations and allows you to easily add serverless capabilities to model servers like TensorFlow Serving, TorchServe, and Triton on CPU/GPU. You can also turn on full service mesh mode to secure your InferenceServices. We’ll walk through different scenarios to show how you can quickly start with KServe and evolve to a production-ready setup with scalability, security, observability, and auto-scaling acceleration using CNCF projects like Knative, Istio, SPIFFE/SPIRE, OpenTelemetry, and Fluid.
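
As a rough illustration of the InferenceService interface described above, the following sketch builds a minimal manifest as a Python dictionary and prints it as JSON. The field layout follows the `serving.kserve.io/v1beta1` API; the helper `inference_service_manifest`, the service name, and the storage URI are hypothetical placeholders chosen for this example, not values from the talk.

```python
import json
from typing import Any, Dict


def inference_service_manifest(
    name: str,
    model_format: str,
    storage_uri: str,
    namespace: str = "default",
) -> Dict[str, Any]:
    """Build a minimal InferenceService manifest (hypothetical helper).

    Only the predictor is specified; networking, lifecycle, and server
    configuration are left to KServe's defaults.
    """
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "model": {
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                }
            }
        },
    }


# Placeholder name and bucket, assumed for illustration only.
manifest = inference_service_manifest(
    name="example-model",
    model_format="sklearn",
    storage_uri="gs://example-bucket/models/example-model",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, assuming KServe is installed on the cluster and the storage location is reachable.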