
Improving GPU Utilization using Kubernetes

2022-05-18

Authors:   Maulin Patel, Pradeep Venkatachalam


Summary

The presentation discusses the challenges of sharing GPUs in Kubernetes and introduces two solutions: time-sharing and multi-instance GPU (MIG).
  • Notebooks attached to GPUs waste expensive resources when idle
  • Real-time applications such as chatbots, vision product search, and product recommendation are latency-sensitive and business-critical
  • Kubernetes allows fractional requests for CPUs but not for GPUs, leading to inefficient allocation
  • Time sharing allows multiple containers to run on a single GPU by allocating time slices fairly to all containers
  • Multi-instance GPU (MIG) allows multiple containers to share a single GPU by partitioning it into multiple isolated GPU instances, each with dedicated compute and memory (see the sketch after this list)
  • Together, the two approaches cover most use cases and workload needs
  • Both solutions are fully managed by GKE and can be configured through API calls or the console UI
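
For illustration, here is a minimal Python sketch of a pod manifest that targets a MIG partition on a GKE node. The label key (cloud.google.com/gke-gpu-partition-size), the 1g.5gb partition size, and the container image are assumptions for the sketch, not details taken from the talk.

    # Minimal sketch: build a pod manifest that schedules onto a GKE node
    # whose GPU has been partitioned with multi-instance GPU (MIG).
    # The label key, partition size, and image below are assumptions, not from the talk.
    import yaml  # PyYAML

    mig_pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "mig-inference"},
        "spec": {
            "nodeSelector": {
                # Assumed GKE label selecting nodes partitioned into 1g.5gb instances.
                "cloud.google.com/gke-gpu-partition-size": "1g.5gb",
            },
            "containers": [{
                "name": "inference",
                "image": "us-docker.pkg.dev/my-project/inference:latest",  # hypothetical image
                "resources": {
                    # The container still asks for a whole "GPU" resource, but the
                    # device plugin hands out one MIG instance rather than the full GPU.
                    "limits": {"nvidia.com/gpu": 1},
                },
            }],
        },
    }

    print(yaml.safe_dump(mig_pod, sort_keys=False))
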
The presenter explains that in time-sharing, a container that requests two slices receives two slices of GPU time; more generally, requesting more than one slice yields a proportionately larger share of the GPU, which can help fit heterogeneous workloads with different memory requirements onto the same GPU.
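
As a companion sketch, the manifest below requests two slices of a time-shared GPU, matching the proportional-share behavior described above. The node-selector label keys, the maximum of four shared clients, and the mapping of "slices" onto the nvidia.com/gpu count are assumptions for illustration, not details confirmed in the talk.

    # Minimal sketch: a container asking for two time slices of a shared GPU.
    # Label keys and the slice-to-resource mapping are assumptions, not from the talk.
    import yaml  # PyYAML

    timeshared_pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "timeshared-notebook"},
        "spec": {
            "nodeSelector": {
                # Assumed GKE labels: the node's GPU is time-shared among up to 4 clients.
                "cloud.google.com/gke-gpu-sharing-strategy": "time-sharing",
                "cloud.google.com/gke-max-shared-clients-per-gpu": "4",
            },
            "containers": [{
                "name": "notebook",
                "image": "jupyter/base-notebook:latest",
                "resources": {
                    # Requesting 2 of the 4 slices would yield roughly half of the
                    # GPU's time, per the proportional-share behavior described above.
                    "limits": {"nvidia.com/gpu": 2},
                },
            }],
        },
    }

    print(yaml.safe_dump(timeshared_pod, sort_keys=False))
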

Abstract

Kubernetes supports efficient utilization of resources by enabling applications to request the precise amounts of resources they need. Unlike fractional requests for CPUs, fractional requests for GPUs are not allowed in Kubernetes: GPU resources requested in the pod manifest must be an integer. This means one GPU is fully allocated to one container even if the container needs only a fraction of the GPU for its workload. Without support for fractional GPUs, GPU resources are invariably over-provisioned, leading to waste. This is especially true for inference workloads that process a handful of data samples in real time. To address this limitation, we have developed user-friendly solutions that allow a single GPU to be shared by multiple containers, thereby improving GPU utilization and saving cost. In this talk, we will show demos of our solutions and share performance results.
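
To make the integer-only constraint concrete, the sketch below contrasts a fractional CPU request, which Kubernetes accepts, with a GPU request, which must be a whole number; the values shown are placeholders for illustration.

    # Sketch: CPU can be requested fractionally, GPUs only as whole units.
    # Values are placeholders, not taken from the talk.
    container_resources = {
        "requests": {
            "cpu": "500m",        # fractional CPU: half a core is a valid request
            "memory": "1Gi",
        },
        "limits": {
            "nvidia.com/gpu": 1,  # GPUs must be integers; "0.5" would be rejected,
                                  # so a lightly loaded container still occupies
                                  # an entire GPU unless it is shared.
        },
    }

    print(container_resources)
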
