The presentation discusses the challenges of sharing GPUs in Kubernetes and introduces two solutions: time sharing and multi-instance GPU.
- Notebooks with GPUs attached sit idle much of the time, wasting an expensive resource
- Real-time applications such as chatbots, vision product search, and product recommendation are latency-sensitive and business-critical
- Kubernetes lets containers request fractional CPUs (for example, 500m), but GPUs can only be requested in whole units, which leads to inefficient allocation
- Time sharing lets multiple containers run on a single GPU by giving each container a fair time slice of the GPU (see the first sketch after this list)
- Multi-instance GPU (MIG) lets multiple containers share a single GPU by partitioning it into several smaller GPU instances that are scheduled as if they were separate GPUs (see the second sketch after this list)
- Both solutions address most use cases and workload needs
- Both are fully managed by GKE and can be configured through API calls or the UI
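To make the time-sharing path concrete, here is a minimal sketch, using the official Kubernetes Python client, of a pod that asks for one slice of a time-shared GPU. The node labels (`cloud.google.com/gke-gpu-sharing-strategy`, `cloud.google.com/gke-max-shared-clients-per-gpu`), the assumption that the node pool was created with the time-sharing strategy and two clients per GPU, and the container image are illustrative and should be checked against the current GKE documentation.

```python
# Sketch: request one slice of a time-shared GPU on a GKE node pool.
# Assumes a node pool created with gpu-sharing-strategy=time-sharing and
# max-shared-clients-per-gpu=2 (label names may differ by GKE version).
from kubernetes import client, config

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="timeshared-inference"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Steer the pod onto a node whose GPU is shared by up to 2 clients.
        node_selector={
            "cloud.google.com/gke-gpu-sharing-strategy": "time-sharing",
            "cloud.google.com/gke-max-shared-clients-per-gpu": "2",
        },
        containers=[
            client.V1Container(
                name="inference",
                image="nvidia/cuda:12.2.0-base-ubuntu22.04",
                command=["sleep", "infinity"],  # placeholder workload
                # The container still requests a whole nvidia.com/gpu unit;
                # on a time-shared node this maps to a time slice of the GPU.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

if __name__ == "__main__":
    config.load_kube_config()  # needs a kubeconfig pointing at the cluster
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```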
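A similar hedged sketch for multi-instance GPU: the pod below targets a node whose GPU has been partitioned and claims one partition. The `cloud.google.com/gke-gpu-partition-size` label and the `1g.5gb` size are assumptions about how GKE exposes MIG partitions on A100-class nodes; the resource request itself is still expressed as `nvidia.com/gpu: 1`.

```python
# Sketch: land a pod on a MIG-enabled GKE node and claim one 1g.5gb instance.
from kubernetes import client

mig_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="mig-batch-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Node pools created with a GPU partition size carry this label
        # (assumed label name; verify against the GKE docs).
        node_selector={"cloud.google.com/gke-gpu-partition-size": "1g.5gb"},
        containers=[
            client.V1Container(
                name="worker",
                image="nvidia/cuda:12.2.0-base-ubuntu22.04",
                command=["nvidia-smi", "-L"],  # should list a single MIG device
                # The request is still nvidia.com/gpu: 1, but it is satisfied
                # by one isolated MIG instance rather than the whole GPU.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)
```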
The presenter explains that in time sharing, a user who requests two slices for their container receives two slices and can run the container across both; in general, requesting more than one slice yields a proportionately larger share of GPU time, which can be useful for fitting heterogeneous workloads with different memory requirements onto the same GPU.
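As a back-of-the-envelope illustration of the proportional model described above (a toy calculation, not GKE's actual scheduler), the snippet below computes the fraction of GPU time each container would receive if slices were granted in proportion to what it requests.

```python
# Toy model of proportional time sharing: each container's share of GPU time
# is its requested slice count divided by the total slices requested.
def time_share(requests: dict[str, int]) -> dict[str, float]:
    total = sum(requests.values())
    return {name: slices / total for name, slices in requests.items()}

# Example: a container that asks for two slices gets twice the GPU time of
# one that asks for a single slice.
print(time_share({"notebook": 1, "inference": 2, "batch": 1}))
# {'notebook': 0.25, 'inference': 0.5, 'batch': 0.25}
```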