Capacity Scheduling for Elastic Resource Sharing in Kubernetes

2021-10-13

Authors: Yuan Chen, Alex Wang


Summary

The presentation covers the elastic quota and job queue components of the Kubernetes scheduler and their compatibility with various workload management systems.
  • Both components are integrated into the Kubernetes scheduler and, according to the presenter, have been extensively tested.
  • They are compatible with various workload management systems and can be configured to meet specific needs.
  • The goal is to make them production-ready and widely adopted.
  • They can schedule multiple jobs at the same time while ensuring that resource limits are not exceeded.
  • The presentation also discusses the possibility of using them for nomad-style scheduling and SLA-driven scheduling.
Among the early adopters named by the presenter, Alibaba already runs the elastic quota and job queue components in production for its cloud services, and Apple is actively investigating how to use them to support its Kubernetes infrastructure. The presenter also mentions that Baidu is looking at using them to run its self-driving and AI-based simulation workloads.

Abstract

Kubernetes manages resource capacity across multiple tenants, users, and namespaces by allocating a fixed resource quota to each namespace. It lacks sufficient support for dynamic resource sharing within and across teams and organizations, which can result in low cluster utilization. This has become a roadblock to migrating applications from other cluster management platforms (e.g., YARN) to Kubernetes. Qingcan Wang from Alibaba and Yuan Chen from Apple will present their collaborative work on a Kubernetes enhancement to address the issue. Capacity scheduling offers a feature similar to YARN’s capacity scheduler and enables elastic resource sharing to improve cluster utilization in Kubernetes. It supports hierarchical resource groups with guaranteed and maximum resources for dynamic sharing of resources, from CPU, memory, and disk to extended resources such as GPUs. It integrates seamlessly into Kubernetes as plugins and has been used in large-scale production clusters such as Alibaba Cloud.
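
To make the guaranteed/maximum idea concrete, here is a minimal Python sketch of how an admission decision under elastic quotas might look. It is an illustration under stated assumptions, not the plugin's actual implementation: the names ResourceGroup and decide are hypothetical, the groups are flat rather than hierarchical, only a single resource (CPU cores) is tracked, and reclaiming borrowed capacity is reported as a decision rather than carried out.

```python
from dataclasses import dataclass


@dataclass
class ResourceGroup:
    """Hypothetical flat resource group; all quantities are CPU cores."""
    name: str
    guaranteed: int  # share the group can always claim
    maximum: int     # hard upper bound, even when the cluster is idle
    used: int = 0


def decide(group, request, groups, cluster_capacity):
    """Return 'admit', 'admit_by_reclaim', or 'reject' for a request.

    Assumed policy (a simplification for illustration):
      1. A group may never exceed its own maximum.
      2. If the cluster has enough idle capacity, the request is admitted,
         even if that means borrowing beyond the group's guarantee.
      3. If the cluster is full, only a request that stays within the
         group's guarantee is admitted, by reclaiming borrowed capacity.
    """
    if group.used + request > group.maximum:
        return "reject"
    free = cluster_capacity - sum(g.used for g in groups)
    if request <= free:
        return "admit"
    if group.used + request <= group.guaranteed:
        return "admit_by_reclaim"
    return "reject"


team_a = ResourceGroup("team-a", guaranteed=4, maximum=8)
team_b = ResourceGroup("team-b", guaranteed=6, maximum=10)
groups = [team_a, team_b]

# team-a borrows 2 idle cores beyond its guarantee of 4.
print(decide(team_a, 6, groups, cluster_capacity=10))  # admit
team_a.used = 6

# team-b asks for its guaranteed 6 cores; only 4 are idle, so the
# borrowed cores would have to be reclaimed from team-a.
print(decide(team_b, 6, groups, cluster_capacity=10))  # admit_by_reclaim
```

With a fixed per-namespace quota, team-a in this example would be capped at 4 cores even while the rest of the cluster sat idle; the guaranteed/maximum pair is what lets it use the spare capacity without permanently taking it from team-b.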

Materials: