Device Plugins 2.0: How to Build a Driver for Dynamic Resource Allocation

2023-04-19

Authors: Kevin Klues, Alexey Fomenko


Summary

Overview of building a DRA resource driver for Kubernetes
  • A DRA resource driver consists of a centralized controller and a node-local plugin
  • Communication between the two components can be done through a single all-purpose CRD
  • The controller makes allocation decisions and the plugin advertises available resources
  • The driver needs to define a name, communication strategy, resource types, class parameters, and API access
  • Helper libraries are available to make implementation easier
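The split of responsibilities described above can be sketched in a few lines of Go. This is only an in-memory illustration, not the API of any real driver: `NodeAllocationState`, `Advertise`, and `Allocate` are hypothetical names, and the struct stands in for the single all-purpose CRD that a real driver would read and write through the Kubernetes API server. The first-free-device policy is likewise an assumption made to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeAllocationState is a hypothetical stand-in for the single all-purpose
// CRD: one object per node, written to by both driver components.
type NodeAllocationState struct {
	NodeName    string
	Allocatable []string          // devices advertised by the node-local plugin
	Allocated   map[string]string // claim name -> device, set by the controller
}

// ErrNoDevice is returned when every advertised device is already allocated.
var ErrNoDevice = errors.New("no free device on node")

// Advertise models the plugin side: publish the devices found on the node.
func Advertise(node string, devices []string) *NodeAllocationState {
	return &NodeAllocationState{
		NodeName:    node,
		Allocatable: append([]string(nil), devices...), // defensive copy
		Allocated:   map[string]string{},
	}
}

// Allocate models the controller side: record the first free device for a
// claim. Calling it again for the same claim returns the same device.
func Allocate(state *NodeAllocationState, claim string) (string, error) {
	if dev, ok := state.Allocated[claim]; ok {
		return dev, nil
	}
	inUse := map[string]bool{}
	for _, dev := range state.Allocated {
		inUse[dev] = true
	}
	for _, dev := range state.Allocatable {
		if !inUse[dev] {
			state.Allocated[claim] = dev
			return dev, nil
		}
	}
	return "", ErrNoDevice
}

func main() {
	state := Advertise("node-a", []string{"gpu-0", "gpu-1"})
	for _, claim := range []string{"claim-1", "claim-2", "claim-3"} {
		dev, err := Allocate(state, claim)
		if err != nil {
			fmt.Printf("%s: %v\n", claim, err)
			continue
		}
		fmt.Printf("%s -> %s\n", claim, dev)
	}
}
```

Keeping both the advertised and the allocated devices in one object mirrors the "single all-purpose CRD" idea: the plugin and the controller never talk to each other directly, they only read and update shared state.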
Abstract

Dynamic Resource Allocation (DRA) is a new Kubernetes feature that puts resource scheduling in the hands of third-party developers. From an end user's perspective, it moves away from the limited "countable" interface for requesting access to resources (e.g. "nvidia.com/gpu: 2"), providing an API more akin to that of persistent volumes. Using GPUs as an example, DRA unlocks a host of new features without the need for awkward solutions shoehorned on top of the existing device plugin API. These features include:
  • Controlled GPU sharing (both within a pod and across pods)
  • Multiple GPU models per node (e.g. T4 and A100)
  • Specifying arbitrary constraints for a GPU (min/max memory, device model, etc.)
  • Dynamic allocation of MIG devices
  • Dynamic repurposing of a GPU from full to MIG mode
  • Dynamic repurposing of a GPU for use as passthrough vs. vGPU
  • ... the list goes on ...

In this talk, you will learn how to build your own resource driver for DRA. This includes details of how to use Kubernetes's in-tree helper libraries for DRA, where to find an example driver to get you started, and best practices for architecting the driver itself. Throughout the talk, we will use our existing NVIDIA and Intel GPU drivers as a guide, concluding with a demo of these drivers in action.
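To make the contrast with a countable request such as "nvidia.com/gpu: 2" concrete, the following Go sketch shows what a constraint-based request might look like inside a driver. All names here (`GPU`, `ClaimParams`, `Select`) are hypothetical and are not taken from the NVIDIA or Intel drivers. The sketch assumes a claim carries an optional device model and a memory range, with zero values meaning "unconstrained".

```go
package main

import "fmt"

// GPU is a hypothetical description of one device on a node.
type GPU struct {
	Name      string
	Model     string
	MemoryGiB int
}

// ClaimParams is a hypothetical set of per-claim constraints.
// A zero value in any field means that field is unconstrained.
type ClaimParams struct {
	Model        string
	MinMemoryGiB int
	MaxMemoryGiB int
}

// Matches reports whether a GPU satisfies every constraint that is set.
func (p ClaimParams) Matches(g GPU) bool {
	if p.Model != "" && g.Model != p.Model {
		return false
	}
	if p.MinMemoryGiB != 0 && g.MemoryGiB < p.MinMemoryGiB {
		return false
	}
	if p.MaxMemoryGiB != 0 && g.MemoryGiB > p.MaxMemoryGiB {
		return false
	}
	return true
}

// Select returns the GPUs that satisfy the claim, in their original order.
func Select(gpus []GPU, p ClaimParams) []GPU {
	var out []GPU
	for _, g := range gpus {
		if p.Matches(g) {
			out = append(out, g)
		}
	}
	return out
}

func main() {
	// One node with two different GPU models, as in the T4/A100 example.
	node := []GPU{
		{Name: "gpu-0", Model: "T4", MemoryGiB: 16},
		{Name: "gpu-1", Model: "A100", MemoryGiB: 40},
	}
	for _, g := range Select(node, ClaimParams{MinMemoryGiB: 32}) {
		fmt.Println(g.Name, g.Model)
	}
}
```

A plain count cannot express "at least 32 GiB" or "an A100 specifically"; a structured parameter object of this kind can, which is the sort of flexibility the abstract attributes to DRA.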

Materials: