A Deep Dive on Supporting Multi-Instance GPUs in Containers and Kubernetes

Conference: KubeCon + CloudNativeCon Europe 2021

Authors: Kevin Klues

Summary

The presentation discusses the creation and configuration of MIG devices on NVIDIA GPUs for use in Kubernetes clusters.

NVIDIA GPUs can be partitioned into memory slices and compute slices to create GPU instances.
GPU instances can be further partitioned into compute instances, which share access to the memory of the wrapping GPU instance.
A MIG device pairs a GPU instance with one of its compute instances; CUDA recognizes MIG devices and runs workloads on them.
Only specific combinations of memory and compute slices are valid for creating MIG devices.
MIG devices can be configured on Kubernetes clusters to provide access to GPU resources for containerized workloads.
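
The hierarchy described above can be sketched as a small data model. This is an illustration only, not the NVIDIA API: the class names are hypothetical, and the profile table assumes an A100-40GB (7 compute slices, 8 memory slices of 5 GB each), a detail the summary itself does not state. The sketch also assumes a MIG device is a GPU instance paired with one compute instance.

```python
from dataclasses import dataclass

# Assumption: A100-40GB, where each memory slice holds 5 GB.
GB_PER_MEMORY_SLICE = 5

# Assumed GPU instance profiles: name -> (compute slices, memory slices).
GPU_INSTANCE_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}


@dataclass(frozen=True)
class GpuInstance:
    """A fixed bundle of compute slices and memory slices (hypothetical class)."""
    profile: str

    @property
    def compute_slices(self) -> int:
        return GPU_INSTANCE_PROFILES[self.profile][0]

    @property
    def memory_gb(self) -> int:
        return GPU_INSTANCE_PROFILES[self.profile][1] * GB_PER_MEMORY_SLICE


@dataclass(frozen=True)
class ComputeInstance:
    """A subset of the parent's compute slices; memory is shared with the parent."""
    parent: GpuInstance
    compute_slices: int

    def __post_init__(self):
        # Simplification: any size up to the parent's slice count is accepted here;
        # real hardware allows only specific sizes.
        if not 1 <= self.compute_slices <= self.parent.compute_slices:
            raise ValueError("compute instance does not fit in its GPU instance")


@dataclass(frozen=True)
class MigDevice:
    """What CUDA sees: one compute instance inside one GPU instance."""
    compute_instance: ComputeInstance

    @property
    def name(self) -> str:
        gi = self.compute_instance.parent
        ci = self.compute_instance
        if ci.compute_slices == gi.compute_slices:
            return gi.profile
        # Smaller compute instances get a "<N>c." prefix, e.g. "1c.3g.20gb".
        return f"{ci.compute_slices}c.{gi.profile}"
```

For example, a full-width compute instance inside a `3g.20gb` GPU instance yields a device named `3g.20gb`, while a one-slice compute instance in the same GPU instance yields `1c.3g.20gb` and still sees the full 20 GB of the parent.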

The speaker explains that a valid MIG configuration can be built by walking from left to right across a diagram of the GPU's physical layout and adding devices so that no two devices overlap vertically. Striping a single device type across all of the GPUs on a machine is the most common configuration, but a single node can also hold a mix of different device types.
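
The "no overlap" rule can be sketched as a small placement search. This is a minimal illustration under stated assumptions, not the tooling shown in the talk: the function name `place_devices` is hypothetical, and the placement table assumes an A100-40GB modeled as 8 slots, with each profile limited to the start positions listed. Real hardware may impose further constraints.

```python
# Assumed placement table: profile -> (allowed start slots, width in slots).
PLACEMENTS = {
    "1g.5gb": ((0, 1, 2, 3, 4, 5, 6), 1),
    "2g.10gb": ((0, 2, 4), 2),
    "3g.20gb": ((0, 4), 4),
    "4g.20gb": ((0,), 4),
    "7g.40gb": ((0,), 8),
}


def place_devices(profiles, occupied=frozenset()):
    """Return [(profile, start_slot), ...] with no overlapping slots, or None.

    Tries each allowed start position from left to right and backtracks
    when a later device cannot be placed.
    """
    if not profiles:
        return []
    first, rest = profiles[0], profiles[1:]
    starts, width = PLACEMENTS[first]
    for start in starts:
        span = frozenset(range(start, start + width))
        if span & occupied:
            continue  # this position would overlap an already placed device
        tail = place_devices(rest, occupied | span)
        if tail is not None:
            return [(first, start)] + tail
    return None  # no valid placement for this combination
```

Under these assumptions, `place_devices(["3g.20gb", "2g.10gb", "1g.5gb"])` finds a mixed layout on one GPU, while `place_devices(["3g.20gb", "3g.20gb", "1g.5gb"])` returns `None` because the two larger devices already cover every slot. The backtracking matters: a purely greedy walk would reject `["3g.20gb", "2g.10gb", "2g.10gb"]`, which fits once the `3g.20gb` device moves to the right half.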

Abstract

MIG (short for Multi-Instance GPU) is a mode of operation in the newest generation of NVIDIA Ampere GPUs. It allows one to partition a GPU into a set of "MIG Devices", each of which appears to the software consuming it as a mini-GPU, with a fixed partition of memory and compute resources. In this talk, we take a deep dive into the details of how we built support for MIG in containers and Kubernetes. You will learn how MIG is made available to containers, what challenges we faced building MIG support for Kubernetes, and how you can use it today. Everything we built is 100% open-source and part of the NVIDIA container toolkit stack and NVIDIA k8s-device-plugin. This talk will conclude with a discussion on best practices around how to distribute MIG devices throughout a Kubernetes cluster, including how to handle the lifecycle of MIG devices on a node.
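
As a rough idea of what consuming a MIG device from a pod can look like, the sketch below builds a pod manifest as a plain Python dictionary. It assumes the NVIDIA k8s-device-plugin is running with its "mixed" MIG strategy, which advertises resources named like `nvidia.com/mig-1g.5gb`; the helper name `mig_pod_manifest` and the image name in the usage note are hypothetical placeholders, and none of this is taken from the talk's slides.

```python
def mig_pod_manifest(name, image, mig_profile, count=1):
    """Build a pod manifest requesting `count` MIG devices of one profile.

    Assumption: the device plugin exposes MIG devices under resource names
    of the form "nvidia.com/mig-<profile>" (the "mixed" MIG strategy).
    """
    resource = f"nvidia.com/mig-{mig_profile}"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,  # placeholder; any CUDA-capable image
                    # List the GPUs and MIG devices visible inside the container.
                    "command": ["nvidia-smi", "-L"],
                    "resources": {"limits": {resource: count}},
                }
            ],
        },
    }
```

Calling `mig_pod_manifest("mig-demo", "example.com/cuda-workload:latest", "1g.5gb")` and serializing the result with `json.dumps` gives a document that `kubectl apply -f -` accepts, since kubectl reads JSON as well as YAML.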

Materials:

Slides
