Machine Learning Using Various GPU Technology With Kubeflow.

2022-10-28

Authors: Jihye Choi


Summary

The conference presentation discusses two technologies, MIG (Multi-Instance GPU) and GPUDirect RDMA, for using GPU resources efficiently in AI and HPC workloads. MIG splits a single GPU into multiple instances, while GPUDirect RDMA enables efficient distributed processing. The presentation includes a POC result for each technology and highlights points to consider when testing on Kubernetes.
  • MIG makes efficient use of GPU resources by splitting a single GPU into multiple instances
  • GPUDirect RDMA enables efficient distributed processing for deep learning tasks
  • POC results show that MIG is suitable for model development and inference tasks, while GPUDirect RDMA is suitable for larger-scale tasks
  • Points to consider when testing on Kubernetes are discussed in the presentation
The presentation includes an anecdote about testing distributed training with MIG. The team found that distributed training is visible with MIG devices, but the task is only executable when one device is allocated for each part.

Abstract

Anyone who works in MLOps knows that cost and GPU capacity are limited and therefore critical. Kubeflow is a great open-source project, but it offers few features for efficient distributed training through tight coupling with GPUs, or for maximizing GPU utilization.
1. A small model uses only a fraction of a GPU, so dedicating an entire GPU to it wastes resources. Multi-Instance GPU (MIG), available on the NVIDIA A100, splits one GPU into up to 7 instances, and this presentation shows how to combine this technology with Kubeflow.
2. As model size increases, distributed training across multiple GPU servers becomes necessary for efficiency. GPUDirect RDMA is a high-performance networking technology that transfers data directly to and from GPU memory without CPU or system memory intervention.
The presentation shares tried-and-true experience with both technologies for improving GPU utilization and performance in Kubeflow.
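
To make the MIG point concrete, here is a minimal sketch of how a workload might request a single MIG slice on Kubernetes. It is not taken from the presentation. It assumes the cluster runs the NVIDIA device plugin with the "mixed" MIG strategy, under which slices of an A100 40GB are advertised with resource names such as nvidia.com/mig-1g.5gb. The function name build_mig_pod, the pod name, and the container image are hypothetical.

```python
import json

# Assumption: the NVIDIA device plugin uses the "mixed" MIG strategy, so each
# MIG profile is exposed as its own extended resource (for example
# nvidia.com/mig-1g.5gb on an A100 40GB). Other clusters may use other names.
MIG_RESOURCE = "nvidia.com/mig-1g.5gb"


def build_mig_pod(name: str, image: str, mig_resource: str = MIG_RESOURCE) -> dict:
    """Return a Pod manifest (as a dict) whose container requests one MIG slice."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    # Request exactly one MIG instance for this container.
                    "resources": {"limits": {mig_resource: 1}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, used only for illustration.
    pod = build_mig_pod("mig-demo", "example.com/hypothetical/trainer:latest")
    print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and submitted with kubectl apply -f. Requesting one slice per container matches the observation in the summary that a task runs only when one MIG device is allocated to each part.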

Materials: