The conference presentation discusses two technologies, MIG (Multi-Instance GPU) and GPUDirect RDMA, for efficient use of GPU resources in AI and HPC workloads. MIG splits a single physical GPU into multiple isolated instances, while GPUDirect RDMA enables efficient distributed processing by letting network adapters read and write GPU memory directly. The presentation includes a PoC result for each technology and highlights points to consider when testing on Kubernetes.
- MIG makes efficient use of GPU resources by splitting a single GPU into multiple isolated instances
- GPUDirect RDMA enables efficient distributed processing for deep learning tasks
- PoC results show that MIG suits model development and inference, while GPUDirect RDMA suits larger-scale training
- The presentation discusses points to consider when testing on Kubernetes
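As one illustration of the Kubernetes considerations, a pod can request a MIG slice through the NVIDIA device plugin's extended resource names. This is a hedged sketch, not taken from the presentation: the `1g.5gb` profile and the container image are example values, and the available profiles depend on the GPU model and how MIG was configured on the node.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3   # example image, swap in your own
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # request one 1g.5gb MIG instance
```

With the device plugin's "single" or "mixed" MIG strategy, the scheduler then places the pod only on nodes advertising that MIG profile.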
The presentation includes an anecdote about testing distributed training with MIG. The team found that distributed training across MIG devices is feasible, but the job only runs when exactly one MIG device is allocated to each worker process.
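The one-device-per-worker constraint above can be sketched as a small launcher helper: each worker process is pinned to a single MIG instance by setting `CUDA_VISIBLE_DEVICES` to that instance's UUID before the training framework initializes. This is an illustrative sketch, not the presenters' code; the helper name is hypothetical and the UUIDs below are placeholders for values that would come from `nvidia-smi -L`.

```python
import os


def pin_worker_to_mig(rank: int, mig_uuids: list[str]) -> dict[str, str]:
    """Return the environment for one worker, pinned to a single MIG instance.

    Distributed training over MIG only works when each worker sees exactly
    one MIG device, so we map rank -> one UUID instead of exposing all of them.
    """
    if rank >= len(mig_uuids):
        raise ValueError(f"rank {rank} has no MIG instance to run on")
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuids[rank]  # exactly one device per worker
    return env


# Placeholder UUIDs for illustration; real ones come from `nvidia-smi -L`.
MIG_UUIDS = [
    "MIG-11111111-2222-3333-4444-555555555555",
    "MIG-66666666-7777-8888-9999-000000000000",
]

if __name__ == "__main__":
    for rank in range(len(MIG_UUIDS)):
        env = pin_worker_to_mig(rank, MIG_UUIDS)
        print(rank, env["CUDA_VISIBLE_DEVICES"])
```

A launcher would spawn one training process per rank with its pinned environment, so that each process initializes CUDA against its own MIG instance.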