Graphics Processing Units (GPUs) are widely used to train and serve artificial intelligence and deep learning models, particularly in ML inference use cases. However, using GPUs to deploy models at scale can create several challenges for ML practitioners. In this session, Varun Mohan, CEO and Co-Founder of Exafunction, shared the best practices he has learned for building an architecture that optimizes GPUs for deep learning workloads. Mohan explained the advantages of using GPUs for ML deployment, as well as the situations where they offer fewer benefits, and weighed cost, memory, and other factors in the GPU-versus-CPU equation. He also covered inefficiencies that can arise in different scenarios, including issues related to network bandwidth and egress, and offered techniques to address these problems, such as batching workloads and optimizing your models (a sketch of batching follows below). Finally, he described how some companies use GPUs to run their recommendation and serving systems. Before Exafunction, Mohan was a technical lead and senior manager at Nuro, where he saw both the power of deep learning and the significant challenges of productionizing it at scale.
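To illustrate the batching idea mentioned above, here is a minimal sketch (not from the session) of dynamic request batching for GPU inference, assuming a PyTorch model; the names `batching_loop`, `MAX_BATCH`, and `MAX_WAIT_S` are illustrative, not from the talk. The point is that combining many small requests into one forward pass keeps the GPU busy instead of paying per-request overhead.

```python
# Illustrative sketch of dynamic batching for GPU inference.
# Assumptions: a PyTorch model, and callers that enqueue
# (input_tensor, result_queue) pairs onto a shared queue.
import queue
import time

import torch

MAX_BATCH = 32      # largest batch the GPU processes at once (assumed value)
MAX_WAIT_S = 0.005  # how long to wait for more requests (assumed value)


def batching_loop(model: torch.nn.Module, requests: queue.Queue) -> None:
    """Collect individual requests into one batch, then run a single
    forward pass so the GPU stays well utilized."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    while True:
        # Block until at least one request arrives.
        first_input, first_result = requests.get()
        inputs, results = [first_input], [first_result]
        deadline = time.monotonic() + MAX_WAIT_S
        # Gather more requests until the batch is full or the wait expires.
        while len(inputs) < MAX_BATCH and time.monotonic() < deadline:
            try:
                inp, res = requests.get(
                    timeout=max(0.0, deadline - time.monotonic())
                )
                inputs.append(inp)
                results.append(res)
            except queue.Empty:
                break
        # One batched forward pass instead of len(inputs) separate ones.
        with torch.no_grad():
            outputs = model(torch.stack(inputs).to(device))
        # Hand each caller its own slice of the batched output.
        for res, out in zip(results, outputs):
            res.put(out.cpu())
```

In practice the batch size and wait time trade throughput against latency: larger batches use the GPU more efficiently, while a shorter wait keeps individual requests responsive.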