Enabling HPC and ML Workloads with the Latest Kubernetes Job Features


Authors:   Michał Woźniak, Vanessa Sochat


The presentation discusses the Flux Operator and the Pod Failure Policy as solutions for managing batch workloads in Kubernetes.
  • The Flux Operator manages batch workloads in Kubernetes by combining a resource manager with a headless Service that gives pods fully qualified domain names.
  • Specialized logic can be used to generate the required configuration, and can be run via an entry point or in an isolated pod.
  • The Pod Failure Policy is a recent feature of the Job controller: a list of rules that specify actions for handling failed pods based on container exit codes and pod conditions.
  • The presentation also mentions ongoing work on new features such as elastic Indexed Jobs and JobSet.
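The pod-to-pod communication pattern the summary describes relies on Indexed Jobs paired with a headless Service. A minimal sketch (names such as `workers` and `workers-svc`, the image, and the command are placeholders): each pod receives a stable hostname derived from its completion index, reachable as `workers-0.workers-svc.<namespace>.svc.cluster.local`.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: workers
spec:
  completions: 4
  parallelism: 4
  completionMode: Indexed      # pods get stable hostnames workers-0 … workers-3
  template:
    spec:
      subdomain: workers-svc   # must match the headless Service name below
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36
        command: ["sh", "-c", "echo running as index $JOB_COMPLETION_INDEX"]
---
apiVersion: v1
kind: Service
metadata:
  name: workers-svc
spec:
  clusterIP: None              # headless: DNS records only, no virtual IP
  selector:
    job-name: workers          # label the Job controller adds to its pods
```

Each container can also read its own index from the `JOB_COMPLETION_INDEX` environment variable, which is how rank assignment typically works in distributed training setups.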
The Pod Failure Policy was tested by Rescale, who faced a similar problem with the inflexibility of setting the backoff limit. Using the Pod Failure Policy, they were able to avoid unnecessary job failures while keeping execution time much shorter.
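The distinction drawn above (retry on disruptions, fail fast on software bugs) maps directly onto Pod Failure Policy rules. A minimal sketch, assuming a hypothetical job whose application exits with code 42 on a non-retriable bug (the job name, image, and exit code are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob           # software bug: stop retrying immediately
      onExitCodes:
        operator: In
        values: [42]
    - action: Ignore            # disruptions (node drain, preemption) do not
      onPodConditions:          # count against the backoff limit
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never     # required when podFailurePolicy is set
      containers:
      - name: main
        image: my-training-image
```

Rules are evaluated in order, so the first matching rule decides whether a failure consumes a retry, fails the job outright, or is ignored.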


In this talk, we present the new features in the Kubernetes Job API and how they can be used to meet the challenges of running distributed Batch/AI/HPC workloads at scale, based on real-world experiences from DeepMind and the Flux Operator from Lawrence Livermore National Laboratory. We showcase the Indexed Jobs feature by presenting its production use. First, we demonstrate how it simplifies running parallel workloads which require pod-to-pod communication, including distributed machine learning examples based on its use by DeepMind. Next, we demonstrate the orchestration of HPC workloads using the Flux Operator. Here, we create a "Mini Cluster" within Kubernetes built on top of an indexed job, providing a rich ecosystem for orchestration of batch workloads, related user interfaces, and APIs. We also discuss the challenge of handling pod failures for long-running workloads. We show how Pod Failure Policy can be used to continue job execution despite numerous pod disruptions (caused by events such as node maintenance or preemption), yet reduce costs by avoiding unnecessary pod retries when there are software bugs.
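The "Mini Cluster" described above is expressed as a custom resource managed by the Flux Operator. A hedged sketch of what such a manifest looks like (the API version, image, and command here are illustrative assumptions and vary by operator release, so consult the Flux Operator documentation for the exact schema):

```yaml
apiVersion: flux-framework.org/v1alpha2
kind: MiniCluster
metadata:
  name: flux-sample
spec:
  size: 4                        # number of pods in the MiniCluster
  containers:
  - image: my-app-image          # placeholder application image
    command: my-mpi-app          # placeholder command run under Flux
```

The operator then creates an indexed job of the requested size and bootstraps a Flux instance across the pods, so the workload runs under the Flux resource manager rather than directly under the Job controller.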



Related work

Authors: Claudia Misale, Daniel Milroy

Authors: Wilfred Spiegelenburg, Peter Bacsko