Kubernetes as a Substrate for ATLAS Compute

2022-05-19

Authors:   Fernando Barreiro Megino, Lukas Heinrich


Summary

The presentation discusses the use of Kubernetes in high-energy physics data analysis, both for batch processing and for interactive analysis facilities.
  • Kubernetes is used for batch processing, scaling up to hundreds of thousands of cores with minimal failure rates.
  • Kubernetes also makes heterogeneous architectures, such as ARM and GPU resources, available for data analysis.
  • Interactive analysis facilities based on Jupyter and Dask are also implemented on Kubernetes, allowing resources to be scaled dynamically.
  • The presentation includes anecdotes of successful use of Kubernetes: simulating events on ARM resources and scaling up Dask clusters for faster data analysis.
One example of successful use is the simulation of events on ARM resources. Many sites were interested in purchasing ARM resources, but none wanted to be the first to do so. To address this, the team set up an EKS cluster with Graviton2 nodes and used multi-arch Docker images, so that each client receives the image variant built for its own architecture. This allowed the first 10,000 events ever simulated on ARM to be generated and compared with events simulated on x86 to check that the results agree.
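The multi-arch mechanism mentioned above can be illustrated with a minimal Python sketch. It assumes a trimmed image index in the OCI image-index layout, where each entry carries a `platform` object with `architecture` and `os` fields. The index contents, the placeholder digests, and the helper names `select_manifest` and `local_oci_arch` are hypothetical and not taken from the talk; real registries and container runtimes perform this selection themselves.

```python
import platform

# Map names returned by platform.machine() to the architecture names
# used in OCI image indexes.
MACHINE_TO_OCI_ARCH = {
    "x86_64": "amd64",
    "AMD64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}

# Hypothetical, trimmed image index: one entry per architecture.
# The digests are placeholders, not real image digests.
EXAMPLE_INDEX = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {
            "digest": "sha256:placeholder-amd64",
            "platform": {"architecture": "amd64", "os": "linux"},
        },
        {
            "digest": "sha256:placeholder-arm64",
            "platform": {"architecture": "arm64", "os": "linux"},
        },
    ],
}


def select_manifest(index, architecture, os_name="linux"):
    """Return the index entry whose platform matches the requested one."""
    for entry in index["manifests"]:
        plat = entry.get("platform", {})
        if plat.get("architecture") == architecture and plat.get("os") == os_name:
            return entry
    raise LookupError(f"no image variant for {os_name}/{architecture}")


def local_oci_arch():
    """Translate the local machine type into an OCI architecture name."""
    machine = platform.machine()
    return MACHINE_TO_OCI_ARCH.get(machine, machine)


if __name__ == "__main__":
    # A Graviton2 node would resolve to the arm64 variant.
    print(select_manifest(EXAMPLE_INDEX, "arm64")["digest"])
```

The same image name therefore resolves to different binaries on x86 and ARM nodes, which is what lets one workload definition run on both kinds of resources.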

Abstract

The ATLAS experiment at CERN is one of the largest scientific machines built to date and will have ever growing computing needs as it explores higher energy and luminosity proton collisions. Recent R&D on the integration of cloud infrastructures with ATLAS' Worldwide LHC Computing Grid resources identified Kubernetes as a commonly available, ideal substrate. While Kubernetes is widely known for its service management capabilities, it also offers powerful batch controllers for containerised workloads. We exploited these capabilities to build ephemeral batch clusters with over 100k vCPU to process tasks that require quick turnaround, make available GPU resources that are not widely available in our own infrastructure, or create interactive facilities, where users can easily spin up private clusters for their distributed analysis from a notebook.
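As a rough illustration of the batch controllers mentioned in the abstract, the following Python sketch builds a `batch/v1` Job manifest as a plain dictionary. It uses only fields defined by the Kubernetes Job API, the well-known `kubernetes.io/arch` node label, and the `nvidia.com/gpu` resource name exposed by the NVIDIA device plugin. The job name, image name, command, and the helper `build_batch_job` are hypothetical; the abstract does not describe how ATLAS actually submits its workloads.

```python
import json


def build_batch_job(name, image, command, parallelism, completions,
                    arch="amd64", gpus=0):
    """Build a Kubernetes batch/v1 Job manifest as a dictionary."""
    container = {"name": "payload", "image": image, "command": list(command)}
    if gpus > 0:
        # Resource name as exposed by the NVIDIA device plugin (assumed installed).
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": parallelism,       # pods running at the same time
            "completions": completions,       # successful pods required in total
            "backoffLimit": 3,                # retries before the Job is marked failed
            "ttlSecondsAfterFinished": 3600,  # clean up the finished Job after an hour
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Schedule only on nodes of the requested CPU architecture.
                    "nodeSelector": {"kubernetes.io/arch": arch},
                    "containers": [container],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image and command, for illustration only.
    job = build_batch_job(
        name="example-simulation",
        image="registry.example.org/example/simulation:latest",
        command=["run-simulation", "--events", "100"],
        parallelism=10,
        completions=100,
        arch="arm64",
    )
    print(json.dumps(job, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML.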
