Authors: Ricardo Rocha, Spyros Trigazis
2023-04-20

The Kubernetes infrastructure at CERN runs a variety of workloads, from scientific computing to critical services for the campus and our physics accelerator complex. It’s important to offer the features and capabilities our users require, and even more important to deliver the high levels of service they expect. In this session we present in detail a recent incident in which a rogue maintenance tool deleted a third of our production capacity in minutes, how this caused only service degradation and no downtime, and how we were able to recover in a short time. We describe our architecture for achieving high service availability, the choices we made to reduce blast radius, the concept of “clusters as cattle”, and how extensive use of GitOps saved the day. We will also describe some lessons learned in the process, the cyclic dependencies we detected when recovering from a major outage, and the corner cases where more care is needed for stateful workloads and multi-cluster scheduling. We will demo this on stage, showing how real CERN services recover from events that not so long ago would have had a very serious impact, and how the effort of the last few years has paid off, with our users responding calmly and positively while going through a major incident.
Authors: Michael Hrivnak, Rajula Vineet Reddy, Francisco Barros, Varsha Prasad Narsing
2023-04-19

tldr - powered by Generative AI

CERN uses the operator pattern to automate and scale delivery of CMS websites, balancing reusability and open source principles against integration with CERN’s specific compute environment and existing infrastructure services.
  • CERN operates 1000+ CMS websites as a SaaS running on Kubernetes
  • The small team used the operator pattern to automate and scale delivery of CMS websites
  • Balancing reusability and open source principles against integration with CERN’s specific compute environment and existing infrastructure services
  • Operator SDK, its best practices, and things to avoid when developing an operator from scratch
  • How Kubernetes enables isolation, multi-tenancy, and resource sharing
  • Automated maintenance and monitoring
Authors: Fernando Barreiro Megino, Lukas Heinrich
2022-05-19

tldr - powered by Generative AI

The presentation discusses the use of Kubernetes in high energy physics data analysis, specifically for batch processing and interactive analysis facilities.
  • Kubernetes is used for batch processing in high energy physics data analysis, allowing for scaling up to hundreds of thousands of cores with minimal failure rates.
  • Kubernetes also enables the use of heterogeneous architectures, such as ARM and GPU resources, for data analysis.
  • Interactive analysis facilities using Jupyter and Dask are also implemented using Kubernetes, allowing for dynamic scaling of resources.
  • The presentation includes anecdotes of successful use of Kubernetes in simulating events on ARM resources and scaling up Dask clusters for faster data analysis.
Authors: Dejan Golubovic, Daniel Holmberg
2022-05-19

tldr - powered by Generative AI

Machine learning can improve results in studying subatomic particles, and Kubeflow can help run machine learning workloads.
  • Using machine learning can improve results in studying subatomic particles, as demonstrated by the jet energy regression example
  • Kubeflow can help run machine learning workloads
  • Challenges in implementing the demo included finding the correct version of the Triton server image and customizing TensorBoard
  • Possible improvements include replicating profiles across multiple clusters, making pipelines namespaced, and adding LimitRange resources to profiles