logo
Dates

Author


Conferences

Tags

Sort by:  

Authors: Ricardo Rocha, Spyros Trigazis
2023-04-20

The Kubernetes infrastructure at CERN runs a variety of workloads, from scientific computing to critical services for campus and our physics accelerator complex. It’s important to offer the features and capabilities our users require, but even more the required high levels of service. In this session we present in detail a recent incident where a rogue maintenance tool deleted a third of our production capacity in minutes, how this resulted in no downtime with only service degradation and how we were able to recover in a short time. We describe our architecture to achieve high service availability, the options we took to reduce blast radius, the concept of “clusters as cattle” and how extensive use of gitops saved the day. We will also describe some lessons learned in the process, the detected cyclic dependencies when recovering from a major outage, and the corner cases where more care is needed for stateful workloads and multi-cluster scheduling. We will demo this on stage showing how real CERN services recover from what would not so long ago be events with a very serious impact. And how the effort from the last years has paid off, with our users responding calmly and positively while going through a major incident.
Authors: Kerim Satirli, Julie Gunderson
2022-05-19

Site Reliability Engineering (SRE) treats reliability as a software problem, but it really is an organizational problem that requires a different mindset. When the reliability of our service drops, so does our ability to create value for the organization we represent. In this talk, Julie and Kerim will take the audience on a guided journey, starting with how to determine if and how workloads are misbehaving and ending with practical approaches to improve reliability. Through simulated outages (of all types!), observability, and analysis, Julie and Kerim will show attendees how to catch and prepare for service disruptions. Going beyond deployments, attendees will also learn how to combine OpenTelemetry and OpenTracing to instill reliability into their systems.Click here to view captioning/translation in the MeetingPlay platform!