Wildfires, Firefighters and Sustainability - Learnings from Mitigating Kubernetes Fires in the Community

2023-04-20

Authors: Nabarun Pal, Madhav Jivrajani


Summary

The talk discusses the challenges of maintaining open-source projects, specifically Kubernetes releases, and the importance of sustainability in the community.
  • Kubernetes releases occur every four months and involve a large team of contributors.
  • Fixing issues in releases often requires deep expertise possessed by a small set of people, leading to unsustainable firefighting.
  • The talk provides examples of past incidents and what was learned from them.
  • Improvements and takeaways for maintaining sustainability in open-source projects are discussed.
The talk mentions a recent incident in which a regression was found in Kubernetes 1.27 shortly after its release. The regression had to be fixed in master and cherry-picked to the release branch, which required meticulous planning and time. The incident illustrates how much effort a single late fix can demand from a small group of maintainers, and why sustainability matters.

Abstract

Kubernetes releases are a herculean task and involve thousands of contributors. Maintaining open-source projects can be a fun and rewarding experience, but it also brings situations where contributors need to firefight. Such situations come with every Kubernetes release, and more often than not, fixing the underlying issues requires deep expertise in the project, which only a small set of people possess. Firefighting in this manner is not sustainable for either the project or the contributors.

To give a few examples, the Kubernetes 1.24 release was almost at a standstill because of a bug introduced in the Go 1.18 standard library, and missing a few edge cases led to a release-blocking scalability regression. These are not isolated cases. We keep facing such situations release after release, and we are pretty sure other projects face the same. In this session, we will recall the above incidents, what we learnt from them, and how you, as CNCF project maintainers, can avoid such situations and ensure sustainability for your contributors.

Outline:
  • Introduction and setting the context (5 min)
  • Why were the releases delayed? (5 min)
  • What went right? (5 min)
  • What could be done better? (5 min)
  • Takeaways (15 min)

Related work

Authors: Nikhita Raghunath, Kiran Mova
2022-10-26



Authors: Carlos Panato, Adolfo García Veytia
2023-04-20

Authors: Marko Mudrinić, Verónica López González
2023-04-19

Authors: Matei David
2023-04-20