
Preventing Controller Sprawl From Taking Down Your Cluster - When a Scalable Pattern Stops Being Scalable

2022-10-28

Authors:   Madhu C.S.


Summary

Best practices for managing Kubernetes clusters and extensions
  • Construct dashboards to make important metrics visible and accessible
  • Train teams to understand logs and use them for debugging
  • Have visibility into changes made to the system
  • Work closely with partner teams for writing extensions
  • Read the code to detect bugs and understand the system
The speaker shared their experience of using audit logs as a powerful debugging tool, including a case study in which the logs revealed a very large number of requests made by the kubelet.
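To make the audit-log point concrete, here is a minimal sketch of how one might tally requests per client from an audit log. It assumes the log is stored as JSON lines, one Kubernetes audit event per line, and reads only the `user.username`, `verb`, and `objectRef.resource` fields of each event. The function name `count_requests` and the sample events are hypothetical and are not taken from the talk.

```python
import json
from collections import Counter


def count_requests(lines):
    """Count audit events per (username, verb, resource) tuple."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(event, dict):
            continue
        user = (event.get("user") or {}).get("username", "<unknown>")
        verb = event.get("verb", "<unknown>")
        resource = (event.get("objectRef") or {}).get("resource", "<none>")
        counts[(user, verb, resource)] += 1
    return counts


# Hypothetical sample events, made up for illustration only.
SAMPLE = [
    '{"verb": "get", "user": {"username": "system:node:node-a"}, "objectRef": {"resource": "secrets"}}',
    '{"verb": "get", "user": {"username": "system:node:node-a"}, "objectRef": {"resource": "secrets"}}',
    '{"verb": "watch", "user": {"username": "system:serviceaccount:demo:controller"}, "objectRef": {"resource": "pods"}}',
    'not json',
]

if __name__ == "__main__":
    # Print the noisiest clients first.
    for key, n in count_requests(SAMPLE).most_common():
        print(n, *key)
```

Sorting by count puts the noisiest client, verb, and resource combination at the top, which is usually the first thing to look at when a single component is suspected of flooding the API server.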

Abstract

The vast majority of Kubernetes controllers make use of a WATCH and UPDATE pattern, which is a highly scalable, client-pull-based pattern. “Highly” does not mean “infinite”, and the spread of this pattern has led to a number of implicit design guarantees that operators build on. In this talk, the Container Orchestration team at Robinhood will cover the exploration of the boundaries of this pattern, how second-order effects result in service degradation in production, and best practices for monitoring, detecting, debugging, and addressing these issues. With examples drawn from real outages, the team will present lessons learned for organizations of all sizes.
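The WATCH and UPDATE pattern named above can be illustrated with a toy, in-memory model. This is a sketch under stated assumptions, not the Kubernetes API or a real client library: `ToyAPIServer`, `reconcile`, `passive`, and `simulate` are hypothetical names. It shows one controller that reacts to a watch event by writing status back, and how each write is delivered to every registered watcher. This is one plausible way load can grow as controllers multiply, and it is not a description of the specific outages covered in the talk.

```python
from collections import deque


class ToyAPIServer:
    """In-memory stand-in for an API server: stores objects, fans out events."""

    def __init__(self):
        self.objects = {}
        self.watchers = []
        self.queue = deque()
        self.events_delivered = 0

    def watch(self, handler):
        # Register a handler that is called for every object change.
        self.watchers.append(handler)

    def update(self, name, obj):
        # Persist the object and enqueue one change event.
        self.objects[name] = obj
        self.queue.append((name, obj))

    def run(self):
        # Deliver each queued event to every watcher, in order.
        while self.queue:
            name, obj = self.queue.popleft()
            for handler in self.watchers:
                self.events_delivered += 1
                handler(self, name, obj)


def reconcile(server, name, obj):
    # Controller: copy spec into status, and only write when they differ,
    # so the loop settles after a single extra update.
    if obj.get("status") != obj["spec"]:
        server.update(name, {"spec": obj["spec"], "status": obj["spec"]})


def passive(server, name, obj):
    # A watcher that only observes; it still costs one delivery per event.
    pass


def simulate(passive_watchers):
    server = ToyAPIServer()
    server.watch(reconcile)
    for _ in range(passive_watchers):
        server.watch(passive)
    server.update("web", {"spec": 3, "status": None})
    server.run()
    return server.objects["web"]["status"], server.events_delivered


if __name__ == "__main__":
    print(simulate(0))
    print(simulate(2))
```

In this toy model the number of deliveries is the number of writes multiplied by the number of watchers, so adding watchers raises the cost of every update even when most of them do nothing with the event.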
