Setting up Etcd with Kubernetes to Host Clusters with Thousands of Nodes

2023-04-20

Authors:   Laurent Bernaille, Marcel Zięba


Summary

The presentation discusses challenges in running large Kubernetes clusters and offers best practices to overcome them. It also highlights the importance of using informers and avoiding list calls to improve performance.
  • Running large Kubernetes clusters remains challenging despite community improvements
  • Defaults are not always sufficient; best practices should be followed
  • Avoid list calls and use informers to improve performance
  • Maintain memory and CPU headroom to absorb bad events
  • Streaming lists, introduced in Kubernetes 1.27, can reduce memory usage
The presentation also describes an incident in which a naive safeguard against accidentally deleting nodes in a node group resulted in hundreds of calls to etcd and caused performance issues. Replacing the list calls with informers resolved the problem.
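
To make the difference between list calls and informers concrete, here is a minimal, self-contained sketch. It is a toy model and not the code from the incident: `FakeAPIServer`, `NodeInformer`, `naive_guard`, and `informer_guard` are hypothetical names, and the "keep at least one node in the group" rule is an assumed stand-in for whatever check the real safeguard performed. The sketch only shows that a cache filled by one initial list and then updated by watch events can answer repeated queries without further list calls.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAPIServer:
    """Toy stand-in for the API server (hypothetical); counts LIST calls."""
    nodes: dict = field(default_factory=dict)    # node name -> node group
    watchers: list = field(default_factory=list)
    list_calls: int = 0

    def list_nodes(self):
        # Every LIST is counted; in a real cluster such calls can reach etcd.
        self.list_calls += 1
        return dict(self.nodes)

    def watch(self, handler):
        self.watchers.append(handler)

    def add_node(self, name, group):
        self.nodes[name] = group
        for handler in self.watchers:
            handler("ADDED", name, group)

    def delete_node(self, name):
        group = self.nodes.pop(name)
        for handler in self.watchers:
            handler("DELETED", name, group)


class NodeInformer:
    """Local cache: one initial LIST, then kept current by watch events."""

    def __init__(self, api):
        self.cache = api.list_nodes()
        api.watch(self._on_event)

    def _on_event(self, kind, name, group):
        if kind == "DELETED":
            self.cache.pop(name, None)
        else:
            self.cache[name] = group

    def count_in_group(self, group):
        return sum(1 for g in self.cache.values() if g == group)


def naive_guard(api, group):
    """Assumed rule: allow a deletion only if another node remains. LISTs every call."""
    return sum(1 for g in api.list_nodes().values() if g == group) > 1


def informer_guard(informer, group):
    """Same assumed rule, answered from the local cache with no API call."""
    return informer.count_in_group(group) > 1


if __name__ == "__main__":
    api = FakeAPIServer()
    for i in range(3):
        api.add_node(f"node-{i}", "group-a")

    informer = NodeInformer(api)            # the only LIST the informer issues
    for _ in range(100):
        informer_guard(informer, "group-a")
    print(api.list_calls)                   # 1

    for _ in range(100):
        naive_guard(api, "group-a")
    print(api.list_calls)                   # 101
```

In a real controller the cache would typically come from an existing informer implementation in the client library rather than a hand-written class. This toy model also delivers events synchronously, so it omits resynchronization and the ordering concerns a real watch has to handle.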

Abstract

Setting up clusters that need thousands of nodes can be challenging, especially when it comes to etcd architecture and configuration. Such clusters are common in use cases like large processing farms for AI/ML/HPC workloads and internet-scale serving applications. In this session you will learn best practices for etcd deployment architecture and configuration from tech leads at Datadog and Google Cloud. Datadog has been running its own Kubernetes clusters with thousands of nodes for many years. Google Cloud has offered managed clusters of up to 15,000 nodes since 2020. You will hear from practitioners how to get more performance, reliability, and scale out of the etcd instances in your clusters, including how to handle disk I/O and network throughput bottlenecks and how to handle API server restarts and their impact on etcd.

Materials: