Tales from on-Call: Fun with Operating Etcd at Scale

2023-04-19

Authors:   Chao Chen, Geeta Gharpure


Summary

Operational issues seen when running etcd, and their mitigations:
  • Database size exceeding its quota
  • Revision divergence
  • Out-of-memory panics
  • Timeouts due to defrag
  • Oversized requests
One mitigation is to paginate unpaginated list requests, which avoids spiky workloads that can cause memory pressure and quorum loss. Another is a server-side throttler that checks for memory pressure and delays range requests when necessary. Reducing the frequency of online defrag and keeping the number of keys in etcd small helps mitigate timeouts caused by defrag. Oversized requests are typically caused by workload issues and can be addressed by following best practices and using EndpointSlices instead of Endpoints.
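
To make the pagination idea concrete, here is a minimal sketch in Python. It is not the talk's implementation and does not use a real etcd client: `FakeStore`, `prefix_end`, and `paginated_list` are hypothetical names, and the store is an in-memory stand-in that returns a bounded page plus a `more` flag. The point is only that each request reads at most `page_size` keys, so no single call has to materialize the whole keyspace.

```python
import bisect
from typing import Dict, Iterator, List, Tuple

Pair = Tuple[str, str]


class FakeStore:
    """Hypothetical in-memory stand-in for a sorted key-value store."""

    def __init__(self, items: Dict[str, str]) -> None:
        self._items = dict(items)
        self._keys = sorted(self._items)

    def range(self, start: str, end: str, limit: int) -> Tuple[List[Pair], bool]:
        """Return up to `limit` pairs with start <= key < end, plus a `more` flag."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        page_keys = self._keys[lo:min(hi, lo + limit)]
        more = lo + limit < hi  # keys remain in the range beyond this page
        return [(k, self._items[k]) for k in page_keys], more


def prefix_end(prefix: str) -> str:
    """Smallest string greater than every key starting with a non-empty `prefix`."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def paginated_list(store: FakeStore, prefix: str, page_size: int) -> Iterator[List[Pair]]:
    """Yield all pairs under `prefix`, one bounded page at a time."""
    start, end = prefix, prefix_end(prefix)
    while True:
        page, more = store.range(start, end, page_size)
        if page:
            yield page
        if not more:
            return
        # Resume just after the last key returned by the previous page.
        start = page[-1][0] + "\x00"


store = FakeStore({f"/registry/pods/{name}": "spec" for name in "abcde"})
for page in paginated_list(store, "/registry/pods/", page_size=2):
    print([key for key, _ in page])
```

The caller trades one large response for several small ones, so peak memory per request is bounded by `page_size` rather than by the number of keys under the prefix.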

Abstract

etcd is the backbone of a Kubernetes cluster. At scale, workloads push etcd to its limits. In this session, engineers from the EKS etcd team will share the challenges, experiences, and solutions for the issues they see when operating etcd. Topics include handling etcd out-of-memory conditions, managing the etcd size quota, detecting and recovering from revision divergence, and more. If you want to compare notes on etcd on-call shifts or just learn more about etcd operations, this session is for you!