
Surviving From Endless Issues Coming From 7K+ Kubernetes Clusters - Wanhae Lee & Seok

2022-10-27

Authors:   Seok-yong Hong, Wanhae Lee


Summary

The presentation discusses the development of a detection tool for Kubernetes clusters to identify known issues and automate problem-solving.
  • The tool, called Detect, examines multiple factors to identify problems with Kubernetes clusters.
  • It uses a variety of sources, including Kubernetes, Prometheus, and SSH, to collect data and generate reports.
  • Detect is extensible, allowing users to add or remove rules as needed.
  • The presentation includes a demonstration of how to create a new detector rule that identifies clusters with more than 10,000 pods.
  • The tool is designed to help users manage upgrades and avoid common issues with Kubernetes clusters.
During the presentation, the speaker demonstrated how to use Detect to identify a cluster with more than 12,000 restarted pods and outdated TLS certificates. The tool generated a report with actionable insights for the user.
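
The idea of an extensible rule set can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the actual Detect implementation: `ClusterFacts`, `RuleRegistry`, `too_many_pods`, and `cert_expiring` are hypothetical names, the facts are supplied by hand rather than collected from Kubernetes, Prometheus, or SSH, and the 10,000 threshold is assumed to refer to pods.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ClusterFacts:
    """Hypothetical snapshot of data already collected for one cluster."""
    name: str
    pod_count: int
    days_until_cert_expiry: int


# A rule inspects the facts and returns a finding message, or None if all is well.
Rule = Callable[[ClusterFacts], Optional[str]]


class RuleRegistry:
    """Holds named rules so they can be added or removed independently."""

    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}

    def add(self, name: str, rule: Rule) -> None:
        self._rules[name] = rule

    def remove(self, name: str) -> None:
        # Removing an unknown rule is a no-op.
        self._rules.pop(name, None)

    def run(self, facts: ClusterFacts) -> List[str]:
        """Apply every registered rule and collect the findings as report lines."""
        findings = []
        for name, rule in self._rules.items():
            message = rule(facts)
            if message is not None:
                findings.append(f"[{name}] {facts.name}: {message}")
        return findings


def too_many_pods(facts: ClusterFacts, limit: int = 10_000) -> Optional[str]:
    # Assumed threshold, taken from the demonstration described in the summary.
    if facts.pod_count > limit:
        return f"{facts.pod_count} pods exceeds the limit of {limit}"
    return None


def cert_expiring(facts: ClusterFacts, min_days: int = 30) -> Optional[str]:
    # The 30-day window is an illustrative assumption.
    if facts.days_until_cert_expiry < min_days:
        return f"TLS certificate expires in {facts.days_until_cert_expiry} days"
    return None


if __name__ == "__main__":
    registry = RuleRegistry()
    registry.add("too-many-pods", too_many_pods)
    registry.add("cert-expiring", cert_expiring)
    for line in registry.run(ClusterFacts("demo-cluster", 12_000, 5)):
        print(line)
```

Keeping each rule as a plain function that returns either a message or nothing means a new check can be registered without touching the code that collects data or formats the report.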

Abstract

Kakao is the 'mobile life platform' company dedicated to renewing daily lives and the leading player in the mobile messenger market in South Korea. As members of the private Kubernetes as a Service team at Kakao Corp, we have seen the service expand impressively, from 2K clusters with 20K nodes last year to 7K+ clusters with 100K+ nodes. With this unprecedented growth in the number of clusters, we have faced several problems we had never met before. One of them is an ever-growing number of on-call issues that are barely manageable for a DevOps team consisting of a small group of developers. In this session, we reveal the secret of how this small team has survived the endless issues generated by 7K+ Kubernetes clusters. We will also illustrate what tools we have made and why we open-sourced some of them.
