
73,000 Pods a Day, Lessons From Misadventures In Multi-Tenant

2022-10-26

Authors:   Shane Corbett, Wil Reed


Summary

Lessons learned from misadventures in running a large-scale multi-tenant Kubernetes cluster in production
  • Assuming that Kubernetes resource concepts map directly onto Linux performance behavior is a big mistake
  • Thinking in cores can be dangerous, because the Linux scheduler thinks in CPU time
  • CPU settings expressed in cores are actually converted into allotments of CPU time
  • Properly scaling on the right metric can greatly simplify cluster setup and reduce churn
  • Measuring what's going on is necessary to understand best practices for a cluster
  • Prometheus is a good tool for measuring cluster performance
The two speakers spent over two years studying Linux kernel performance and building custom monitoring dashboards in order to run a large-scale multi-tenant application in production. They discovered that some of the practices they had taken for best practices were the ones holding them back the most. By focusing on the fundamentals and measuring what was actually happening, they were able to greatly simplify their cluster setup and reduce churn. They also found Prometheus to be a good tool for measuring cluster performance.
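
The point about cores converting into time can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the talk. It assumes the default 100 ms CFS period and the usual conversion of a CPU limit in millicores into a quota of CPU time per period, and of a CPU request into cgroup v1 CPU shares. The function names are hypothetical and chosen only for illustration.

```python
# Illustrative arithmetic only; names are hypothetical, not a real API.
CFS_PERIOD_US = 100_000   # assumed default CFS period: 100 ms
MIN_QUOTA_US = 1_000      # assumed minimum quota: 1 ms
SHARES_PER_CPU = 1024     # cgroup v1 shares that correspond to one full core
MIN_SHARES = 2            # smallest cgroup v1 shares value


def limit_to_quota_us(millicores, period_us=CFS_PERIOD_US):
    """CPU limit in millicores -> microseconds of CPU time per period."""
    if millicores <= 0:
        return 0  # treated here as "no limit set"
    return max(millicores * period_us // 1000, MIN_QUOTA_US)


def request_to_shares(millicores):
    """CPU request in millicores -> relative scheduling weight (shares)."""
    if millicores <= 0:
        return MIN_SHARES
    return max(millicores * SHARES_PER_CPU // 1000, MIN_SHARES)


def runnable_wall_time_us(millicores, busy_threads, period_us=CFS_PERIOD_US):
    """Wall-clock time per period before the quota runs out,
    assuming busy_threads threads each run flat out on their own core."""
    if busy_threads < 1:
        raise ValueError("busy_threads must be at least 1")
    quota = limit_to_quota_us(millicores, period_us)
    if quota == 0:
        return period_us  # no limit, so never throttled
    return min(period_us, quota / busy_threads)


if __name__ == "__main__":
    print(limit_to_quota_us(500))         # 50000 us of CPU time per 100 ms
    print(request_to_shares(250))         # 256 shares
    print(runnable_wall_time_us(500, 4))  # 12500.0 us, then throttled
```

Under these assumptions, a limit of "half a core" is really 50 ms of CPU time per 100 ms period. Four busy threads spend that budget in 12.5 ms of wall-clock time and then wait out the remaining 87.5 ms, which is why reasoning in cores alone can mislead.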

Abstract

We spent over two years poring over 800-page Linux kernel performance books, tweaking obscure control plane settings, and developing detailed custom monitoring dashboards so you don’t have to! We found there is a large delta between what we learned in CKA training and the layer upon layer of hard-won knowledge it takes to run a large-scale multi-tenant application in production. Join us as we take you through real-world findings that took months of research to fully understand, and provide evidence that some of the things we were convinced were best practices were the very things holding us back the most.
