What Kind of CPU is it Anyways? Airbnb's Journey to Heterogeneous Clusters


Authors:   Evan Sheng, David Morrison


Airbnb's journey to heterogeneous clusters and the technical and organizational hurdles they faced in migrating from homogeneous clusters
  • Airbnb migrated from running homogeneous Kubernetes clusters to heterogeneous clusters to improve cost and efficiency
  • Changes were required in almost every part of their infrastructure to support multiple different node types
  • They faced three specific technical and organizational hurdles in this journey
  • Heterogeneous clusters have been instrumental for Airbnb and their team
Airbnb previously ran Chef on AWS EC2, where each replica of each service had its own machine. In 2017 and 2018, they started building OneTouch, an abstraction layer on top of Kubernetes for developers. In 2018 and 2019, they migrated 90% of their 700+ services to Kubernetes. Initially, their clusters were separated by environment, but they were forced to split these into different cluster types as they hit Kubernetes limits on the size of a single cluster. They started with clusters built from a single instance type, but as more specialized workloads migrated, they needed different instance types, such as GPU instances.


In this talk, we describe the technical and organizational hurdles Airbnb needed to overcome to migrate from running "homogeneous" Kubernetes clusters (i.e., clusters in which the majority of nodes are the same type) to "heterogeneous" clusters (i.e., clusters in which pods can be scheduled on a variety of different node types). Why did we make this change? Two reasons: cost and efficiency. Restructuring our clusters to support multiple node types unlocked the ability to run each workload on the machines best suited to it, not just whatever our "default" happened to be. However, getting to this point wasn't easy. We'll describe the changes that were required in almost every part of our infrastructure, from the ways we provision and scale clusters all the way down to the API that our customer teams use. We'll also discuss the organizational hurdles that we had to address to build confidence in this new operating model.