Autoscaling Can Be Reliable: Running Cluster Autoscaler in Prod

Conference: KubeCon + CloudNativeCon Europe 2023

2023-04-20

Authors: Maciek Pytel

Summary

The presentation covers how to run Cluster Autoscaler reliably in production, including which metrics and logs to use for monitoring and debugging.

Cluster Autoscaler's primary job is to ensure that all pods can be scheduled
Metrics such as the number of pending (unschedulable) pods are useful for monitoring Cluster Autoscaler's performance
Cluster Autoscaler should be run on dedicated nodes or on the control plane VMs to prevent issues with scaling down
Testing configurations before using them in production is recommended
Some flags, such as --ignore-taint, can have significant side effects and should be used with care
Autoscaling behavior can vary significantly at scale and should be tested at that scale

The speaker, who has been part of the GKE team running tens of thousands of instances of Cluster Autoscaler, recommends running it on dedicated nodes or on the control plane VMs to avoid issues with scaling down. He also warns that some flags, such as --ignore-taint, can have significant side effects. Testing configurations before using them in production is also recommended.
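As an illustration of the pending pod metric mentioned above, the following is a minimal sketch of a check that reads Prometheus text-format output and flags a cluster with unschedulable pods. It is not taken from the talk. The metric name cluster_autoscaler_unschedulable_pods_count is assumed here and should be checked against the metrics exposed by your Cluster Autoscaler version. The sample values, the function names, and the threshold are hypothetical.

```python
# Hypothetical sample of Prometheus text-format output; the values are made up.
SAMPLE_METRICS = """\
# TYPE cluster_autoscaler_unschedulable_pods_count gauge
cluster_autoscaler_unschedulable_pods_count 7
cluster_autoscaler_nodes_count{state="ready"} 12
"""


def sum_samples(metrics_text, name):
    """Sum all samples of metric `name`, ignoring labels.

    Returns None if the metric does not appear in the text.
    """
    total, seen = 0.0, False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            metric, rest = line.split("{", 1)
            if "}" not in rest:
                continue  # malformed line, ignore it
            value_part = rest.rsplit("}", 1)[1]
        else:
            metric, _, value_part = line.partition(" ")
        if metric.strip() != name:
            continue
        fields = value_part.split()
        if not fields:
            continue
        total += float(fields[0])  # first field is the value, a timestamp may follow
        seen = True
    return total if seen else None


def pending_pods_alert(metrics_text, threshold=0):
    """Return True when the pending pod count exceeds `threshold` (assumed policy)."""
    pending = sum_samples(metrics_text, "cluster_autoscaler_unschedulable_pods_count")
    return pending is not None and pending > threshold


if __name__ == "__main__":
    print(pending_pods_alert(SAMPLE_METRICS))
```

In practice a check like this would more likely live in an alerting rule than in a script, and it would consider how long pods have been pending rather than a single reading.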

Abstract

Cluster Autoscaler can automatically manage the nodes of your Kubernetes clusters running on any of the 25+ supported cloud providers. Unfortunately, there are a lot of things that can go wrong when managing the nodes: nodes failing to boot up, hitting a resource quota, the provider running out of capacity, a misconfigured pod blocking scale-down and running up the bill; the list goes on. Some of those problems Cluster Autoscaler may be able to handle for you, while others may require manual intervention. Maciek has been part of the GKE team running tens of thousands of instances of Cluster Autoscaler for many years. He will share some of his experiences, covering:

* What metrics are useful for monitoring Cluster Autoscaler in a single cluster or across a small fleet of clusters.
* How to use metrics to quickly identify common issues.
* When to look into logs and how to read them.
* What the most common types of issues faced by Cluster Autoscaler are and which configurations should be avoided.

During the talk Maciek will focus on issues that are common to running Cluster Autoscaler on any cloud and not specific to any single provider.
