Supporting Long-Lived Pods Using a Simple Kubernetes Webhook

Conference: KubeCon + CloudNativeCon Europe 2022

2022-05-18

Authors: Clément Labbe

Summary

Some applications, such as distributed caches and batch workers, need to run for a long time without interruption.
To support long-lived pods, Slack uses an admission webhook that injects tolerations into pods and a custom service that taints nodes with their uptime.
The two sides work together: tolerations express a pod's need for a long life, and uptime taints steer such pods onto young nodes that are less likely to be terminated early.
Limitations include a lack of monitoring tools to measure success.
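
To make the node side of this concrete, here is a minimal sketch in Go of how a tainting service might turn a node's uptime into a taint. It is not Slack's implementation: the taint key example.com/node-age, the bucket names, and the thresholds are all hypothetical, and a real service would apply the taint through the Kubernetes API rather than print it.

```go
package main

import (
	"fmt"
	"time"
)

// Taint mirrors the three fields of a Kubernetes node taint.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// ageTaintKey is a hypothetical taint key used only for this illustration.
const ageTaintKey = "example.com/node-age"

// ageBucket maps a node's uptime to a coarse age bucket.
// The thresholds (1 day, 7 days) are assumptions, not values from the talk.
func ageBucket(uptime time.Duration) string {
	switch {
	case uptime < 24*time.Hour:
		return "young"
	case uptime < 7*24*time.Hour:
		return "middle"
	default:
		return "old"
	}
}

// uptimeTaint builds the taint a tainting service could place on a node.
// NoSchedule only affects new scheduling decisions; running pods stay put.
func uptimeTaint(uptime time.Duration) Taint {
	return Taint{Key: ageTaintKey, Value: ageBucket(uptime), Effect: "NoSchedule"}
}

func main() {
	for _, uptime := range []time.Duration{2 * time.Hour, 72 * time.Hour, 30 * 24 * time.Hour} {
		taint := uptimeTaint(uptime)
		fmt.Printf("uptime %v -> %s=%s:%s\n", uptime, taint.Key, taint.Value, taint.Effect)
	}
}
```

Coarse buckets keep the number of distinct taint values small, so pods only need a handful of tolerations. A finer-grained scheme, such as one value per day of uptime, would work the same way.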

Slack has a long tail of "unruly ducks": applications that are hard to move from one platform to another.

Abstract

Today's applications strive to boot fast, be stateless, and handle unexpected terminations gracefully. However, some applications, such as distributed caches, can take a while to warm up to a running state, while batch workers would rather avoid being terminated before they are done. At Slack, such applications found their home in Kubernetes thanks to a two-sided system: on one hand, an admission webhook injects tolerations into pods to signal their requirement to be long-lived; on the other hand, a custom service taints nodes with their uptime. As a result, pods that want a long life are scheduled on young nodes, which are less likely to be terminated early. This talk will first describe how to write a simple Kubernetes admission webhook (https://github.com/slackhq/simple-kubernetes-webhook) to inject tolerations in pods, then move on to the symbiotic node tainting system, and end with gotchas and some metrics on how this long-lived pod support is used at Slack.
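
The webhook side can be sketched with the Go standard library alone. The example below builds the JSON Patch (RFC 6902) operations a mutating admission webhook could return to add tolerations to a pod. It is an illustration under stated assumptions, not code from the linked repository: the label example.com/lifetime, the taint key example.com/node-age, and the rule that long-lived pods tolerate only "young" nodes are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Toleration mirrors the JSON shape of a Kubernetes pod toleration.
type Toleration struct {
	Key      string `json:"key"`
	Operator string `json:"operator"`
	Value    string `json:"value,omitempty"`
	Effect   string `json:"effect"`
}

// PatchOp is a single JSON Patch (RFC 6902) operation.
type PatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// Hypothetical names chosen for this sketch only.
const (
	ageTaintKey   = "example.com/node-age"
	lifetimeLabel = "example.com/lifetime"
)

// tolerationsFor picks tolerations from pod labels. Pods labeled as
// long-lived tolerate only "young" nodes; all other pods tolerate any
// value of the age taint and can therefore land on nodes of any age.
func tolerationsFor(labels map[string]string) []Toleration {
	if labels[lifetimeLabel] == "long" {
		return []Toleration{{Key: ageTaintKey, Operator: "Equal", Value: "young", Effect: "NoSchedule"}}
	}
	return []Toleration{{Key: ageTaintKey, Operator: "Exists", Effect: "NoSchedule"}}
}

// buildPatch returns operations that add tols to spec.tolerations.
// If the pod has no tolerations yet, the whole array must be created;
// otherwise each toleration is appended with the "-" index.
func buildPatch(hasTolerations bool, tols []Toleration) []PatchOp {
	if !hasTolerations {
		return []PatchOp{{Op: "add", Path: "/spec/tolerations", Value: tols}}
	}
	ops := make([]PatchOp, 0, len(tols))
	for _, tol := range tols {
		ops = append(ops, PatchOp{Op: "add", Path: "/spec/tolerations/-", Value: tol})
	}
	return ops
}

func main() {
	tols := tolerationsFor(map[string]string{lifetimeLabel: "long"})
	patch, err := json.Marshal(buildPatch(false, tols))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

In a real webhook, the marshaled patch would be returned in the AdmissionReview response with the patch type set to JSONPatch. The HTTP server, TLS setup, and request decoding are omitted here.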
