Supporting Long-Lived Pods Using a Simple Kubernetes Webhook


Authors:   Clément Labbe


Supporting long-lived pods using a simple Kubernetes webhook
  • Some applications like distributed caches and batch workers require a long lifespan
  • To support long-lived pods, Slack uses an admission webhook that injects tolerations into pods and a custom service that taints nodes with their uptime
  • The solution is a two-sided system: the webhook on the pod side and a symbiotic node tainting service on the node side
  • Limitations include lack of monitoring tools to measure success
  • Slack has a long tail of unruly ducks, which are applications that are hard to move from one platform to another


Today's applications strive to boot fast, be stateless, and handle unexpected terminations gracefully. However, some applications, like distributed caches, can take a while to warm up to a running state, while batch workers would rather avoid being terminated before they're done. At Slack, such applications found their home in Kubernetes thanks to a two-sided system: on one hand, an admission webhook injects tolerations into pods to signal their need to be long-lived, and on the other hand, a custom service taints nodes with their uptime. As a result, pods that want a long life are scheduled on young nodes, which are less likely to be terminated early. This talk will first describe how to write a simple Kubernetes admission webhook (https://github.com/slackhq/simple-kubernetes-webhook) to inject tolerations into pods, then move on to the symbiotic node tainting system, and end with gotchas and some metrics on how this long-lived pod support is used at Slack.
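
The two sides described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Slack's implementation: the taint key `example.com/node-uptime`, the label `example.com/long-lived`, the bucket names and their thresholds are all hypothetical, and the toleration matching is a simplified version of what the Kubernetes scheduler does. The only parts taken from Kubernetes itself are the shape of the `admission.k8s.io/v1` AdmissionReview response and the base64-encoded JSON Patch it carries.

```python
import base64
import json

# Hypothetical names: the abstract does not say which key or label Slack uses.
UPTIME_TAINT_KEY = "example.com/node-uptime"
LONG_LIVED_LABEL = "example.com/long-lived"


def uptime_bucket(uptime_hours):
    """Map node uptime to a coarse taint value (thresholds are assumptions)."""
    if uptime_hours < 24:
        return "young"
    if uptime_hours < 24 * 7:
        return "middle"
    return "old"


def node_taint(uptime_hours):
    """Node side: the taint a tainting service could place on a node."""
    return {
        "key": UPTIME_TAINT_KEY,
        "value": uptime_bucket(uptime_hours),
        "effect": "NoSchedule",
    }


def tolerations_for(pod):
    """Pod side: long-lived pods tolerate only young nodes, others any uptime."""
    labels = pod.get("metadata", {}).get("labels") or {}
    if labels.get(LONG_LIVED_LABEL) == "true":
        return [{"key": UPTIME_TAINT_KEY, "operator": "Equal",
                 "value": "young", "effect": "NoSchedule"}]
    return [{"key": UPTIME_TAINT_KEY, "operator": "Exists",
             "effect": "NoSchedule"}]


def mutate(review):
    """Build an AdmissionReview response that patches tolerations into the pod."""
    request = review["request"]
    pod = request["object"]
    new = tolerations_for(pod)
    if pod.get("spec", {}).get("tolerations") is None:
        # No list yet: create it in one operation.
        patch = [{"op": "add", "path": "/spec/tolerations", "value": new}]
    else:
        # Append to the existing list without touching user-set tolerations.
        patch = [{"op": "add", "path": "/spec/tolerations/-", "value": t}
                 for t in new]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }


def tolerates(toleration, taint):
    """Simplified toleration matching (ignores empty-key wildcards)."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value") == taint.get("value")
```

With this scheme a pod labeled as long-lived can only land on a node whose taint value is still `young`. Because the `NoSchedule` effect applies at scheduling time only, a pod already running on a node is not evicted when that node's taint later moves to an older bucket.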