Who Killed My Pod? #Whodunit

2021-10-13

Author: Suneeta Mall


Summary

The presentation walks through the investigation of a Kubernetes cluster whose pods were being OOMKilled with exit code 137, and the steps taken to identify and fix the issue.
  • The investigation began when a newly deployed application on a self-managed Kubernetes cluster started having its pods OOMKilled with exit code 137, which is 128 + 9, i.e. the container process was terminated by SIGKILL.
  • The process was being killed over and over again, most likely because it was hogging memory.
  • The kill came from the OS kernel's OOM killer; disabling memory overcommit was only a temporary workaround.
  • The lasting fix was to reduce the application's memory footprint and to declare resource requests and limits on the pod so that it gets a guaranteed resource quality of service (see the sketch after this list).
  • The presentation also covers the different layers of the container runtime stack and the role Kubernetes plays in managing container processes across multiple hosts.
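As a concrete illustration of the quality-of-service fix, here is a minimal sketch (not taken from the talk) of a pod definition built with Kubernetes' Go client types, in which every container's requests equal its limits; that is the condition under which Kubernetes assigns the Guaranteed QoS class, making the pod the last candidate for eviction and OOM killing. The pod name, image, and the 500m/512Mi figures are illustrative assumptions.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        // Identical requests and limits on every container => Guaranteed QoS.
        quota := corev1.ResourceList{
            corev1.ResourceCPU:    resource.MustParse("500m"),  // illustrative value
            corev1.ResourceMemory: resource.MustParse("512Mi"), // illustrative value
        }

        pod := &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "my-app"}, // hypothetical name
            Spec: corev1.PodSpec{
                Containers: []corev1.Container{{
                    Name:  "my-app",
                    Image: "registry.example.com/my-app:latest", // hypothetical image
                    Resources: corev1.ResourceRequirements{
                        Requests: quota,
                        Limits:   quota, // same quantities as the requests
                    },
                }},
            },
        }

        fmt.Printf("pod %q declares requests == limits, so it gets the Guaranteed QoS class\n", pod.Name)
    }

With requests equal to limits, the kubelet also hands the container runtime a hard memory limit, so an application that outgrows its declared footprint is killed predictably at that limit instead of destabilising the whole node.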
Tracking down the OOMKilled pods played out like a crime scene investigation: who killed the pod, and why? The culprit turned out to be the OS kernel's OOM killer, and the victim an application whose memory appetite was larger than expected. Disabling memory overcommit bought some time, but the real fix was to shrink the application's memory footprint and pin down its resource requirements. The experience shows that even a thoroughly tested and profiled application needs its memory behaviour, and the cluster's memory settings, understood before it is deployed onto a Kubernetes cluster.
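On the overcommit point: on Linux, memory overcommit is governed by the kernel's vm.overcommit_memory setting (0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting, i.e. overcommit effectively disabled). The talk does not list the exact commands used, so purely as an illustrative sketch, the snippet below reads the current policy on a node; actually changing it requires root (for example via sysctl) and should be done with care.

    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    func main() {
        // Read the kernel's current overcommit policy from procfs.
        raw, err := os.ReadFile("/proc/sys/vm/overcommit_memory")
        if err != nil {
            fmt.Fprintln(os.Stderr, "could not read overcommit setting:", err)
            os.Exit(1)
        }

        meanings := map[string]string{
            "0": "heuristic overcommit (kernel default)",
            "1": "always overcommit",
            "2": "no overcommit (strict accounting)",
        }
        value := strings.TrimSpace(string(raw))
        fmt.Printf("vm.overcommit_memory=%s: %s\n", value, meanings[value])
    }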

Abstract

A few weeks ago, we deployed a brand new, thoroughly tested, and profiled application onto a self-managed Kubernetes cluster. Suffice it to say, all hell broke loose. The pods were getting OOMKilled with error code 137 left and right. This sparked a massive crime scene investigation, and some interesting insights were discovered. In this Kube-CSI [crime scene investigation] episode, we will talk about exactly whodunit, why, and the fix!
