The presentation discusses an investigation into a Kubernetes cluster where pods were being OOMKilled with exit code 137, and the steps taken to identify and mitigate the issue.
- The investigation began when a new application was deployed onto a self-managed Kubernetes cluster and its pods started getting OOMKilled with exit code 137.
- The investigation identified that the process was being killed repeatedly, potentially because it was a memory hog.
- The investigation found that the process was being killed by the OS kernel, and that disabling memory overcommitment was only a temporary solution.
- The actual fix was to reduce the memory footprint of the application and to set resource requirements on the pod so that it receives a guaranteed quality of service (QoS).
- The presentation also discussed the different levels of the container runtime and the role Kubernetes plays in managing container processes across multiple hosts.
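
Two points from the summary can be illustrated with a minimal Python sketch: why exit code 137 points to a kill signal, and how a pod's resource requirements determine its QoS class. This is not code from the presentation. The function names `describe_exit_code` and `qos_class` are hypothetical, the dictionaries only mimic the `resources.requests` and `resources.limits` fields of a pod spec, and quantities are compared as literal strings rather than parsed (so `"1Gi"` and `"1024Mi"` would be treated as different).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the shell convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by {name} (signal {signum})"
    return f"exited with status {code}"


def qos_class(containers: list) -> str:
    """Approximate the Kubernetes QoS class for a list of container specs.

    Simplified assumption: only cpu and memory are considered, and
    quantities are compared as plain strings.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # An omitted request defaults to the limit.
            request = requests.get(name, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    # 137 - 128 = 9, which is SIGKILL on Linux.
    print(describe_exit_code(137))

    # Hypothetical container spec with requests equal to limits.
    pod = [{
        "resources": {
            "requests": {"cpu": "500m", "memory": "512Mi"},
            "limits": {"cpu": "500m", "memory": "512Mi"},
        }
    }]
    print(qos_class(pod))
```

Under these assumptions, setting requests equal to limits for both CPU and memory on every container yields the Guaranteed class, which is the kind of resource guarantee the fix above refers to. A pod with no requests or limits at all is BestEffort, and anything in between is Burstable.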