Making Sense of Your Vital Signals: The Future of Pod and Containers Monitoring

Conference: KubeCon + CloudNativeCon Europe 2023

2023-04-19

Authors: Peter Hunt, David Porter

Summary

The presentation discusses the implementation of KEP-2371 (Kubernetes Enhancement Proposal 2371, "CRI Pod Container Stats"), which moves workload and container monitoring from cAdvisor to the CRI (Container Runtime Interface) implementation for better performance and a more sensible cluster. The main priority is testing, to ensure the metrics stay accurate and stable and to prevent regressions.

KEP-2371 moves workload and container monitoring from cAdvisor to the CRI in Kubernetes for better performance and a more sensible cluster
Testing is the main priority to ensure the accuracy and stability of metrics and prevent regressions
Additional testing is needed for the /metrics/cadvisor endpoint to ensure accuracy
Coverage is needed to ensure all metrics exist on the node and alerts are not broken
Performance impact is a concern, and the goal is to minimize or eliminate it
Observability helps gain insights into the application platform and debug outages or misbehaving apps
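
The coverage point above (every metric should still exist on the node so that alerts keep working) can be illustrated with a small comparison of two scrapes. The following is a minimal sketch, not the test suite discussed in the talk: it assumes both scrapes are available as Prometheus text-format strings, and the function names and sample data are hypothetical.

```python
def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The name ends at the first "{" (labels) or at the first whitespace.
        head = line.split("{", 1)[0].split()
        if head:
            names.add(head[0])
    return names


def missing_metrics(baseline_text, candidate_text):
    """Return metric names present in the baseline but absent from the candidate."""
    return sorted(metric_names(baseline_text) - metric_names(candidate_text))


# Hypothetical sample data: a scrape backed by cAdvisor (baseline) and a
# scrape backed by the container runtime (candidate).
BASELINE = """# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="app",pod="web-0"} 12.5
container_memory_working_set_bytes{container="app",pod="web-0"} 1048576
"""
CANDIDATE = """container_cpu_usage_seconds_total{container="app",pod="web-0"} 12.7
"""

print(missing_metrics(BASELINE, CANDIDATE))
```

Any name reported here would be a metric that dashboards or alert rules might still reference after the switch. A fuller check would also compare label sets and values, which this sketch does not attempt.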

The speakers note that the performance impact of moving workload and container monitoring to the CRI is a concern, and that the goal is to minimize or eliminate it. This matters because slow or inaccurate stats collection makes it harder to debug an outage or a misbehaving app, which directly affects users. It is therefore crucial to test the implementation thoroughly and confirm that any performance impact is minimal or nonexistent.

Abstract

It’s critical for users and cluster administrators to understand the health of their containers and pods and to be able to monitor them. Although health monitoring of the cluster is critical, it is still a mystery for many k8s users. How can these signals help keep clusters running or pinpoint issues before it is too late? We will go in depth to describe where those metrics originate, how they are measured, and what components are involved, to make this space less complicated. This presentation will outline the full pipeline of how these signals are collected and processed for pods and containers, starting from the cgroups in the Linux kernel and ending with Prometheus metrics and dashboards. We will also discuss future work in this space. The Kubernetes community is currently undertaking a large effort to move container metrics away from cAdvisor and into the container runtime as part of Kubernetes Enhancement 2371, “CRI Pod Container Stats”. We will discuss the goals of this effort and how it will impact the monitoring pipeline. This work will unlock new features and improve performance, helping users and cluster administrators stay in control of their deployments.
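
To make the pipeline described in the abstract more concrete, here is a rough sketch of its two ends: raw cgroup v2 file contents going in, Prometheus text-format samples coming out. It is an illustration only, not how cAdvisor or any container runtime implements collection. The helper names and sample values are hypothetical, and the mapping of memory.current to container_memory_usage_bytes is a simplifying assumption.

```python
def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat content ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def to_prometheus(pod, container, cpu_stat_text, memory_current_text):
    """Turn raw cgroup v2 file contents into Prometheus text-format sample lines."""
    cpu = parse_cpu_stat(cpu_stat_text)
    # cpu.stat reports usage_usec in microseconds; Prometheus convention is seconds.
    cpu_seconds = cpu["usage_usec"] / 1_000_000
    # memory.current holds a single integer: bytes currently charged to the cgroup.
    memory_bytes = int(memory_current_text.strip())
    labels = f'container="{container}",pod="{pod}"'
    return [
        f"container_cpu_usage_seconds_total{{{labels}}} {cpu_seconds}",
        f"container_memory_usage_bytes{{{labels}}} {memory_bytes}",
    ]


# Hypothetical file contents, as they might be read from a container's
# cgroup directory under /sys/fs/cgroup on a cgroup v2 host.
CPU_STAT = "usage_usec 12500000\nuser_usec 9000000\nsystem_usec 3500000\n"
MEMORY_CURRENT = "1048576\n"

for sample in to_prometheus("web-0", "app", CPU_STAT, MEMORY_CURRENT):
    print(sample)
```

In a real cluster the component doing this work sits between the kernel and the scrape endpoint; the effort described above is about whether that component is cAdvisor or the container runtime.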
