Measuring K8s Network Performance


Author: Kornilios Kourtis


The presentation discusses the importance of tail latency and overhead metrics in performance evaluation, as well as the need for system configuration and multiple experiments to increase confidence in results. The speaker also recommends various tools and resources for performance validation.
  • Tail latency is important as scale grows
  • Consider overhead metrics such as CPU and memory utilization
  • Interpreting performance metrics can help identify bottlenecks
  • System configuration should isolate systems to avoid unwanted interference
  • Multiple experiments increase confidence in results
  • netperf, kubenetbench, and BPF tools are useful for benchmarking
  • Resources for performance validation include Brendan Gregg's books and the Linux kernel documentation
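The tail-latency point above can be made concrete with a small sketch. The latency samples below are simulated (not from the talk), but they show why a mean hides the tail and why, when a single request fans out to many backend calls, the p99 starts to dominate:

```python
# Minimal sketch: why tail latency matters at scale.
# The latency distribution here is simulated for illustration.
import random
import statistics

random.seed(42)
# Simulated per-request latencies in ms: 99% fast, 1% slow outliers.
samples = [random.gauss(1.0, 0.1) for _ in range(9900)] + \
          [random.uniform(10, 50) for _ in range(100)]

cuts = statistics.quantiles(samples, n=1000)  # cut points at 0.1% steps
p50, p99, p999 = cuts[499], cuts[989], cuts[998]
print(f"mean={statistics.mean(samples):.2f}ms  "
      f"p50={p50:.2f}ms  p99={p99:.2f}ms  p99.9={p999:.2f}ms")

# If one user request fans out to 100 backend calls, the chance that at
# least one of them lands in the slowest 1% is 1 - 0.99**100, about 63%.
print(f"P(request touches the p99 tail with 100 calls) = {1 - 0.99**100:.2f}")
```

The fan-out calculation is the usual argument for reporting high percentiles rather than averages as cluster scale grows.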
During a performance evaluation, the team discovered that native routing had reduced performance compared to tunneling, which was unexpected. They replicated the setup without Cilium and found that the performance problem was still present, leading them to discover that the issue lay in the kernel's virtual ethernet (veth) device handling rather than in Cilium itself. The team contributed modifications to the Linux kernel to address the issue.


Benchmarking is hard. Benchmarking K8s networking doubly so. Measuring the performance of K8s networking is the only reliable means for users to understand the capabilities and limitations of their, often unique, infrastructure. Furthermore, benchmarking allows for informed decisions by quantifying the tradeoffs of different stacks and investigating how performance goals can be met in the most cost-effective way. Yet it is a hard endeavor: both the software stack (from the OS to the application) and the hardware stack (from the CPU to the NIC) are extremely complicated beasts, rendering results confusing or even misleading. This talk aims to guide practitioners in properly measuring K8s network performance. Specifically, we will discuss:
  • How different workloads and metrics can be used to answer different questions
  • Setting up and executing benchmarks
  • Common pitfalls we have encountered in practice, and how to avoid them
  • Validating and interpreting results
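The advice to run multiple experiments can be sketched as a tiny harness. Here `measure_once()` is a hypothetical stand-in for a real benchmark invocation (e.g. parsing netperf output); the point is to report a mean with its spread rather than a single run:

```python
# Sketch of the "multiple experiments increase confidence" advice.
# measure_once() is a hypothetical placeholder for a real benchmark run;
# the throughput numbers below are simulated for illustration.
import random
import statistics

random.seed(7)

def measure_once() -> float:
    """Placeholder for one benchmark run, returning throughput in Mbit/s."""
    return random.gauss(9400.0, 150.0)

runs = [measure_once() for _ in range(10)]
mean = statistics.mean(runs)
stdev = statistics.stdev(runs)
print(f"throughput: {mean:.0f} +/- {stdev:.0f} Mbit/s over {len(runs)} runs")
```

A standard deviation that is large relative to the mean is itself a finding: it often signals interference from a poorly isolated system rather than a property of the network stack under test.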