Measuring K8s Network Performance


Author: Kornilios Kourtis


The presentation discusses the importance of tail latency and overhead metrics in performance evaluation, as well as the need for system configuration and multiple experiments to increase confidence in results. The speaker also recommends various tools and resources for performance validation.
  • Tail latency is important as scale grows
  • Consider overhead metrics such as CPU and memory utilization
  • Interpreting performance metrics can help identify bottlenecks
  • System configuration should isolate systems to avoid unwanted interference
  • Multiple experiments increase confidence in results
  • netperf, kubenetbench, and BPF tools are useful for benchmarking
  • Resources for performance validation include Brendan Gregg's books and the Linux kernel documentation
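The tail-latency point above can be made concrete with a small sketch. The latency samples below are simulated (not from the talk), but they show why a mean hides the tail and why, when a single request fans out to many backend calls, the p99 starts to dominate:

```python
# Minimal sketch: why tail latency matters at scale.
# The latency distribution here is simulated for illustration.
import random
import statistics

random.seed(42)
# Simulated per-request latencies in ms: 99% fast, 1% slow outliers.
samples = [random.gauss(1.0, 0.1) for _ in range(9900)] + \
          [random.uniform(10, 50) for _ in range(100)]

cuts = statistics.quantiles(samples, n=1000)  # cut points at 0.1% steps
p50, p99, p999 = cuts[499], cuts[989], cuts[998]
print(f"mean={statistics.mean(samples):.2f}ms  "
      f"p50={p50:.2f}ms  p99={p99:.2f}ms  p99.9={p999:.2f}ms")

# If one user request fans out to 100 backend calls, the chance that at
# least one of them lands in the slowest 1% is 1 - 0.99**100, about 63%.
print(f"P(request touches the p99 tail with 100 calls) = {1 - 0.99**100:.2f}")
```

The fan-out calculation is the usual argument for reporting high percentiles rather than averages as cluster scale grows.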
During a performance evaluation, the team discovered that native routing had reduced performance compared to tunneling, which was unexpected. They replicated the setup without Cilium and found that the performance problem was still present, leading them to discover that the issue lay in the kernel's virtual ethernet (veth) device handling rather than in Cilium itself. The team contributed modifications to the Linux kernel to address the issue.


Benchmarking is hard. Benchmarking K8s networking doubly so. Measuring the performance of K8s networking is the only reliable means for users to understand the capabilities and limitations of their, often unique, infrastructure. Furthermore, benchmarking allows for informed decisions by quantifying the tradeoffs of different stacks and investigating how performance goals can be met in the most cost-effective way. Yet it is a hard endeavor: both the software stack (from the OS to the application) and the hardware stack (from the CPU to the NIC) are extremely complicated beasts, rendering results confusing or even misleading. This talk aims to guide practitioners in properly measuring K8s network performance. Specifically, we will discuss:
  • How different workloads and metrics can be used to answer different questions
  • Setting up and executing benchmarks
  • Common pitfalls we have encountered in practice, and how to avoid them
  • Validating and interpreting results
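The advice to run multiple experiments can be sketched as a tiny harness. Here `measure_once()` is a hypothetical stand-in for a real benchmark invocation (e.g. parsing netperf output); the point is to report a mean with its spread rather than a single run:

```python
# Sketch of the "multiple experiments increase confidence" advice.
# measure_once() is a hypothetical placeholder for a real benchmark run;
# the throughput numbers below are simulated for illustration.
import random
import statistics

random.seed(7)

def measure_once() -> float:
    """Placeholder for one benchmark run, returning throughput in Mbit/s."""
    return random.gauss(9400.0, 150.0)

runs = [measure_once() for _ in range(10)]
mean = statistics.mean(runs)
stdev = statistics.stdev(runs)
print(f"throughput: {mean:.0f} +/- {stdev:.0f} Mbit/s over {len(runs)} runs")
```

A standard deviation that is large relative to the mean is itself a finding: it often signals interference from a poorly isolated system rather than a property of the network stack under test.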