Colocate Hadoop YARN with Kubernetes to Save Massive Costs on Big Data

2023-04-19

Authors:   Irvin Lim, Hailin Xiang


Summary

The presentation discusses the implementation of a colocation system for Kubernetes and YARN workloads, and the role of risk classification and release automation in operating it safely. It also emphasizes the need for comprehensive observability, achieved by capturing both low-level and high-level metrics, and the use of colocation cost as a measure of the project's effectiveness.
  • Implemented a colocation system for Kubernetes and YARN workloads
  • Importance of risk classification and release automation in software development
  • Comprehensive observability through capturing low-level and high-level metrics
  • Use of colocation cost to evaluate the effectiveness of the project
The speakers mentioned that they invested heavily in release automation for their project, with up to four stages in the release process. They also have a comprehensive risk classification system and strict release policies that all developers must adhere to, which has helped minimize human error. Additionally, they use colocation cost to evaluate the impact of colocation on a particular service, which has greatly helped them identify problems before they cause widespread issues across the whole cluster and business.

Abstract

Although containerization gives workloads more flexibility and better resource utilization than virtual machines, the resource utilization of a production Kubernetes cluster is still quite low when averaged over 24 hours, while Big Data workloads stabilize at a high utilization level. To address low resource utilization on Kubernetes clusters, the industry usually colocates online services and offline jobs in the same cluster. But ensuring that offline jobs do not affect the normal operation of online services is very tricky: offline jobs may occupy a lot of L3 cache, consume memory bandwidth, or hold critical kernel locks, increasing the error rate and latency of colocated online services. In this talk, we will share how we customize and extend the Linux kernel, container runtime, Kubernetes scheduler, and kubelet to improve resource utilization significantly while keeping online services running as normal. We will share why the default cgroup CFS and memory limits are insufficient in complicated real-world scenarios and how to overcome them. We will also cover Kubernetes' restrictions on offline job scheduling and how we work around them to save costs on purchasing computing resources for Big Data.
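To make the limitation concrete, here is a minimal sketch (not the speakers' actual setup) of what the default controls look like on a cgroup v1 host: CFS bandwidth control caps an offline job's CPU time, but it says nothing about L3 cache occupancy or memory bandwidth, which is why quota alone cannot protect colocated online services. Isolating those resources typically needs a separate mechanism such as resctrl (Intel RDT) on supported hardware. The cgroup name `offline` and the resctrl group are illustrative assumptions.

```shell
# Assumption: cgroup v1 mounted at /sys/fs/cgroup; requires root.
# Cap an "offline" Big Data cgroup at 4 CPUs of runtime per 100ms period.
mkdir -p /sys/fs/cgroup/cpu/offline
echo 100000 > /sys/fs/cgroup/cpu/offline/cpu.cfs_period_us  # 100ms period
echo 400000 > /sys/fs/cgroup/cpu/offline/cpu.cfs_quota_us   # 4 CPUs max

# CFS quota throttles CPU time only. L3 cache and memory bandwidth
# contention are addressed separately, e.g. via resctrl on CPUs with
# Intel RDT (assumes resctrl is mounted at /sys/fs/resctrl):
mkdir -p /sys/fs/resctrl/offline
echo "L3:0=0x0f" > /sys/fs/resctrl/offline/schemata  # restrict L3 ways
```

This illustrates the gap the talk describes: even with tight CFS and memory limits, an offline job inside its quota can still evict the online service's cache lines and saturate memory bandwidth, so the default knobs are insufficient on their own.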
