Smart Green Computing Cloud Native Operations

Conference: KubeCon + CloudNativeCon North America 2022

2022-10-28

Authors: William Caban, Federico Rossi

Summary

The presentation discusses a proof of concept for goal-driven scheduling and energy optimization in a cluster environment using a policy engine, machine learning, and metrics pipelines.

The goal is to move workloads across a cluster while keeping CO2 emissions within a set target.
The solution architecture includes a governance layer with a policy engine that enforces energy-efficiency policies, an intelligent scheduler, and a metrics pipeline that feeds the system with data.
The metrics pipeline includes components such as Kepler (Kubernetes-based Efficient Power Level Exporter), Telegraf, and an XGBoost machine learning model.
The Metrics Proxy component exposes the metrics for consumption by the scheduler.
The presentation includes an anecdote about the challenges of scaling and distributing workload blocks in a Telco world.
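
The interplay between the policy engine, the scheduler, and the metrics pipeline described above can be illustrated with a minimal Python sketch. It is not the presenters' implementation: the names NodeMetrics, Workload, emissions_g_per_hour, and pick_node are hypothetical, the per-node figures stand in for values a metrics pipeline would expose, and the workload's estimated power draw stands in for what an ML model might predict. The policy is reduced to a single assumed rule: cluster-wide emissions must stay under an hourly CO2 budget.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeMetrics:
    """Hypothetical per-node snapshot, as a metrics pipeline might expose it."""
    name: str
    power_watts: float        # current power draw of the node
    carbon_intensity: float   # gCO2 per kWh for the node's energy source
    free_cpu: float           # unallocated CPU cores


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload with a predicted power cost (assumed to come from a model)."""
    name: str
    cpu: float                # CPU cores requested
    est_power_watts: float    # estimated extra power draw once scheduled


def emissions_g_per_hour(power_watts: float, carbon_intensity: float) -> float:
    """Convert a power draw in watts to grams of CO2 per hour."""
    return power_watts / 1000.0 * carbon_intensity


def pick_node(workload: Workload, nodes: List[NodeMetrics],
              budget_g_per_hour: float) -> Optional[str]:
    """Return the node that adds the least CO2 without breaking the budget.

    Returns None when no node has capacity or every placement would
    push the cluster over the assumed emissions budget.
    """
    cluster_now = sum(
        emissions_g_per_hour(n.power_watts, n.carbon_intensity) for n in nodes
    )
    best_name: Optional[str] = None
    best_added: Optional[float] = None
    for node in nodes:
        if node.free_cpu < workload.cpu:
            continue  # not enough capacity on this node
        added = emissions_g_per_hour(workload.est_power_watts, node.carbon_intensity)
        if cluster_now + added > budget_g_per_hour:
            continue  # policy check: placement would exceed the CO2 budget
        if best_added is None or added < best_added:
            best_name, best_added = node.name, added
    return best_name
```

Separating the budget check (the policy) from the lowest-added-emissions choice (the scheduling goal) mirrors the split between governance and scheduler in the summary; a real system would evaluate richer policies and refresh the metrics continuously.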

The presenter gives an example of a Telco world where a single device such as an antenna can generate around 10 gigabits per second of metrics. Scaling and distributing workload blocks in such a scenario can be challenging, and the solution needs to work for both centralized and highly distributed environments. The presenter suggests that automation can help in such situations, and the system should be designed to allow for easy experimentation and versioning.

Abstract

The European Union and the United States have set a target of at least a 50% to 55% net reduction in greenhouse gas emissions by 2030. But with the sprawl of cloud-native workloads and the increased demand for resources, are we doing enough? Many community efforts and open source projects enable observability of power consumption, from software resources to hardware resources. How can we combine the visibility provided by these tools to achieve an organization's sustainability goals? In this talk, we combine CNCF projects and tools from other open source communities to create and continuously improve Machine Learning models for cluster operations. These ML models consider a holistic view of a system: from application runtimes, node metrics, cluster metrics, and network metrics to the tracing of the interactions among the distributed components. These ML models are used for the "smart operations" of distributed systems, aligned with the organization's carbon and power optimization goals.
