Stateless Collectors For Stateful Data: Scaling Prometheus As a Node Agent

2022-10-28

Authors: Danny Clark


Summary

The presentation discusses the challenges of scaling Prometheus and proposes a solution: a managed service that runs Prometheus as a node agent.
  • Scaling Prometheus is hard because of problems with data aggregation and network failures
  • Existing solutions such as federation, remote read, and Thanos require manual maintenance and expertise
  • A managed service that runs Prometheus as a node agent can mitigate scaling issues and separate state concerns from query concerns
  • The service forwards metrics data to a remote backend and uses Kubernetes custom resources and a DaemonSet to achieve this setup
  • Google's Monarch provides the capacity needed to offer a PromQL-compatible API and long-term retention of metrics
The speaker also mentions a customer who found success by adopting Thanos, one of the existing solutions for scaling Prometheus.

Abstract

prometheus-operator is the de facto standard for running Prometheus on Kubernetes. Yet its configuration can be complicated and baroque, making it hard to know what is being scraped or to properly enforce RBAC. Scaling also requires careful thought. However, there are an increasing number of ways to run Prometheus in a “stateless” mode. How can we adopt these to solve the problems above? This talk introduces an alternative, operator-based approach for running stateless Prometheus instances on Kubernetes by leveraging Prometheus as a node agent. This prompted rethinking how Prometheus configuration is done today, and led to new, simpler, and more opinionated CRDs. We will discuss trade-offs in the new configuration model and the challenges of running a fleet of node-agent Prometheuses at scale. The hope is that this lowers the barrier to entry for managing Prometheus infrastructure, while still supporting features and access controls for enterprise users.
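
As a rough illustration of the node-agent idea, the Python sketch below shows how a collector running once per node (for example as a DaemonSet pod) might keep only the scrape targets scheduled on its own node, so that scrape load is spread across the fleet. This is not the talk's implementation: the `Target` type, the `targets_for_node` and `shard_by_node` helpers, and the sample data are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A discovered scrape target (hypothetical shape, for illustration only)."""
    pod: str
    node: str
    address: str


def targets_for_node(targets, node_name):
    """Keep only the targets scheduled on the given node.

    A node agent would apply a filter like this so that each instance
    scrapes its local workloads and nothing else.
    """
    return [t for t in targets if t.node == node_name]


def shard_by_node(targets):
    """Group all targets by node, showing how the scrape load is split."""
    shards = {}
    for t in targets:
        shards.setdefault(t.node, []).append(t)
    return shards


# Hypothetical cluster-wide discovery result.
discovered = [
    Target("api-0", "node-a", "10.0.0.1:8080"),
    Target("api-1", "node-b", "10.0.1.1:8080"),
    Target("db-0", "node-a", "10.0.0.2:9187"),
]

# The agent on node-a scrapes only its two local pods.
local = targets_for_node(discovered, "node-a")
print([t.pod for t in local])
```

Because each agent's target set is bounded by what runs on its node, adding nodes adds collectors in step with the workload. In this model the agents would forward what they scrape to the remote backend rather than act as the long-term store.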
