When Prometheus Can’t Take the Load Anymore

Conference: KubeCon + CloudNativeCon Europe 2021

Authors: Liron Cohen

Summary

A comparison of tools for making Prometheus scalable and highly available, with long-term storage

Riskified started with an architecture of two Prometheus servers, but it was not scalable or highly available
They examined three potential tools: Thanos, Cortex, and M3
They focused on performance, high availability, cost, and operational complexity
M3 was developed by the observability team at Uber and provides a highly available and centralized metrics platform
M3 uses a push-based model and has a distributed time series database with built-in replication
Advantages of M3 include data residing within the cluster, lower bandwidth costs, and a push-based model
M3 also offers various caching policies to support efficient queries and can manage petabytes of metrics

Riskified started with an architecture of two Prometheus servers, but it was neither scalable nor highly available. They experienced gaps in data whenever one server went down or was undergoing a rolling update. They needed a solution that provided scalability, high availability, and long-term storage of data. They examined three candidate tools, Thanos, Cortex, and M3, and compared them on performance, high availability, cost, and operational complexity. They ultimately chose M3, which was developed by the observability team at Uber and provides a highly available, centralized metrics platform. M3 uses a push-based model and has a distributed time series database with built-in replication. Its advantages include data residing within the cluster, lower bandwidth costs, and the push-based model. M3 also offers various caching policies to support efficient queries and can manage petabytes of metrics.
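The data gaps mentioned above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Riskified's setup or any tool's actual algorithm: the 15-second scrape interval, the downtime windows, and the function names `scrape`, `gaps`, and `merge` are all hypothetical. It shows that each replica on its own has holes during a rolling update, while a view that combines samples from both replicas does not.

```python
from typing import Dict, List

Series = Dict[int, float]  # timestamp in seconds -> sample value


def scrape(start: int, end: int, step: int, down: range) -> Series:
    """Simulate one replica scraping a gauge every `step` seconds.

    Samples whose timestamp falls inside `down` are lost, as if the
    server were restarting at that moment.
    """
    return {t: float(t) for t in range(start, end, step) if t not in down}


def gaps(series: Series, start: int, end: int, step: int) -> List[int]:
    """Return the expected scrape timestamps that have no sample."""
    return [t for t in range(start, end, step) if t not in series]


def merge(a: Series, b: Series) -> Series:
    """Combine two replicas, keeping one sample per timestamp."""
    merged = dict(a)
    for t, v in b.items():
        merged.setdefault(t, v)
    return merged


# Hypothetical rolling update: replica A is down for 30-59s,
# then replica B is down for 60-89s.
replica_a = scrape(0, 120, 15, down=range(30, 60))
replica_b = scrape(0, 120, 15, down=range(60, 90))

gaps_a = gaps(replica_a, 0, 120, 15)
gaps_b = gaps(replica_b, 0, 120, 15)
gaps_merged = gaps(merge(replica_a, replica_b), 0, 120, 15)

print("gaps in A:", gaps_a)
print("gaps in B:", gaps_b)
print("gaps after merging:", gaps_merged)
```

A dashboard that queries only one of the two servers sees that server's gaps, which is why a layer that replicates or deduplicates samples across servers matters for high availability.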

Abstract

Riskified started with a pair of Prometheus servers in each of its clusters, but soon enough, Prometheus couldn’t take the load anymore. When that happened, the SRE team set out to find the best tool for Multi, HA, long-term Prometheus. They decided to evaluate Thanos, Cortex, and M3. In this session, Liron will share her takeaways on the different tools: which one provides the best performance and high availability, which is the most cost-effective, and which is the easiest to deploy and operate. By the end, you’ll have a better understanding of the different tools and which one is the best solution for your use case.

Materials:

Slides
