When Prometheus Can’t Take the Load Anymore

Authors:   Liron Cohen


Summary

A comparison of tools for running Prometheus with high availability, scalability, and long-term storage
  • Riskified started with an architecture of two Prometheus servers, but it was not scalable or highly available
  • They examined three potential tools: Thanos, Cortex, and M3
  • They focused on performance, high availability, cost, and operational complexity
  • M3 was developed by the observability team at Uber and provides a highly available and centralized metrics platform
  • M3 uses a push-based model and has a distributed time series database with built-in replication
  • Advantages of M3 include data residing within the cluster, lower bandwidth costs, and a push-based model
  • M3 also offers various caching policies to support efficient queries and can manage petabytes of metrics
Riskified started with an architecture of two Prometheus servers, but it was neither scalable nor highly available. They saw gaps in the data whenever one server went down or was in the middle of a rolling update. They needed a solution for scalability, high availability, and long-term storage of data. They examined three potential tools, Thanos, Cortex, and M3, focusing on performance, high availability, cost, and operational complexity. They ultimately chose M3, which was developed by the observability team at Uber and provides a highly available, centralized metrics platform. M3 uses a push-based model and has a distributed time series database with built-in replication. Its advantages include data that stays within the cluster, lower bandwidth costs, and the push-based model itself. M3 also offers various caching policies to support efficient queries and can manage petabytes of metrics.

Abstract

Riskified started out with a pair of Prometheus servers in each of its clusters, but soon enough Prometheus couldn’t take the load anymore. When that happened, the SRE team set out to find the best tool for multi-cluster, highly available (HA), long-term Prometheus. They decided to evaluate Thanos, Cortex, and M3. In this session, Liron will share her takeaways on the different tools: which one provides the best performance and high availability, which is the most cost-effective, and which is the easiest to deploy and operate. By the end, you’ll have a better understanding of the different tools and which one is the best fit for your use case.
