Cortex: How to Run a Rock Solid Multi-Tenant Prometheus

2023-04-21

Authors: Friedrich Gonzalez, Alan Protasio


Summary

The presentation covers the reliability features of Cortex, a project built on Prometheus and designed to run on Kubernetes.
  • Cortex is designed for Kubernetes and builds on Prometheus rather than replacing it
  • Cortex reuses Thanos components and enforces limits to keep the service reliable
  • Cortex implements zone-aware replication so that data is replicated across instances in different zones
  • Upcoming work for Cortex includes a gateway, downsampling, federated rules, and native histograms
  • There are plans to improve observability of cardinality at the Cortex layer
The presenter notes that Cortex was designed for Kubernetes a long time ago. They stress the importance of talking to users to understand how they see the service, and of knowing whether users can actually do what they want with it. A slide shows different tiers for different users, and the presenter explains that limits are set per tenant to ensure reliability. The talk closes with upcoming work, such as the gateway and downsampling, and with plans to improve cardinality observability at the Cortex layer.

Abstract

Cortex is a CNCF open-source project that provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus. Friedrich will first introduce Cortex's current architecture and project status. The core of the talk covers the resilience strategies and features in Cortex that prevent or reduce failures so that metrics keep flowing, including which of them were added recently and how operators can use all of them in 2023. The first important feature is the hash ring with a replication factor, which ensures that crashing processes can be tolerated. Zone-aware replication helps tolerate zone outages. No less important are tenant limits, which help control cost and usage for specific tenants. Instance limits prevent single processes from getting overloaded. Finally, shuffle sharding reduces the blast radius of an outage.
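
To make the ideas in the abstract concrete, the following is a minimal sketch of a hash ring with a replication factor, zone-aware replica selection, and shuffle sharding. It is an illustration under stated assumptions, not Cortex's actual implementation: the names `token`, `Ring`, and `shuffle_shard`, the ingester and zone names, and the choice of 64 tokens per instance are all hypothetical.

```python
import bisect
import hashlib
import random
from typing import Dict, List


def token(key: str) -> int:
    """Map a string to a 32-bit position on the ring (hypothetical hash choice)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class Ring:
    """Toy hash ring: each instance owns several tokens spread around the ring."""

    def __init__(self, instances: Dict[str, str], tokens_per_instance: int = 64):
        # instances maps instance name -> availability zone
        self.zones = dict(instances)
        self.entries = sorted(
            (token(f"{name}-{i}"), name)
            for name in instances
            for i in range(tokens_per_instance)
        )
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, series_key: str, replication_factor: int = 3,
                 zone_aware: bool = True) -> List[str]:
        """Walk clockwise from the key's token and pick distinct instances.

        With zone_aware=True, at most one replica is taken per zone, so the
        result can be shorter than replication_factor if zones are scarce.
        """
        start = bisect.bisect_right(self.tokens, token(series_key))
        chosen: List[str] = []
        used_zones = set()
        for step in range(len(self.entries)):
            _, name = self.entries[(start + step) % len(self.entries)]
            zone = self.zones[name]
            if name in chosen or (zone_aware and zone in used_zones):
                continue
            chosen.append(name)
            used_zones.add(zone)
            if len(chosen) == replication_factor:
                break
        return chosen


def shuffle_shard(tenant: str, instances: Dict[str, str],
                  shard_size: int) -> Dict[str, str]:
    """Pick a stable, tenant-specific subset of instances, balanced across zones."""
    rng = random.Random(token(tenant))  # same tenant -> same shard
    by_zone: Dict[str, List[str]] = {}
    for name in sorted(instances):
        by_zone.setdefault(instances[name], []).append(name)
    per_zone = max(1, shard_size // len(by_zone))
    shard: Dict[str, str] = {}
    for zone in sorted(by_zone):
        members = by_zone[zone]
        for name in rng.sample(members, k=min(per_zone, len(members))):
            shard[name] = zone
    return shard


# Usage: nine hypothetical ingesters spread over three zones.
instances = {f"ingester-{i}": f"zone-{i % 3}" for i in range(9)}
ring = Ring(instances)
print(ring.replicas("tenant-a/http_requests_total"))

# A tenant's writes only touch its own shard, which limits the blast radius.
shard = shuffle_shard("tenant-a", instances, shard_size=3)
print(Ring(shard).replicas("tenant-a/http_requests_total"))
```

In this sketch, losing one process still leaves two replicas of each series, and losing one zone leaves two zones holding the data. Because each tenant is mapped to a small, deterministic subset of instances, a misbehaving tenant can only affect the instances in its own shard.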

Materials: