DoorDash’s Journey From StatsD To Prometheus With 10 Million Metrics/Second

Conference: KubeCon + CloudNativeCon North America 2022

2022-10-28

Authors: Benjamin Raskin, Emma Wang

Summary

The presentation covers DoorDash's migration of infrastructure and application metrics from StatsD to Prometheus, along with the challenges encountered and lessons learned during the process.

The migration involved over 130 services, 1,500 dashboards, and more than 7,000 alerts.
Moving from percentiles to histograms was a difficult change for engineers to adapt to.
The instance label is a high-cardinality label, so metrics need to be pre-aggregated across instances to reduce volume.
The Prometheus aggregation gateway was used for some metrics, but the push model was limited to special cases.
Automating the monitoring onboarding process for teams is crucial.
The migration was completed in one year, resulting in over 27,000 alerts and 2,200 dashboards.
Post-migration, DoorDash ingests over 15 million metrics per second and persists over 10 million metrics per second.
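
To illustrate the point about the instance label, here is a minimal Python sketch of what pre-aggregating across instances means: the `instance` label is dropped and the series that become identical are summed. The sample data, the label names other than `instance`, and the helper `aggregate_without` are hypothetical and are not taken from the talk. In PromQL the same idea is usually expressed as `sum without (instance) (...)`.

```python
from collections import defaultdict


def aggregate_without(samples, drop_label="instance"):
    """Sum sample values across every series that differs only in drop_label."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The remaining labels, sorted, identify the aggregated series.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        totals[key] += value
    return dict(totals)


# Hypothetical request counts reported by two pods of the same service.
samples = [
    ({"service": "checkout", "status": "200", "instance": "pod-a"}, 120.0),
    ({"service": "checkout", "status": "200", "instance": "pod-b"}, 80.0),
    ({"service": "checkout", "status": "500", "instance": "pod-a"}, 3.0),
]

aggregated = aggregate_without(samples)
for key, value in sorted(aggregated.items()):
    print(dict(key), value)
```

Three per-instance series collapse into two aggregated series here. With many instances per service, the reduction in stored series grows roughly in proportion to the instance count.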

One of the main challenges during the migration was teaching engineers how histograms work and how to query them with PromQL. The team had to explain how to choose appropriate bucket boundaries and why the change was worth the effort: unlike precomputed percentiles, histogram buckets can be aggregated accurately across instances.
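
The difference between aggregating percentiles and aggregating histograms can be shown with a small Python sketch. It is only an illustration under stated assumptions: the bucket boundaries, the latency data, and the helper names are hypothetical, and `quantile` is a simplified linear interpolation in the spirit of PromQL's `histogram_quantile`, not a copy of its implementation.

```python
import math

# Hypothetical bucket upper bounds in seconds; the last bucket is +Inf.
BUCKETS = [0.1, 0.5, 1.0, math.inf]


def to_histogram(latencies):
    """Return cumulative bucket counts, the way a Prometheus histogram exposes them."""
    return [sum(1 for v in latencies if v <= le) for le in BUCKETS]


def merge(*histograms):
    """Aggregate histograms from several instances by summing bucket counts."""
    return [sum(counts) for counts in zip(*histograms)]


def quantile(q, cumulative):
    """Estimate the q-quantile (0 < q <= 1) by interpolating inside a bucket."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            if math.isinf(BUCKETS[i]):
                # No upper bound to interpolate to: use the highest finite bound.
                return BUCKETS[i - 1]
            lower = BUCKETS[i - 1] if i > 0 else 0.0
            prev = cumulative[i - 1] if i > 0 else 0
            return lower + (BUCKETS[i] - lower) * (rank - prev) / (count - prev)


# Hypothetical traffic: a busy, fast instance and a quiet, slow one.
fast = to_histogram([0.05] * 90)
slow = to_histogram([0.8] * 10)

averaged_p50 = (quantile(0.5, fast) + quantile(0.5, slow)) / 2
merged_p50 = quantile(0.5, merge(fast, slow))

print(f"average of per-instance medians: {averaged_p50:.3f}s")
print(f"median of merged histogram:      {merged_p50:.3f}s")
```

Averaging the two per-instance medians gives roughly 0.4 s, even though 90% of all requests finished within 0.1 s. Summing the bucket counts first and then estimating the quantile gives a value below 0.1 s, which reflects the fleet as a whole. The usual PromQL form of this is `histogram_quantile` applied to `sum by (le)` over a `rate` of the `_bucket` series.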

Abstract

Prometheus and PromQL are widely adopted, and an increasing number of engineering teams are either migrating to Prometheus metrics or using the Prometheus client libraries from day one. Migrations are difficult and, at scale, require significant engineering effort. This high barrier can deter organizations and place roadblocks on the way to becoming wholly instrumented with Prometheus metrics. Organizations also face challenges moving certain use cases from a metrics push model to Prometheus, such as exposing metrics from CI and CD, batch jobs, and short-running tasks. Using histograms efficiently at scale, across many teams that all use similar RPC libraries, while migrating from a primarily percentile-driven set of latency metrics that used to be aggregated centrally, can be challenging without the right guidance for developers. Emma and Ben will provide best practices for migrating to Prometheus and share lessons and challenges from DoorDash’s migration journey from StatsD to Prometheus.
