The presentation discusses the migration of infrastructure and application metrics from StatsD to Prometheus at DoorDash, and the challenges and lessons learned during the process.
- The migration involved over 130 services, 1,500 dashboards, and more than 7,000 alerts.
- Switching from precomputed percentiles to histograms was a difficult change for engineers to adapt to.
- The instance label is a high-cardinality label, so metrics need to be pre-aggregated across it to reduce volume.
- A Prometheus aggregation gateway was used for some metrics, but the push model was limited to special cases.
- Automating the monitoring onboarding process for teams is crucial.
- The migration was completed in one year, resulting in over 27,000 alerts and 2,200 dashboards.
- Post-migration, DoorDash ingests over 15 million metrics per second and persists over 10 million metrics per second.
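The pre-aggregation of the instance label mentioned in the list above can be illustrated with a minimal sketch. It is not DoorDash's implementation: the function name `aggregate_without`, the label names, and the sample values are all hypothetical, and the sketch only shows the general idea of summing series that differ solely by `instance`.

```python
from collections import defaultdict


def aggregate_without(samples, drop_label="instance"):
    """Sum sample values across series that differ only by `drop_label`.

    `samples` is an iterable of (labels_dict, value) pairs. This is a
    hypothetical in-memory stand-in for a pre-aggregation step.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        # Build a hashable key from every label except the dropped one.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        totals[key] += value
    return dict(totals)


# Hypothetical counter values reported by three instances of two services.
samples = [
    ({"service": "checkout", "instance": "pod-a"}, 120.0),
    ({"service": "checkout", "instance": "pod-b"}, 80.0),
    ({"service": "search", "instance": "pod-c"}, 50.0),
]

aggregated = aggregate_without(samples)
for key, value in sorted(aggregated.items()):
    print(dict(key), value)
```

Three input series collapse into two, one per service. With many instances per service, the stored series count no longer grows with the number of instances.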
One of the challenges encountered during the migration was teaching engineers about histograms and how to query them with PromQL. The team had to provide education on selecting appropriate buckets and on the advantages of accurate aggregation. The change was nonetheless necessary: percentiles computed separately on each instance cannot be combined into a correct overall percentile, whereas histogram bucket counts can be summed across instances before a quantile is estimated.
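To make the aggregation argument concrete, here is a small sketch of how cumulative histogram buckets from several instances can be summed and a quantile estimated from the merged counts. It approximates the linear interpolation that PromQL's `histogram_quantile` performs, but it is a simplified illustration rather than Prometheus's code. The bucket boundaries, counts, and function names are assumptions chosen for the example.

```python
import math


def merge_buckets(*histograms):
    """Sum cumulative bucket counts (keyed by upper bound `le`) across instances."""
    merged = {}
    for hist in histograms:
        for le, count in hist.items():
            merged[le] = merged.get(le, 0) + count
    return merged


def estimate_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative buckets.

    Assumes a +Inf bucket is present and that the lowest bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets.items())
    total = ordered[-1][1]  # count in the +Inf bucket = all observations
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in ordered:
        if count >= rank:
            if math.isinf(le):
                # Quantile falls beyond the last finite bucket boundary.
                return prev_le
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count


# Hypothetical request-latency histograms (seconds) from two instances.
instance_a = {0.1: 90, 0.5: 100, math.inf: 100}
instance_b = {0.1: 10, 0.5: 60, math.inf: 100}

merged = merge_buckets(instance_a, instance_b)
p50 = estimate_quantile(0.5, merged)
p90 = estimate_quantile(0.9, merged)
print(p50, p90)
```

In PromQL the same idea is usually written as `histogram_quantile(0.9, sum by (le) (rate(request_duration_seconds_bucket[5m])))`, where the metric name is hypothetical. The example also shows why bucket selection matters: the 0.9 quantile lands in the `+Inf` bucket, so the estimate is capped at the highest finite boundary.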