
How We are Dealing with Metrics at Scale on GitLab.com

Authors: Andrew Newdigate


Summary

The presentation discusses how the GitLab.com team scaled their monitoring system to support a rapidly growing site and improved the quality of their alerting.
  • GitLab.com's monitoring system has grown exponentially as the site has grown in size and complexity
  • The team faced problems with low-precision alerting, broken dashboards, and independently maintained configurations across their metrics stack
  • To address these problems, they developed a common monitoring strategy based on key metrics and service level indicators, broke the application down into services and components, and unified their metrics, SLI definitions, recording rules, and dashboards into a single source of truth
  • They also improved the quality of their alerting by setting service level objectives and triggering alerts when an SLI violates its SLO target
  • They settled on a combination of Prometheus and Grafana, automatically generating dashboards and alerting rules from their key-metric definitions
The team realized that their original approach to alerting generated a huge number of false positives with poor precision; going back to the drawing board led them to this catalog-driven approach, sketched below.
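
The bullet points above describe a single set of key-metric definitions driving both recording rules and SLO-based alerts. The following Python sketch illustrates that idea only; it is not GitLab's actual tooling, and the service names, metric names, labels, and thresholds are invented for the example. It turns one hypothetical catalog entry into a Prometheus rule group containing a recording rule for the SLI and an alert that fires when the SLO target is violated.

```python
# Minimal sketch: derive Prometheus recording and alerting rules from a
# single SLI/SLO catalog entry. All names and thresholds are illustrative.
import yaml  # PyYAML

# Hypothetical catalog entry: one component of one service.
CATALOG = {
    "service": "web",
    "component": "puma",
    # SLI: ratio of failed requests to all requests, as a PromQL expression.
    "error_ratio": (
        'sum(rate(http_requests_total{job="web",code=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total{job="web"}[5m]))'
    ),
    # SLO: at most 0.5% of requests may fail.
    "slo_error_ratio": 0.005,
}

def rules_from_catalog(entry):
    """Build a Prometheus rule group: a recording rule for the SLI plus an
    alert that fires when the SLI violates its SLO target."""
    record_name = f'sli:{entry["service"]}_{entry["component"]}:error_ratio_5m'
    return {
        "groups": [{
            "name": f'{entry["service"]}-{entry["component"]}-slis',
            "rules": [
                {   # Recording rule: precompute the SLI so dashboards and
                    # alerts share exactly the same definition.
                    "record": record_name,
                    "expr": entry["error_ratio"],
                    "labels": {"service": entry["service"],
                               "component": entry["component"]},
                },
                {   # Alerting rule: fire only when the SLO is violated for a
                    # sustained period, which keeps precision high.
                    "alert": "SLOErrorRateViolation",
                    "expr": f'{record_name} > {entry["slo_error_ratio"]}',
                    "for": "10m",
                    "labels": {"severity": "critical",
                               "service": entry["service"]},
                    "annotations": {
                        "summary": "{{ $labels.service }} error ratio is above its SLO target",
                    },
                },
            ],
        }]
    }

if __name__ == "__main__":
    print(yaml.safe_dump(rules_from_catalog(CATALOG), sort_keys=False))
```

In a real setup, a generator like this would run over every service/component entry in the catalog and write the resulting rule files into the Prometheus configuration, so alerts, recording rules, and dashboards all originate from the same definition.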

Abstract

As GitLab.com has grown, the number of metrics generated by the application has grown exponentially. Ensuring our team has good quality dashboards and alerting rules was becoming an ever more challenging task. There’s no worse time than experiencing an outage that you expected to have been warned of, only to find out that the alert had been inoperable for months. As an engineer on the infrastructure team supporting GitLab.com, sometimes it felt, during an incident, that we were drowning in data while at the same time struggling to access the most pertinent indicators of the underlying issue. This talk discusses how we are addressing this problem by building up a catalog of key metrics for each component within our application, and then using this definition to automatically generate beautiful Grafana dashboards, rock-solid alerting rules and high-quality SLA indicators. This talk is primarily aimed at Prometheus users, but the fundamentals could be applied to any other metrics system.
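
The abstract describes using the same catalog of key metrics to generate Grafana dashboards. As a rough, hedged sketch of that pattern (the panel layout, field names, and component list are assumptions, continuing the hypothetical catalog format from the sketch above), a small dashboard generator might look like this:

```python
# Minimal sketch: generate a Grafana dashboard JSON document from a catalog of
# components, assuming the recorded SLI names follow the
# sli:<service>_<component>:... convention used in the previous example.
import json

COMPONENTS = [
    {"service": "web", "component": "puma", "slo_error_ratio": 0.005},
    {"service": "api", "component": "workhorse", "slo_error_ratio": 0.001},
]

def panel_for(entry, index):
    """One time-series panel per component: the recorded SLI plus a constant
    line showing its SLO target."""
    sli = f'sli:{entry["service"]}_{entry["component"]}:error_ratio_5m'
    return {
        "title": f'{entry["service"]}/{entry["component"]} error ratio vs SLO',
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12 * (index % 2), "y": 8 * (index // 2)},
        "targets": [
            {"refId": "A", "expr": sli, "legendFormat": "error ratio"},
            {"refId": "B", "expr": str(entry["slo_error_ratio"]),
             "legendFormat": "SLO target"},
        ],
    }

def dashboard(components):
    return {
        "title": "Key metrics (generated)",
        "tags": ["generated"],
        "panels": [panel_for(c, i) for i, c in enumerate(components)],
        "schemaVersion": 36,
    }

if __name__ == "__main__":
    print(json.dumps(dashboard(COMPONENTS), indent=2))
```

The output is a plain dashboard JSON document that could be imported into Grafana or provisioned through its HTTP API; the point is that panels are derived from the catalog rather than hand-maintained, so they cannot drift from the alerting definitions.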
