
Tamland: How GitLab.com Uses Long-Term Monitoring Data for Capacity Forecasting

2022-10-28

Authors: Andrew Newdigate


Summary

This talk covers the importance of retaining long-term metric data and of using the Python data analytics ecosystem with Prometheus data, for capacity planning and other purposes.
  • Tamland is a capacity planning tool that relies on long-term metric data retention and applies the Python data analytics ecosystem to Prometheus data
  • Retaining long-term metric data is important for answering future questions and can be done with tools like Thanos, Cortex, Mimir, and TimescaleDB
  • Python libraries like prometheus-pandas, Prophet, NeuralProphet, and Greykite can be used for analyzing data and forecasting
  • Tamland, an open-source project available on GitLab, can be used for capacity planning and for other purposes such as cloud cost forecasting, security and abuse monitoring, and network monitoring

Tamland monitors about 400 different service and resource combinations and is a key input into the weekly engineering planning process. The project relies on caching to speed up fetching historical data from Thanos and is used for short-term monitoring before a service gets rolled into production. The team generates reports on anything that has at least 30 days of data; for new resources, it relies on utilization being low during that first 30-day period. The team is also considering having the tool open issues in GitLab directly as a form of automatic alerting.
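The 30-day rule mentioned above can be illustrated with a small sketch. This is not Tamland's code: the function name `reportable`, the `(service, resource)` key shape, and the daily `(date, value)` sample layout are all assumptions made for the example.

```python
from datetime import date, timedelta

# Assumed threshold, taken from the "at least 30 days of data" rule above.
MIN_HISTORY = timedelta(days=30)


def reportable(series_by_component, min_history=MIN_HISTORY):
    """Return the keys whose samples span at least `min_history`.

    `series_by_component` maps a hypothetical (service, resource) tuple
    to a list of (date, saturation_ratio) samples.
    """
    selected = []
    for key, samples in series_by_component.items():
        if not samples:
            continue  # nothing recorded yet, so nothing to forecast
        days = [day for day, _ in samples]
        if max(days) - min(days) >= min_history:
            selected.append(key)
    return sorted(selected)


# Illustrative data only: one long-lived and one newly added component.
start = date(2022, 1, 1)
history = {
    ("web", "cpu"): [(start + timedelta(days=i), 0.40) for i in range(45)],
    ("new-service", "memory"): [(start + timedelta(days=i), 0.05) for i in range(10)],
}
print(reportable(history))
```

Components that fail the check are simply left out of the report until enough history has accumulated.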

Abstract

Tamland is a capacity planning tool built by GitLab to provide long-term forecasts of potential capacity issues across the services running GitLab.com. It's built on top of the long-term metric storage capabilities of Thanos, which provides utilization and saturation metric data stretching back over a one-year period. From this data, a forecast model is constructed and used to predict growth trends across hundreds of saturation points over the coming months. This practical talk demonstrates how we capture long-term metrics data in a scalable way using Thanos, how we use Facebook's Prophet library for building forecast models, and how we integrate this with Jupyter to generate a report complete with visualizations. It discusses the benefits of switching to a data-driven and repeatable approach to capacity planning, as well as some of the practical challenges of building the tool. Tamland is an open-source project, and attendees have access to the project source if they're interested in digging deeper into our implementation.
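To make the idea of forecasting a saturation point concrete, here is a deliberately simplified sketch. Tamland itself uses Prophet; this example swaps in a plain least-squares linear trend so that it runs with the Python standard library alone. The function names, the daily sampling, and the 0.875 threshold are assumptions for illustration, not values from the project.

```python
import math


def fit_linear_trend(values):
    """Ordinary least-squares fit of y = intercept + slope * day_index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_threshold(values, threshold):
    """Days after the last sample until the fitted trend reaches `threshold`.

    Returns 0 if the trend is already at or above the threshold, and None
    if the trend is flat or falling and so never reaches it.
    """
    slope, intercept = fit_linear_trend(values)
    current = intercept + slope * (len(values) - 1)
    if current >= threshold:
        return 0
    if slope <= 0:
        return None
    return math.ceil((threshold - current) / slope)


# Illustrative daily saturation ratios that grow steadily from 0.5.
daily_saturation = [0.5 + i / 1024 for i in range(257)]
print(days_until_threshold(daily_saturation, threshold=0.875))
```

A straight line ignores seasonality and changes in trend, which is what a library such as Prophet is designed to model; the sketch only shows the shape of the question being asked of each saturation point.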

Materials: