Operating Prometheus in a Serverless World

2022-05-19

Authors:   Colin Douch


Summary

The presentation discusses the challenges of using Prometheus as the standard for collecting and storing time series in serverless architectures and the need for a better solution.
  • Prometheus assumes that the system lives long enough to be discovered and scraped, that the service is network-enabled, and that the user can do their own aggregation.
  • Prometheus uses a pull-based model for metrics collection, so a service must stay alive long enough to be discovered and scraped: at least 5-15 seconds.
  • Exposing metrics over the network requires the service to listen on a port and run a server, and the communication must be secured with firewall rules and TLS certificates.
  • Prometheus leaves aggregation to the user, which is problematic for metrics like request counts, where the total is spread across many short-lived invocations.
  • There is a need for a better solution that can handle the challenges of serverless architectures and provide more accurate metrics.
The speaker describes how a sudden influx of serverless applications built on top of Cloudflare Workers created a need at Cloudflare for a solution that supports serverless functions, illustrating these challenges in practice.
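
To make the aggregation problem concrete, here is a minimal Python sketch of a push-and-sum aggregator for counters. It is an illustration under stated assumptions, not the solution presented in the talk: the class name `CounterAggregator`, its `push` and `render` methods, and the `requests_total` metric with a `function` label are all hypothetical. Each short-lived invocation pushes its own increment, and a long-lived process adds the increments together and renders the totals as Prometheus-style text lines that could then be scraped in the usual way.

```python
from collections import defaultdict
from typing import Dict, Tuple

# A label set stored as a sorted tuple of (name, value) pairs so it can be a dict key.
LabelSet = Tuple[Tuple[str, str], ...]


class CounterAggregator:
    """Hypothetical aggregator: sums counter increments pushed by short-lived invocations."""

    def __init__(self) -> None:
        self._totals: Dict[Tuple[str, LabelSet], float] = defaultdict(float)

    def push(self, name: str, labels: Dict[str, str], increment: float) -> None:
        """Add one invocation's increment to the running total for this series."""
        if increment < 0:
            raise ValueError("counter increments must be non-negative")
        key = (name, tuple(sorted(labels.items())))
        self._totals[key] += increment

    def render(self) -> str:
        """Render the totals as Prometheus-style text lines, one per series."""
        lines = []
        for (name, labels), value in sorted(self._totals.items()):
            if labels:
                label_text = ",".join(f'{k}="{v}"' for k, v in labels)
                lines.append(f"{name}{{{label_text}}} {value}")
            else:
                lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"


# Three separate invocations, each of which only knows about its own single request.
agg = CounterAggregator()
agg.push("requests_total", {"function": "resize"}, 1)
agg.push("requests_total", {"function": "resize"}, 1)
agg.push("requests_total", {"function": "thumb"}, 1)
print(agg.render(), end="")
```

The design choice worth noting is that `push` adds to the stored value instead of overwriting it. A store that keeps only the most recently pushed value per series would report 1 for the `resize` function here, even though two requests were served.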

Abstract

The traditional Prometheus configuration makes several assumptions about the architecture of the systems that it is monitoring, and these assumptions do not hold in the world of Serverless Architectures. With the increasing adoption of Serverless computing in Distributed Systems architectures, the question then arises of how to achieve the same insight into them that we can achieve with more traditional architectures. In particular, with Timeseries Metrics, the choice is often between substandard upstream offerings (such as the Prometheus Pushgateway) and capitulating to vendor lock-in by utilising a platform provided by your Cloud provider. So if we want to continue to use our existing Prometheus systems, what choices do we have? This talk will cover the issues around existing solutions, Colin's solution to these issues that is currently in production at Cloudflare, and where we can go in upstream to make the experience better going forward.

Materials: