Story of Correlation: Integrating Thanos Metrics with Observability Signals

2022-05-20

Authors: Bartłomiej Płotka, Kemal Akkoyun


Summary

The talk covers the challenges of troubleshooting complex systems and introduces a tool that automates alert correlation and links each alert to the relevant data sources.
  • Troubleshooting complex systems is hard, especially when it involves multiple data sources and queries
  • The tool automates the correlation of alerts and provides links to the relevant data sources
  • It uses exemplars to narrow the data down to what is relevant and to trace an alert back to its root cause
  • The tool is experimental and requires the latest releases of several projects
  • An anecdote shows the tool being used to troubleshoot a performance issue in a request
In the anecdote, the tool used exemplars to narrow the data down to what was relevant and traced the slow request to a specific function that was consuming a lot of CPU time. The presenter fixed the issue by deleting that function.

Abstract

The CNCF Incubating Thanos project, with its large open-source community, continues to push boundaries in observability and monitoring using Prometheus-based metrics. Together with the Prometheus community, it improves the metric story for Kubernetes clusters and beyond. Improved performance, better scalability, debuggability, security, metrics backfilling and query QoS are only the tip of the iceberg. As we know, observability nowadays comes in many flavours. Bringing them together is not a trivial task, given their many shapes and collection points. Aside from metrics, we have logs, traces and even continuous profiling. In this talk, Thanos maintainers Kemal and Bartek will give a quick overview of Thanos and then explain how it can be integrated with those non-metric observability signals. The audience will learn example end-to-end ways to correlate multiple observability backends with Thanos for an enhanced observability and monitoring experience.

Materials: