Hazardous Defaults: Managing Cardinality and Performance for Your Logging Stack

Conference: KubeCon + CloudNativeCon Europe 2023

2023-04-19

Authors: Derek Cavanaugh, Sara Moore

Summary

The presentation discusses the challenges of managing logs in a distributed system and how Loki, a log aggregation system, can help address these challenges.

Loki is a log aggregation system that can help manage logs in a distributed system
Managing logs in a distributed system can be challenging because of the sheer volume of logs and the need to tune chunk size
Query parallelization and horizontal scaling can help improve query performance and reduce costs
Monitoring and auditing cardinality is important to ensure system health
Tools like Prometheus (metrics) and Tempo (traces) face similar challenges elsewhere in the observability stack
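As a rough illustration of the cardinality auditing mentioned in the summary: in Loki, each unique combination of label names and values forms its own stream, so one label with many distinct values multiplies the stream count. The sketch below is not from the talk. It assumes the label sets have already been collected as plain Python dictionaries, and the function name `audit_cardinality` and the sample labels are hypothetical.

```python
from collections import defaultdict


def audit_cardinality(label_sets):
    """Count distinct streams and distinct values per label.

    label_sets is an iterable of dicts mapping label name -> label value.
    (Hypothetical input shape; how you collect the labels is up to you.)
    """
    streams = set()
    values_per_label = defaultdict(set)
    for labels in label_sets:
        # A stream is identified by its full set of label name/value pairs.
        streams.add(frozenset(labels.items()))
        for name, value in labels.items():
            values_per_label[name].add(value)
    per_label = {name: len(values) for name, values in values_per_label.items()}
    return len(streams), per_label


# Hypothetical sample: 'pod' takes more distinct values than 'app'.
sample = [
    {"app": "api", "pod": "api-7f9c"},
    {"app": "api", "pod": "api-2b1d"},
    {"app": "api", "pod": "api-7f9c"},
    {"app": "web", "pod": "web-55aa"},
]

stream_count, per_label = audit_cardinality(sample)
print(stream_count, per_label)
```

Sorting `per_label` by count is one simple way to spot which label is driving stream growth; what counts as too many values is a judgment call that depends on the deployment.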

The speaker notes that the engineers who use the system can also help with monitoring and spotting issues, joking, 'who needs monitoring when they can just tell you when your stuff's broken'.

Abstract

Instrumented systems generate A LOT of data and we are fortunate to have performant open-source tools that help us spelunk through all that telemetry (logs, metrics, traces). Configuring these monitoring and observability tools - so that they themselves are performant and efficient - can be a challenge. For those new to or unfamiliar with monitoring and observability, it can be appealing to just ‘roll the defaults’ from a configuration perspective. However, leaving those defaults unexamined can lead to unexpected performance issues and, worse, potential data loss. In this talk, we walk through the basic structure of the PLG stack (Promtail, Loki and Grafana). We explore some unexpected cardinality (and associated performance) impacts that arise from the default configurations and how we made thoughtful adjustments to address those impacts. Finally, we will lay out a step-by-step guide to give your logging stack some ‘love’ and ensure that you are getting the most out of your tooling.
