Smarter Golden Signals!

2023-04-21

Authors: Venkata Gunapati, Anusha Ragunathan


Abstract

As Platform Engineers and SREs, we love metrics from Kubernetes clusters because they help us understand platform health. However, we dislike drowning in alerts on every metric and experiencing alert fatigue. The worst consequence of alert fatigue is not just on-call engineer burnout, but on-call engineers snoozing alerts that could have prevented incidents. At Intuit, we needed a smarter way to get alerted on a cluster’s Golden Signals, picked from an ocean of metrics. This would help reduce the mean time to detect (MTTD) during incidents. We wanted to achieve this without the burden of instrumenting cluster components. Observability vendors provide solutions using eBPF instrumentation and AI-driven insights on Prometheus data, but we wanted to explore open source solutions that achieve the same. In this talk, we explain how we explored numalogic, an open source AIOps anomaly detection engine for Kubernetes. You will learn how to use numalogic on Prometheus metrics to derive baseline behaviors and detect anomalies, without any prior AI/ML experience. We will show how we collect, process and analyze in-cluster data in real time, and how numalogic computes an anomaly score for each component; these scores bubble up into a single anomaly score for the cluster. There will be a live demo of the AIOps-based Prometheus metrics pipeline in action.
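
To make the idea of per-component scores bubbling up to one cluster score concrete, here is a minimal sketch in Python. It is not numalogic's API and not the pipeline from the talk: it swaps in a plain z-score against a baseline window where numalogic would use a learned model, and it assumes the cluster score is the maximum of the component scores. The function names, the 0 to 10 scale, and the sample values are all hypothetical.

```python
from statistics import mean, stdev

MAX_SCORE = 10.0  # hypothetical upper bound for a score


def anomaly_score(baseline, current):
    """Score how far `current` deviates from the `baseline` samples.

    Uses a z-score capped at MAX_SCORE. This is a simple stand-in for a
    learned model, not what numalogic actually computes.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)  # needs at least two baseline samples
    if sigma == 0:
        # Flat baseline: any deviation at all counts as fully anomalous.
        return 0.0 if current == mu else MAX_SCORE
    return min(MAX_SCORE, abs(current - mu) / sigma)


def cluster_score(component_scores):
    """Roll component scores up into one number (assumed rule: take the max)."""
    return max(component_scores.values(), default=0.0)


# Hypothetical baseline windows and latest values per component.
baselines = {
    "ingress": [10, 12, 11, 9, 10, 11, 10, 12],
    "dns": [5, 5, 6, 5, 4, 5, 6, 5],
}
latest = {"ingress": 30, "dns": 5}

scores = {name: anomaly_score(baselines[name], latest[name]) for name in baselines}
print(scores, cluster_score(scores))
```

Taking the maximum means a single misbehaving component is enough to raise the cluster score; a weighted average would be a gentler alternative when some components matter more than others.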

Materials: