Observability In ArgoCD/Rollouts Using Streaming ML For Reducing MTTR

2022-10-27

Authors:   Amit Kalamkar, Vigith Maurice


Summary

Intuit's new platform, Numaproj, uses AI-based observability to reduce change-related incidents and lower MTTR and MTTD.
  • Numaproj is a Kubernetes-native data processing and analytics tool used to derive actionable insights in areas such as operational excellence, cost, and security.
  • Innovation is a core principle at Intuit, which invests in Argo to keep its products available and to resolve issues quickly.
  • Changes were causing one-third of Intuit's incidents, and MTTR was higher because the deployment and operational experiences were disjointed.
  • Numaproj integrated AI-based observability into Argo CD and Argo Rollouts: it adds a metrics tab, runs a multivariate model, and removes humans from the equation.
  • The anomaly score is computed in real time and normalized to a human-understandable format.
  • Numaproj uses a streaming system that performs feature engineering and inference, and triggers inline training to discover new applications and configurations.
  • Real-time streaming brings challenges such as boilerplate and non-standard code, which make quick experimentation and extension difficult.
In the demo, a service developer made a backend change, and a new canary was deployed as part of that change. The AI-based observability computed the anomaly score, and an automated rollback mitigated the problem.
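
The score-then-rollback flow in the demo can be illustrated with a minimal sketch. It is not the model used in the talk: a simple z-score against the stable version's baseline stands in for the multivariate ML model, and the 0 to 10 scale, the threshold of 7, the three-sample rule, and all function names are assumptions chosen for illustration.

```python
import math
import statistics


def fit_baseline(samples):
    """Summarize the stable version's metric as (mean, standard deviation)."""
    return statistics.mean(samples), statistics.pstdev(samples)


def anomaly_score(value, mean, std, max_score=10.0):
    """Map a canary sample to a bounded score that is easy to read.

    Assumption: a z-score squashed by tanh replaces the real model output.
    """
    z = abs(value - mean) / max(std, 1e-9)  # guard against a flat baseline
    return max_score * math.tanh(z / 3.0)


def should_rollback(scores, threshold=7.0, consecutive=3):
    """Roll back only after several high scores in a row (hypothetical policy)."""
    run = 0
    for score in scores:
        run = run + 1 if score >= threshold else 0
        if run >= consecutive:
            return True
    return False


# Error rate of the stable version, then samples from the new canary.
baseline = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.012, 0.010]
canary = [0.50, 0.55, 0.60]

mean, std = fit_baseline(baseline)
scores = [anomaly_score(v, mean, std) for v in canary]
print("rollback" if should_rollback(scores) else "promote")
```

Scoring the canary against a frozen baseline keeps the anomalous samples from inflating the reference statistics, and requiring consecutive high scores avoids rolling back on a single noisy sample.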

Abstract

At Intuit, one third of P1/P2 outages are caused by a change. As Intuit runs ~2500 services on K8s, we need to quickly detect and resolve problems using AIOps. Our talk focuses on how we built a K8s-native, DAG-based stream processing platform (Numaflow) and a streaming ML platform (Numalogic), both open-sourced under Numaproj, to address this problem. We will show how we collect, process, and analyze in-cluster data in real time and how Numalogic computes anomaly scores for each deployment. This DAG-based ML platform has now been adopted by Intuit and helps our ML engineers focus on writing just the inference and pre/post-processing logic while the platform takes care of building the dynamic execution model, retries, buffering between the vertices, back-pressure, conditional forwarding, and auto-scaling. We will also show how we integrated observability into Argo CD so users can understand and remediate the behavior induced by a change, and how this is helping Intuit reduce MTTD/MTTR.
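
The division of labor described in the abstract, where engineers write only the pre-processing, inference, and post-processing steps and the platform wires them together, can be sketched in plain Python. This is not the Numaflow SDK: the runner below is a hypothetical stand-in that handles only a linear chain of vertices with conditional forwarding, and every name and threshold in it is an assumption.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Message = dict
Vertex = Callable[[Message], Optional[Message]]


def preprocess(msg: Message) -> Optional[Message]:
    """Drop malformed samples and convert latency to seconds."""
    if "latency_ms" not in msg:
        return None  # conditional forwarding: nothing goes downstream
    return {**msg, "latency_s": msg["latency_ms"] / 1000.0}


def inference(msg: Message) -> Optional[Message]:
    """Stand-in for a model call: flag latency above a hypothetical limit."""
    return {**msg, "anomalous": msg["latency_s"] > 0.5}


def postprocess(msg: Message) -> Optional[Message]:
    """Forward only anomalous samples to the sink."""
    return msg if msg["anomalous"] else None


def run_pipeline(source: Iterable[Message], vertices: List[Vertex]) -> Iterator[Message]:
    """Pass each message through the vertices in order.

    A vertex returning None stops that message. A real platform would add
    retries, buffering, back-pressure, and auto-scaling around this loop.
    """
    for msg in source:
        current: Optional[Message] = msg
        for vertex in vertices:
            current = vertex(current)
            if current is None:
                break
        else:
            yield current


events = [
    {"app": "checkout", "latency_ms": 120},
    {"app": "checkout"},  # malformed: no latency field
    {"app": "checkout", "latency_ms": 900},
]
for alert in run_pipeline(events, [preprocess, inference, postprocess]):
    print(alert)
```

Keeping each step a plain function from message to optional message means a step can be tested in isolation and swapped without touching the runner.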

Materials: