The presentation discusses the importance of data curation and integration for understanding anomalies in a system. It also covers the architecture of an operational data platform and introduces Numaproj, a new project for stream processing and analytics.
- Data curation and integration are crucial in understanding anomalies in a system
- An out-of-the-box price analytics feature is available for slicing and dicing information
- Access logs are provided for developers to guide them in debugging
- The architecture of an operational data platform is discussed, which collects information from multiple layers
- Numaproj is a new project for stream processing and analytics; it is language agnostic and easy to use
- The front-end design is based on microservices and auto-instrumentation
- Streaming AIOps is performed by streaming data providers
The presentation walks through an example of how data curation and integration shorten the journey from an alert to debugging information, with access logs guiding developers along the way.
Intuit runs ~2,500 services on Kubernetes, and as one of the top SaaS companies, it treats operational excellence as a top priority. Although considerable effort and cost went into all three pillars of observability (metrics, logs, events), a gap remained in identifying customer impact and the causal service, both of which are needed to improve MTTD and MTTR for our services. The challenge is that we need to analyze the massive amounts of data generated by these sources in real time and at scale, which is a beast of a problem. Our talk focuses on using Numaproj (Intuit's open-source project) and other CNCF open-source technologies to address this problem. Numaproj includes Numaflow, a stream processing platform, and Numalogic, a collection of ML models. We will show how we collect, process, and analyze data every minute in real time and how, using Numaproj, we compute a normalized anomaly score for every data point. These anomaly scores helped us filter the noise out of the data, achieve a high signal-to-noise ratio, and create incidents directly. Today we detect incidents based on the scores (~98% confidence) generated by our AIOps platform. With this solution, Intuit has established a repeatable template for reducing MTTD from over 30 minutes to less than 3 minutes.
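
To make the idea of a normalized anomaly score concrete, here is a minimal sketch in plain Python. It is not Numalogic's API or the models used in the talk: it swaps in a simple rolling z-score squashed into a 0 to 10 range, and the function name, window size, warm-up length, scale, and incident threshold are all illustrative assumptions.

```python
import math
from collections import deque
from statistics import mean, pstdev


def anomaly_scores(values, window=30, warmup=5, max_score=10.0, scale=3.0):
    """Return one score in [0, max_score] per value (hypothetical helper).

    Each value is compared with the mean and standard deviation of the
    preceding `window` values; the absolute z-score is squashed with tanh
    so that scores from different metrics share the same range.
    """
    history = deque(maxlen=window)
    scores = []
    for value in values:
        if len(history) < warmup:
            # Not enough history yet to judge this point.
            scores.append(0.0)
        else:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0:
                z = abs(value - mu) / sigma
            else:
                # Flat history: any deviation at all is maximally surprising.
                z = 0.0 if value == mu else float("inf")
            scores.append(max_score * math.tanh(z / scale))
        history.append(value)
    return scores


# Per-minute error counts for one service (made-up data with one spike).
minute_error_counts = [5, 6, 5, 7, 6, 5, 6, 5, 40, 6]
scores = anomaly_scores(minute_error_counts)

# Assumed policy: open an incident when the score crosses a fixed threshold.
INCIDENT_THRESHOLD = 7.0
incidents = [i for i, s in enumerate(scores) if s >= INCIDENT_THRESHOLD]
print(incidents)
```

Because every metric is mapped onto the same bounded scale, a single threshold can be applied across services, which is what makes it practical to create incidents directly from the scores.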