Real-Time Analytics: Going Beyond Stream Processing with Apache Pinot

Conference: OpenAI + Data Forum 2022

2022-06-21

Authors: Karin Wolok, Rong Rong

Summary

The presentation shows how to build a dashboard with Kafka and Pinot that analyzes Wikipedia data in real time.

Building a dashboard using Kafka and Pinot to analyze Wikipedia data in real time
Using Streamlit to create metrics and charts for the dashboard
Demonstrating how to drill down and analyze data using the dashboard
Auto-refreshing the dashboard to display real-time data
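
The metric and auto-refresh items above could look roughly like the following sketch. It assumes Streamlit's `st.metric` and `st.rerun` (older releases used `st.experimental_rerun`). The `fetch_counts` callback, the metric label, and the 5 second interval are hypothetical and not taken from the talk.

```python
import time


def metric_delta(current: int, previous: int) -> int:
    """Change between the latest time window and the one before it."""
    return current - previous


def render_dashboard(fetch_counts, refresh_seconds: float = 5.0) -> None:
    """Draw one metric, wait, then ask Streamlit to rerun the script.

    fetch_counts is a hypothetical callback returning
    (changes_in_latest_window, changes_in_previous_window).
    """
    import streamlit as st  # imported lazily so metric_delta works without it

    current, previous = fetch_counts()
    st.metric("Changes", current, delta=metric_delta(current, previous))

    time.sleep(refresh_seconds)
    st.rerun()  # st.experimental_rerun() on older Streamlit releases
```

Rerunning the whole script after a short sleep is the simplest way to refresh; the actual demo may have used a different mechanism.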

The presenter shows a live demo of the dashboard, which displays real-time data on Wikipedia changes and user activity. They demonstrate how to drill down and analyze the data, including identifying the top users and bots making changes. The dashboard is auto-refreshed every few seconds to display the latest data.
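
A drill-down such as "top users" or "top bots" might be expressed as a Pinot SQL query like the one built below. This is a minimal sketch under assumptions: the table name `wikievents` and the columns `user`, `bot`, and `ts` are hypothetical, and the connection settings are the defaults shown in the `pinotdb` client's documentation for a local broker.

```python
def top_users_sql(table: str = "wikievents", bots_only: bool = False,
                  window: str = "PT10M", limit: int = 10) -> str:
    """Build a query for the most active editors in a recent time window.

    Table and column names are assumptions, not taken from the talk.
    "user" is double-quoted because it is a reserved word in SQL.
    """
    conditions = [f"ts > ago('{window}')"]
    if bots_only:
        conditions.append("bot = true")
    where = " AND ".join(conditions)
    return (
        f'SELECT "user", COUNT(*) AS changes '
        f"FROM {table} "
        f"WHERE {where} "
        f'GROUP BY "user" '
        f"ORDER BY COUNT(*) DESC "
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str, host: str = "localhost", port: int = 8099):
    """Send the query to a Pinot broker using the pinotdb DB-API client."""
    from pinotdb import connect  # imported lazily; needs `pip install pinotdb`

    conn = connect(host=host, port=port, path="/query/sql", scheme="http")
    cursor = conn.cursor()
    cursor.execute(sql)
    return [row for row in cursor]
```

For example, `run_query(top_users_sql(bots_only=True))` would return the bots with the most changes in the last ten minutes, provided a table with that shape exists. The string interpolation here is only suitable for trusted, hard-coded arguments.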

Abstract

Apache Kafka forms the backbone of the modern data pipeline, and its stream processing capabilities provide insights on events as they arrive. But what if we want to go further and execute analytical queries on this real-time data? The OLAP databases used for analytical workloads traditionally executed queries on yesterday's data, with query latency in the tens of seconds. The emergence of real-time analytics has changed all this, and the expectation is that we should now be able to run thousands of queries per second on fresh data, with query latencies typically seen on OLTP databases. This is where Apache Pinot comes into the picture. Apache Pinot is a real-time distributed OLAP datastore used to deliver scalable real-time analytics with low latency. It can ingest data from streaming sources like Kafka, as well as from batch data sources (S3, HDFS, Azure Data Lake, Google Cloud Storage), and provides a layer of indexing techniques that can be used to maximize the performance of queries. Come to this talk to learn how you can add real-time analytics capability to your data pipeline.
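
To make the Kafka ingestion and indexing description concrete, the sketch below assembles a real-time table config as a Python dictionary that can be serialized to JSON. The key names follow the Pinot table config format as commonly documented and may differ between Pinot versions. The table, topic, broker, and column values are placeholders, and the inverted index is just one example of the available index types.

```python
import json


def realtime_table_config(table: str, topic: str, brokers: str,
                          time_column: str, inverted_index_columns=()) -> dict:
    """Assemble a Pinot REALTIME table config that consumes JSON from Kafka.

    All argument values are supplied by the caller; nothing here is taken
    from the talk itself.
    """
    return {
        "tableName": table,
        "tableType": "REALTIME",
        "segmentsConfig": {
            "timeColumnName": time_column,
            "schemaName": table,
            "replicasPerPartition": "1",
        },
        "tenants": {},
        "tableIndexConfig": {
            # Columns that should get an inverted index for fast filtering.
            "invertedIndexColumns": list(inverted_index_columns),
            "streamConfigs": {
                "streamType": "kafka",
                "stream.kafka.topic.name": topic,
                "stream.kafka.broker.list": brokers,
                "stream.kafka.consumer.type": "lowlevel",
                "stream.kafka.consumer.factory.class.name":
                    "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
                "stream.kafka.decoder.class.name":
                    "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
            },
        },
        "metadata": {},
    }


config = realtime_table_config(
    table="wikievents",        # placeholder table name
    topic="wiki_events",       # placeholder Kafka topic
    brokers="localhost:9092",  # placeholder Kafka broker address
    time_column="ts",
    inverted_index_columns=["user"],
)
config_json = json.dumps(config, indent=2)
```

The resulting JSON would be submitted to the Pinot controller together with a matching schema; the exact submission step depends on the deployment.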
