Automating Airflow Backfills with Marquez

Conference: OpenAI + Data Forum 2022

2022-06-21

Authors: Willy Lulciuc

Summary

OpenLineage is an open standard for capturing metadata about data processing workflows, which helps with debugging and backfilling. Lineage events are emitted through REST calls, and integrations exist for tools such as Airflow and Spark.

OpenLineage captures metadata about data processing workflows, including datasets and their schemas, job inputs and outputs, and job versions.
This metadata is emitted through REST calls and stored in Marquez, where it can be queried through the Marquez APIs.
OpenLineage helps with debugging by making it quick to identify data quality issues and to track run states.
It also aids backfilling by exposing upstream and downstream dependencies, which supports both full and incremental reprocessing.
OpenLineage has integrations with tools such as Airflow and Spark, so it is easy to incorporate into existing workflows.
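
To make the "emitted through REST calls" point concrete, here is a minimal sketch of building an OpenLineage run event and posting it to Marquez. It assumes a local Marquez instance at http://localhost:5000 that accepts events on /api/v1/lineage; the namespace, job and dataset names, producer URL, and the spec version in schemaURL are illustrative assumptions, not values from the talk. In practice the Airflow and Spark integrations emit these events for you.

```python
import json
from datetime import datetime, timezone
from urllib import request

# Assumption: a local Marquez instance; adjust the host and port for your deployment.
MARQUEZ_LINEAGE_URL = "http://localhost:5000/api/v1/lineage"


def build_run_event(event_type, run_id, job_name, inputs, outputs, namespace="example"):
    """Build a minimal OpenLineage run event as a plain dict."""
    return {
        "eventType": event_type,  # run state, e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical producer URI identifying the emitting code.
        "producer": "https://example.com/hypothetical-producer",
        # Assumption: the spec version in this URL may differ in your setup.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


def emit(event, url=MARQUEZ_LINEAGE_URL):
    """POST the event as JSON and return the HTTP status code."""
    req = request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status
```

A call such as emit(build_run_event("COMPLETE", run_id, "daily_signups", ["clean.signups"], ["agg.daily_signups"])) would record one completed run together with its input and output datasets, where run_id is a UUID string shared by all events of the same run.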

OpenLineage is particularly helpful for tracking down data quality issues. For example, if a dashboard suddenly shows an unexpected increase in sign-ups or room bookings, the cause could be a bug in the code or a problem with an input dataset. With OpenLineage, you can quickly identify which job versions produced and consumed the dataset in question, making the issue easier to troubleshoot and resolve.
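
Once the faulty dataset is known, the same lineage information tells you which jobs need to be rerun and in what order. The sketch below is one way to do that, not the method from the talk: it takes a hypothetical in-memory mapping of jobs to their input and output datasets (the kind of information lineage metadata provides), collects every job downstream of the bad dataset, and orders them with a topological sort so that each job runs after the jobs it depends on. All job and dataset names are made up for illustration.

```python
from collections import defaultdict, deque


def downstream_backfill_order(job_io, bad_dataset):
    """Return the jobs affected by bad_dataset, upstream jobs first.

    job_io maps a job name to a pair (input datasets, output datasets).
    """
    # Index: dataset -> jobs that consume it.
    consumers = defaultdict(list)
    for job, (inputs, _outputs) in job_io.items():
        for dataset in inputs:
            consumers[dataset].append(job)

    # Step 1: breadth-first walk dataset -> job -> dataset to find affected jobs.
    affected = set()
    seen = {bad_dataset}
    queue = deque([bad_dataset])
    while queue:
        dataset = queue.popleft()
        for job in consumers[dataset]:
            if job in affected:
                continue
            affected.add(job)
            for out in job_io[job][1]:
                if out not in seen:
                    seen.add(out)
                    queue.append(out)

    # Step 2: Kahn's algorithm over the job-to-job edges among affected jobs.
    edges = defaultdict(set)
    indegree = {job: 0 for job in affected}
    for job in affected:
        for out in job_io[job][1]:
            for nxt in consumers[out]:
                if nxt in affected and nxt != job and nxt not in edges[job]:
                    edges[job].add(nxt)
                    indegree[nxt] += 1

    ready = deque(sorted(job for job, deg in indegree.items() if deg == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in sorted(edges[job]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(affected):
        raise ValueError("lineage graph contains a cycle")
    return order


# Hypothetical lineage: job -> (inputs, outputs).
job_io = {
    "load_signups": (["raw.signups"], ["clean.signups"]),
    "daily_signups": (["clean.signups"], ["agg.daily_signups"]),
    "signup_dashboard": (["agg.daily_signups", "clean.signups"], ["dash.signups"]),
    "load_bookings": (["raw.bookings"], ["clean.bookings"]),
}
print(downstream_backfill_order(job_io, "raw.signups"))
# → ['load_signups', 'daily_signups', 'signup_dashboard']
```

The unrelated load_bookings job is left out, which is the difference between a targeted backfill and rerunning everything.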

Abstract

As a data engineer, backfilling data is an important part of your day-to-day work. But backfilling interdependent DAGs is time-consuming and often an unpleasant experience. For example, let's say you were tasked with backfilling a few months' worth of data. You’re given the start and end dates for the backfill, which you will pass to an ad-hoc backfilling script that you have painstakingly crafted locally on your machine. As you sip your morning coffee, you kick off the backfilling script, hoping it’ll work, and think to yourself, there must be a better way. Yes, there is, and collecting DAG lineage metadata would be a great start! In this talk, Willy Lulciuc will briefly introduce you to how backfills are handled in Airflow, then discuss how DAG lineage metadata stored in Marquez can be used to automate backfilling DAGs with complex upstream and downstream dependencies.

Materials:

Tags: