Authors: Ismaël Mejía
2022-06-22

Table formats like Delta Lake and Apache Iceberg are recent storage specifications for handling slowly changing collections of files in distributed systems. They are rapidly gaining adoption by bringing new superpowers to the data engineering toolkit. In this talk, Ismaël will introduce table formats and explain how they work, and how features like versioning, schema evolution, time travel, and scalable metadata handling benefit many systems in the Data+AI ecosystem, from incremental and faster data updates to reproducible data for AI training and inference.
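
For readers who want to see what "time travel" looks like in practice, here is a minimal PySpark sketch; the delta-spark session configuration, the table path /data/events, and the pinned version and timestamp are assumptions for illustration, not details from the talk.

```python
# Minimal sketch: reading a Delta table at its current and at earlier snapshots.
# Assumes PySpark with the delta-spark package available on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-time-travel-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Current snapshot of a hypothetical table.
current = spark.read.format("delta").load("/data/events")

# Time travel: read the same table pinned to an earlier commit version...
as_of_version = (
    spark.read.format("delta")
    .option("versionAsOf", 3)
    .load("/data/events")
)

# ...or pinned to a point in time.
as_of_timestamp = (
    spark.read.format("delta")
    .option("timestampAsOf", "2022-06-01")
    .load("/data/events")
)
```

Because each commit produces a new table version backed by immutable data files, reading an older version is essentially a metadata lookup, which is also what makes reproducible training data practical.
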
Authors: Willy Lulciuc
2022-06-21

tldr - powered by Generative AI

OpenLineage is a standard for capturing metadata about data processing workflows, which can help with debugging and backfilling. It allows lineage information to be emitted through REST calls and has integrations with tools such as Airflow and Spark.
  • OpenLineage captures metadata about data processing workflows, including datasets, schemas, job inputs and outputs, and job versions.
  • This metadata can be emitted through REST calls and stored in the Marquez model, which can be queried through various APIs (a minimal client sketch follows this list).
  • OpenLineage can help with debugging by allowing quick identification of data quality issues and tracking of run states.
  • It can also aid backfilling by exposing upstream and downstream dependencies and allowing full or incremental reprocessing.
  • OpenLineage has integrations with tools such as Airflow and Spark, making it easy to incorporate into existing workflows.
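
As a rough sketch of what emitting lineage looks like with the OpenLineage Python client, the snippet below sends START and COMPLETE run events to an OpenLineage-compatible backend such as Marquez; the URL, namespace, job, and dataset names are illustrative assumptions, not details from the talk.

```python
# Minimal sketch: emitting OpenLineage START/COMPLETE run events over REST.
# Assumes the openlineage-python client and a Marquez (or other compatible)
# backend listening at the URL below; all names here are hypothetical.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

PRODUCER = "https://example.com/lineage-sketch"

client = OpenLineageClient(url="http://localhost:5000")

run = Run(runId=str(uuid4()))
job = Job(namespace="example", name="daily_orders")

# START event when the job begins...
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=PRODUCER,
))

# ...and a COMPLETE event that records the job's inputs and outputs,
# which is what lets the backend build the upstream/downstream graph.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=PRODUCER,
    inputs=[Dataset(namespace="example", name="public.orders")],
    outputs=[Dataset(namespace="example", name="public.daily_order_totals")],
))
```

Integrations such as the Airflow and Spark ones emit events like these automatically, so the snippet mainly shows the shape of the metadata that ends up in the Marquez model.
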
Conference: Transform X 2021
Authors: Andrew Ng
2021-10-07

tldr - powered by Generative AI

The presentation discusses the development of data-centric AI and provides tips for its implementation, with a focus on unstructured data.
  • Data-centric AI is becoming more widespread and systematic in its approach
  • Consistent labeling of data is crucial for learning algorithms to work effectively
  • Error analysis and engineering examples are important for structured data
  • Data augmentation and noise examples can be useful for unstructured data (a minimal sketch follows this list)
  • Focusing on subsets of data can improve performance
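
As a small illustration of the noise-augmentation idea for unstructured data, here is a minimal NumPy sketch; the noise level and the stand-in image array are assumptions for illustration, not values from the talk.

```python
# Minimal sketch: enlarging an unstructured-data training set by adding
# Gaussian-noise copies of existing examples. Values here are illustrative.
import numpy as np

def augment_with_noise(images, noise_std=0.05, seed=None):
    """Return copies of `images` with additive Gaussian noise, clipped to [0, 1]."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=noise_std, size=images.shape)
    return np.clip(images + noise, 0.0, 1.0)

# Stand-in for a small unstructured dataset (e.g. 100 grayscale 28x28 images).
train = np.random.default_rng(0).random((100, 28, 28))

# Double the training set with noisy copies; labels would be duplicated alongside.
augmented = np.concatenate([train, augment_with_noise(train, seed=1)], axis=0)
```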