Authors: Ismaël Mejía
2022-06-22

Table formats like Delta Lake and Apache Iceberg are recent storage specifications for handling slowly changing collections of files in distributed systems. They are rapidly gaining adoption by bringing new superpowers to the data engineering toolkit. In this talk, Ismaël will introduce table formats and explain how they work, and how features like versioning, schema evolution, time travel, and scalable metadata handling benefit many systems in the Data+AI ecosystem, from incremental and faster data updates to reproducible data for AI training and inference.
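
For readers who want to see what "time travel" looks like in practice, here is a minimal PySpark sketch; the delta-spark session configuration, the table path /data/events, and the pinned version and timestamp are assumptions for illustration, not details from the talk.

```python
# Minimal sketch: reading a Delta table at its current and at earlier snapshots.
# Assumes PySpark with the delta-spark package available on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-time-travel-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Current snapshot of a hypothetical table.
current = spark.read.format("delta").load("/data/events")

# Time travel: read the same table pinned to an earlier commit version...
as_of_version = (
    spark.read.format("delta")
    .option("versionAsOf", 3)
    .load("/data/events")
)

# ...or pinned to a point in time.
as_of_timestamp = (
    spark.read.format("delta")
    .option("timestampAsOf", "2022-06-01")
    .load("/data/events")
)
```

Because each commit produces a new table version backed by immutable data files, reading an older version is essentially a metadata lookup, which is also what makes reproducible training data practical.
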
Authors: Willy Lulciuc
2022-06-21

tldr - powered by Generative AI

OpenLineage is a standard for capturing metadata about data processing workflows, which can help with debugging and backfilling. It allows lineage information to be emitted through REST calls and has integrations with tools such as Airflow and Spark.
  • OpenLineage captures metadata about data processing workflows, including datasets, schemas, job inputs and outputs, and job versions.
  • This metadata can be emitted through REST calls and stored in the Marquez model, which can be queried through various APIs (a minimal client sketch follows this list).
  • OpenLineage can help with debugging by allowing quick identification of data quality issues and tracking of run states.
  • It can also aid backfilling by exposing upstream and downstream dependencies and allowing full or incremental reprocessing.
  • OpenLineage has integrations with tools such as Airflow and Spark, making it easy to incorporate into existing workflows.
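
As a rough sketch of what emitting lineage looks like with the OpenLineage Python client, the snippet below sends START and COMPLETE run events to an OpenLineage-compatible backend such as Marquez; the URL, namespace, job, and dataset names are illustrative assumptions, not details from the talk.

```python
# Minimal sketch: emitting OpenLineage START/COMPLETE run events over REST.
# Assumes the openlineage-python client and a Marquez (or other compatible)
# backend listening at the URL below; all names here are hypothetical.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

PRODUCER = "https://example.com/lineage-sketch"

client = OpenLineageClient(url="http://localhost:5000")

run = Run(runId=str(uuid4()))
job = Job(namespace="example", name="daily_orders")

# START event when the job begins...
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=PRODUCER,
))

# ...and a COMPLETE event that records the job's inputs and outputs,
# which is what lets the backend build the upstream/downstream graph.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=PRODUCER,
    inputs=[Dataset(namespace="example", name="public.orders")],
    outputs=[Dataset(namespace="example", name="public.daily_order_totals")],
))
```

Integrations such as the Airflow and Spark ones emit events like these automatically, so the snippet mainly shows the shape of the metadata that ends up in the Marquez model.
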
Conference: Transform X 2021
Authors: Andrew Ng
2021-10-07

tldr - powered by Generative AI

The presentation discusses the development of data-centric AI and provides tips for its implementation, with a focus on unstructured data.
  • Data-centric AI is becoming more widespread and systematic in its approach
  • Consistent labeling of data is crucial for learning algorithms to work effectively
  • Error analysis and engineering examples are important for structured data
  • Data augmentation and noise examples can be useful for unstructured data (a minimal sketch follows this list)
  • Focusing on subsets of data can improve performance
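
As a small illustration of the noise-augmentation idea for unstructured data, here is a minimal NumPy sketch; the noise level and the stand-in image array are assumptions for illustration, not values from the talk.

```python
# Minimal sketch: enlarging an unstructured-data training set by adding
# Gaussian-noise copies of existing examples. Values here are illustrative.
import numpy as np

def augment_with_noise(images, noise_std=0.05, seed=None):
    """Return copies of `images` with additive Gaussian noise, clipped to [0, 1]."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=noise_std, size=images.shape)
    return np.clip(images + noise, 0.0, 1.0)

# Stand-in for a small unstructured dataset (e.g. 100 grayscale 28x28 images).
train = np.random.default_rng(0).random((100, 28, 28))

# Double the training set with noisy copies; labels would be duplicated alongside.
augmented = np.concatenate([train, augment_with_noise(train, seed=1)], axis=0)
```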