The presentation discusses the need for collaboration and standardization in metadata operations for end-to-end data and machine learning platforms.
- The goal is to achieve end-to-end interoperability at scale through collaboration and standardization.
- Practitioners at every stage of the MLOps and DataOps lifecycle should collaborate to come up with standards.
- The creation of bad standards is worse than having no standards at all.
- Standardization should focus on interfaces, metrics, and operational considerations.
- Tools like MLServer, Seldon Core, and Kubernetes can help abstract data science from operations.
The speaker uses the example of microservices to illustrate the need for abstraction and standardization in machine learning operations. Just as an Ops person doesn't need to know the exact lines of code in a Django app, they should be able to abstract the data science from the operations through standardized interfaces, metrics, and operational considerations.
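The abstraction the speaker describes can be sketched in a few lines of Python. This is an illustrative sketch only, not the API of MLServer, Seldon Core, or any other tool. The names `ModelRuntime`, `ModelMetadata`, `MeanModel`, and `Server` are hypothetical, invented here to show how an operations layer could serve a model and report metrics without knowing anything about the model's internals.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ModelMetadata:
    """Hypothetical minimum metadata every model exposes to the platform."""
    name: str
    version: str
    framework: str
    inputs: List[str] = field(default_factory=list)


class ModelRuntime(ABC):
    """Hypothetical standardized interface between data science and operations."""

    @abstractmethod
    def metadata(self) -> ModelMetadata:
        """Describe the model without exposing its implementation."""

    @abstractmethod
    def predict(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        """Run inference on a request payload."""


class MeanModel(ModelRuntime):
    """Toy stand-in for a real model: returns the mean of the input values."""

    def metadata(self) -> ModelMetadata:
        return ModelMetadata(
            name="mean-model",
            version="0.1.0",
            framework="pure-python",
            inputs=["values"],
        )

    def predict(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        values = payload["values"]
        return {"mean": sum(values) / len(values)}


class Server:
    """Operations side: depends only on the ModelRuntime interface."""

    def __init__(self, model: ModelRuntime) -> None:
        self.model = model
        self.request_count = 0

    def handle(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        # Count every request so metrics are uniform across all models.
        self.request_count += 1
        return self.model.predict(payload)

    def metrics(self) -> Dict[str, Any]:
        meta = self.model.metadata()
        return {
            "model": meta.name,
            "version": meta.version,
            "requests": self.request_count,
        }


server = Server(MeanModel())
result = server.handle({"values": [1, 2, 3]})
print(result)
print(server.metrics())
```

In this sketch `Server` never imports or inspects `MeanModel` beyond the shared interface, so swapping in a model built with a different framework would not change the operational code or the shape of its metrics.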
Organisations have increasingly adopted and integrated a non-trivial number of different frameworks at each stage of their machine learning lifecycle. Although this has helped reduce time-to-value for real-world AI use cases, it has come at the cost of complexity and interoperability bottlenecks. Each stage in the end-to-end lifecycle involves different stakeholders who make decisions and perform actions that can modify data and/or ML components, with use-case-specific but ever-compounding risks, resulting in a growing need to ensure that a minimum level of metadata is collected, tracked and managed. This becomes increasingly important given the need to meet relevant overarching compliance requirements, as well as architectural requirements on lineage, auditability, accountability and reproducibility. In this session we will dive into the challenges present in the metadata layer of large-scale systems, as well as tooling, best practices and solutions that can be adopted to tackle these challenges. We will discuss the rise of metadata management systems, the challenges they have been able to solve, and the critical shortcomings where ecosystem-wide collaboration, starting with tooling-level alignment, will be key to ensuring the long-term robustness of these heterogeneous end-to-end platforms.