Authors: Jakub Piotr Cłapa, Marcus Edel
2022-06-23

Datasets are the backbone of machine learning (ML), but some are more critical than others. A core set of them is used by researchers to evaluate machine-learning models and thereby track how ML capabilities advance over time. One of the best known is the ImageNet dataset, which kicked off the modern ML revolution; another is Lyft's dataset for training self-driving cars. Over the years, studies have found that these datasets can contain serious flaws. ImageNet, for example, has many labels that are simply wrong: a mushroom labeled a spoon, a lion labeled a monkey. In the Lyft dataset, several cars are not annotated at all. All these datasets have one thing in common: they were built with a highly error-prone annotation pipeline and little or no quality checking. We worked on an open-source tool that combines novel unsupervised machine-learning pipelines to help annotators and machine-learning engineers identify and filter out potential label errors. In this talk, we will share our findings on how label errors affect the existing training process, discuss possible implications, and dive into how we leveraged unsupervised learning to filter out annotation errors, illustrated with real-world examples.
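The abstract does not spell out the method, but as a rough illustration of the kind of unsupervised filtering it refers to, here is a minimal sketch (our assumption, not the authors' actual pipeline): flag a sample whenever its label disagrees with most of its nearest neighbors in an embedding space, since mislabeled points tend to sit inside clusters of a different class. The function name flag_label_errors and all parameter choices below are hypothetical.

```python
# Minimal sketch of embedding-based label-error detection (an assumed
# approach for illustration, not the tool's actual pipeline).
# Idea: embed each sample (e.g., with a self-supervised model), then flag
# samples whose label disagrees with most of their nearest neighbors.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_label_errors(embeddings: np.ndarray,
                      labels: np.ndarray,
                      k: int = 10,
                      agreement_threshold: float = 0.3) -> np.ndarray:
    """Return indices of samples whose label looks suspicious.

    embeddings: (n, d) array of per-sample feature vectors
    labels:     (n,) array of integer class labels (possibly noisy)
    k:          number of neighbors to consult
    agreement_threshold: flag a sample if fewer than this fraction of
                         its neighbors share its label
    """
    # k + 1 because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbor_labels = labels[idx[:, 1:]]          # drop the self-match
    agreement = (neighbor_labels == labels[:, None]).mean(axis=1)
    return np.where(agreement < agreement_threshold)[0]

# Toy usage: two well-separated clusters with a few flipped labels.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(6, 1, (100, 8))])
lab = np.array([0] * 100 + [1] * 100)
lab[:5] = 1                                       # inject label errors
print(flag_label_errors(emb, lab))                # typically flags 0..4
```

In practice the embeddings would come from a model trained without the noisy labels (e.g., a self-supervised encoder), so that the filtering step does not inherit the very annotation errors it is meant to catch.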