The presentation discusses the use of machine learning to discover vulnerabilities in software dependencies and the limitations of current approaches.
- Continuous vigilance is necessary to identify vulnerabilities in software dependencies
- Machine learning can help discover vulnerabilities, but it is not self-sufficient and its models require continuous improvement
- Data imbalance can cause bias in machine learning models, and self-training can be used to address this issue
- Discovering vulnerabilities in software dependencies is complex and requires a multi-faceted approach
The presentation highlights the importance of discovering vulnerabilities in software dependencies, which are often overlooked in traditional approaches to software security. The use of machine learning is presented as a promising solution, but it is not without limitations. The presenter emphasizes the need for continuous improvement and vigilance in identifying vulnerabilities, as well as the importance of addressing data imbalance in machine learning models. Overall, the presentation underscores the complexity of securing modern software and the need for a multi-faceted approach.
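To illustrate the data-imbalance point, one common remedy is to re-weight the rare vulnerability-related class during training; a minimal scikit-learn-style sketch follows. The toy data items, labels, and model choice are assumptions for illustration only, and the presentation's own remedy, self-training, is sketched further below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data items: text labeled 1 if vulnerability-related, else 0.
texts = [
    "fix buffer overflow in parser",
    "update README",
    "patch SQL injection in login form",
    "bump version to 2.3.1",
]
labels = [1, 0, 1, 0]

# class_weight="balanced" re-weights the rare vulnerability-related class
# so the much larger non-vulnerability class does not dominate training.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced"),
)
model.fit(texts, labels)
print(model.predict(["fix remote code execution in deserializer"]))
```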
Software Composition Analysis (SCA) products report vulnerabilities in third-party dependencies by comparing libraries detected in an application against a database of known vulnerabilities. These databases typically incorporate multiple sources, such as bug tracking systems, source code commits, and mailing lists, and must be curated by security researchers to maximize accuracy.
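As a rough illustration of that matching step, the sketch below looks up detected library versions in a small in-memory advisory database. The library names, versions, and advisory identifiers are hypothetical placeholders, not entries from any real SCA product or vulnerability feed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Advisory:
    library: str
    affected_versions: frozenset   # simplified: an explicit set of versions
    advisory_id: str

# Hypothetical curated database of known vulnerabilities.
VULN_DB = [
    Advisory("example:parser-lib", frozenset({"1.2.0", "1.2.1"}), "ADVISORY-0001"),
    Advisory("example:http-client", frozenset({"4.0.3"}), "ADVISORY-0002"),
]

def report_vulnerabilities(detected_libraries):
    """detected_libraries: iterable of (library, version) pairs found in the app."""
    findings = []
    for library, version in detected_libraries:
        for advisory in VULN_DB:
            if advisory.library == library and version in advisory.affected_versions:
                findings.append((library, version, advisory.advisory_id))
    return findings

app_dependencies = [
    ("example:parser-lib", "1.2.1"),
    ("example:logging-core", "3.0.0"),
]
print(report_vulnerabilities(app_dependencies))
# [('example:parser-lib', '1.2.1', 'ADVISORY-0001')]
```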
We designed and implemented a machine learning system that features a complete pipeline, from data collection, model training, and prediction on data items, to validation of new models before deployment. The process is executed iteratively to generate better models with newer labels, and it incorporates self-training to automatically grow its training dataset.
The deployed model is used to automatically predict the vulnerability-relatedness of each data item. This allows us to effectively discover vulnerabilities across the open-source library ecosystem.
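A minimal sketch of the iterative self-training idea might look like the following: a classifier trained on labeled data items scores the unlabeled pool, and its most confident predictions are folded back into the training set as pseudo-labels before retraining. The features, classifier, confidence threshold, and toy data items here are assumptions for illustration, not the system's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(labeled_texts, labels, unlabeled_texts,
               iterations=3, confidence=0.8):
    texts, y = list(labeled_texts), list(labels)
    pool = list(unlabeled_texts)
    vectorizer = TfidfVectorizer()
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    for _ in range(iterations):
        # Retrain on the (possibly grown) labeled set.
        X = vectorizer.fit_transform(texts)
        model.fit(X, y)
        if not pool:
            break
        probs = model.predict_proba(vectorizer.transform(pool))
        still_unlabeled = []
        for text, p in zip(pool, probs):
            if p.max() >= confidence:                    # confident pseudo-label
                texts.append(text)
                y.append(int(model.classes_[p.argmax()]))
            else:
                still_unlabeled.append(text)             # stays in the pool
        pool = still_unlabeled
    return vectorizer, model

# The trained model then predicts the vulnerability-relatedness of new
# data items (e.g. commit messages, bug reports, mailing list posts).
vectorizer, model = self_train(
    ["fix heap overflow in parser", "update documentation"], [1, 0],
    ["patch use-after-free in decoder", "refactor build scripts"],
)
print(model.predict(vectorizer.transform(["fix integer overflow in image loader"])))
```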
To help maintain performance stability, our methodology also includes an additional evaluation step that automatically determines how well the model from a new iteration would fare before it is deployed. In particular, the evaluation measures how much the new model agrees with the old one, while aiming to improve metrics such as precision and/or recall.
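A hedged sketch of such a pre-deployment check is shown below: the candidate model from the latest iteration is compared against the currently deployed one on a held-out labeled set, measuring their agreement as well as precision and recall, and the candidate is promoted only if it does not regress. The thresholds, promotion rule, and the tiny evaluation set are illustrative assumptions.

```python
from sklearn.metrics import precision_score, recall_score

def should_deploy(old_preds, new_preds, true_labels, min_agreement=0.8):
    # Fraction of items on which the candidate agrees with the deployed model.
    agreement = sum(o == n for o, n in zip(old_preds, new_preds)) / len(true_labels)
    old_precision = precision_score(true_labels, old_preds)
    new_precision = precision_score(true_labels, new_preds)
    old_recall = recall_score(true_labels, old_preds)
    new_recall = recall_score(true_labels, new_preds)
    print(f"agreement={agreement:.2f}  "
          f"precision {old_precision:.2f} -> {new_precision:.2f}  "
          f"recall {old_recall:.2f} -> {new_recall:.2f}")
    # Promote only if the candidate stays consistent with the deployed model
    # and does not regress on precision or recall.
    return (agreement >= min_agreement
            and new_precision >= old_precision
            and new_recall >= old_recall)

# Hypothetical predictions on a small held-out labeled set.
truth = [1, 0, 1, 0, 1, 0, 0, 1]
old   = [1, 0, 1, 0, 0, 0, 0, 1]
new   = [1, 0, 1, 0, 1, 0, 0, 1]
print(should_deploy(old, new, truth))  # True in this toy example
```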
This is the first study of its kind across a variety of data sources, and our paper was recently awarded the ACM SIGSOFT Distinguished Paper Award at the Mining Software Repositories Conference (MSR) 2020.