A Machine Learning Approach for Vulnerability Curation

  • Yang Chen, Andrew E. Santosa, Ang Ming Yi, Abhishek Sharma, Asankhaya Sharma, D. Lo
  • Published 29 June 2020
  • Computer Science
  • Proceedings of the 17th International Conference on Mining Software Repositories
Software composition analysis depends on a database of open-source library vulnerabilities, curated by security researchers using various sources, such as bug tracking systems, commits, and mailing lists. We report the design and implementation of a machine learning system to help the curation by automatically predicting the vulnerability-relatedness of each data item. It supports a complete pipeline from data collection, model training and prediction, to the validation of new models before… 
Security Issue Classification for Vulnerability Management with Semi-supervised Learning
This work proposes the use of semi-supervised machine learning to classify issues as security-related to provide additional vulnerabilities in an automated pipeline, and its models, based on a Hierarchical Attention Network, outperform previously proposed models on a manually labelled test dataset.
TRACER: Finding Patches for Open Source Software Vulnerabilities
An empirical study examines the quality and characteristics of patches for OSS vulnerabilities in two state-of-the-art vulnerability databases, and the first automated approach, named TRACER, is proposed to find patches for an OSS vulnerability from multiple sources.
SPI: Automated Identification of Security Patches via Commits
A deep learning-based security patch identification system that consists of two composite neural networks: one that utilizes pretrained word representations learned from commits of open source repositories, and one code-revision neural network that takes the code before and after revision and learns the distinction at the statement level.
Predictive Models in Software Engineering: Challenges and Opportunities
The key models and approaches used are described, the different models are classified, the range of key application areas is summarized, research results are analyzed, and a proposed research road map for these opportunities is provided.
A Survey on Data-driven Software Vulnerability Assessment and Prioritization
This survey provides a taxonomy of past research efforts, highlights best practices for data-driven software vulnerability (SV) assessment and prioritization, discusses current limitations, and proposes potential solutions to address them.
Security Bug Report Usage for Software Vulnerability Research: A Systematic Mapping Study
Findings from a systematic mapping study of research that uses security bug reports for software vulnerability research can be leveraged to identify research opportunities in the domains of software vulnerability classification and automated vulnerability repair techniques.
Automated Identification of Libraries from Vulnerability Data: Can We Do Better?
Software engineers depend heavily on software libraries and have to update their dependencies once vulnerabilities are found in them. Software Composition Analysis (SCA) helps developers identify


Automated Identification of Libraries from Vulnerability Data
This work formulates and solves, for the first time, library name identification from NVD data as an extreme multi-label learning (XML) problem, and deploys the solution in a complete production system.
Machine learning for finding bugs: An initial report
While on the surface the initial results were encouraging, further investigation suggests that the machine learning techniques used are not suitable replacements for static program analysis tools due to low precision of the results.
Toward Large-Scale Vulnerability Discovery using Machine Learning
This paper presents an approach that uses lightweight static and dynamic features with machine learning techniques to predict whether a test case is likely to contain a software vulnerability, and develops and implements VDiscover, a tool that uses state-of-the-art machine learning techniques to predict vulnerabilities in test cases.
A Practical Approach to the Automatic Classification of Security-Relevant Commits
  • A. Sabetta, M. Bezzi
  • Computer Science
  • 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME)
  • 2018
An approach is proposed that uses machine learning to analyze source code repositories and automatically identify commits that are security-relevant (i.e., that are likely to fix a vulnerability), requiring a significantly smaller amount of training data and employing a simpler architecture.
Vulnerability Extrapolation: Assisted Discovery of Vulnerabilities Using Machine Learning
This paper proposes a method for assisted discovery of vulnerabilities in source code by embedding code in a vector space and automatically determining API usage patterns using machine learning, which can be exploited to guide the auditing of code and to identify potentially vulnerable code with similar characteristics.
Automated identification of security issues from commit messages and bug reports
This work describes an efficient automatic vulnerability identification system geared towards tracking large-scale projects in real time using natural language processing and machine learning techniques and achieves promising results on vulnerability identification.
VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assist Code Audits
A new method of finding potentially dangerous code in code repositories with a significantly lower false-positive rate than comparable systems is presented, which combines code-metric analysis with metadata gathered from code repositories to help code review teams prioritize their work.
Automated vulnerability detection system based on commit messages
A large-scale crawling of Git commits for some popular open source repositories is conducted, a web-based triage system is developed, and a deep neural network is implemented to automatically identify vulnerability-fixing commits (VFC) based on the commit messages.
When a Patch Goes Bad: Exploring the Properties of Vulnerability-Contributing Commits
This study traced 68 vulnerabilities in the Apache HTTP server back to the version control commits that originally contributed the vulnerable code, and showed that vulnerability-contributing commits (VCCs) are large: more than twice as much code churn on average as non-VCCs, even when normalized against lines of code.
The importance of accounting for real-world labelling when predicting software vulnerabilities
The results reveal that the unrealistic labelling assumption can profoundly mislead the scientific conclusions drawn, suggesting highly effective and deployable prediction results that vanish when the authors fully account for realistically available labelling in the experimental methodology.