Commit2Vec: Learning Distributed Representations of Code Changes

@article{Lozoya2021Commit2VecLD,
  title={Commit2Vec: Learning Distributed Representations of Code Changes},
  author={Roc{\'i}o Cabrera Lozoya and Arnaud Baumann and Antonino Sabetta and Michele Bezzi},
  journal={SN Comput. Sci.},
  year={2021},
  volume={2},
  pages={150}
}
Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories). In this work, we elaborate upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and we adapt it to represent source changes (i.e… 
Assessing the Effectiveness of Syntactic Structure to Learn Code Edit Representations
TLDR
This paper evaluates whether structural information from the AST, i.e., paths between AST leaf nodes, helps with the task of code edit classification, using two datasets of fine-grained syntactic edits.
CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained Model
TLDR
This work proposes CLSEBERT, a Contrastive Learning Framework for Syntax Enhanced Code Pre-Trained Model, to deal with various code intelligence tasks and introduces two novel pretraining objectives, to predict the edges between nodes in the abstract syntax tree.
Co-training for Commit Classification
TLDR
This paper applies co-training, a semi-supervised learning method, to take advantage of the two views available – the commit message and the code changes – to improve commit classification.
A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python
TLDR
All the text representation methods are suitable for code representation in this particular task, but the BERT model is the most promising as it is the least time consuming and the LSTM model based on it achieved the best overall accuracy in predicting Python source code vulnerabilities.
Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers
TLDR
The extent to which the output of off-the-shelf static code analyzers can be used as a source of features to represent commits in Machine Learning (ML) applications is studied and it is found that the combination of the method with commit2vec represents a tangible improvement over the state of the art in the automatic identification of commits that fix vulnerabilities.
Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits
TLDR
A family of hierarchical neural network models is built for the identification of security-relevant commits by evaluating five different input representations, showing that models that learn on tokens extracted from the commit diff are simpler and more effective than models that learn from path-contexts extracted from the AST.
Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories
TLDR
This paper presents an approach that combines heuristics stemming from practical experience with machine learning (ML), specifically natural language processing (NLP), to address the lack of comprehensive sources of accurate vulnerability data.
MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain
TLDR
The result of this work is a mature, fine-tuned Multimodal Question Duplicity Detection (MQDD) model, ready to be integrated into a Stack Overflow search system, where it can help users find answers for already answered questions.
Detection, assessment and mitigation of vulnerabilities in open source dependencies
TLDR
The lessons learned when maturing the tool from a research prototype to an industrial-grade solution are reported on and an empirical study was conducted to compare its detection capabilities with those of OWASP Dependency Check.
ATOM: Commit Message Generation Based on Abstract Syntax Tree and Hybrid Ranking
TLDR
A novel commit message generation model, named ATOM, which explicitly incorporates abstract syntax tree for representing code changes and integrates both retrieved and generated messages through hybrid ranking, which demonstrates the effectiveness of ATOM in generating accurate code commit messages.

References

code2vec: learning distributed representations of code
TLDR
A neural model that represents a snippet of code as a single fixed-length, continuous distributed code vector, which can be used to predict semantic properties of the snippet, making it the first to successfully predict method names based on a large, cross-project corpus.
code2seq: Generating Sequences from Structured Representations of Code
TLDR
This model represents a code snippet as the set of compositional paths in its abstract syntax tree and uses attention to select the relevant paths while decoding and significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models.
A Convolutional Attention Network for Extreme Summarization of Source Code
TLDR
An attentional neural network that employs convolution on the input tokens to detect local time-invariant and long-range topical attention features in a context-dependent way to solve the problem of extreme summarization of source code snippets into short, descriptive function name-like summaries is introduced.
A Survey of Machine Learning for Big Code and Naturalness
TLDR
This article presents a taxonomy based on the underlying design principles of each model and uses it to navigate the literature and discuss cross-cutting and application-specific challenges and opportunities.
Suggesting accurate method and class names
TLDR
A neural probabilistic language model for source code that is specifically designed for the method naming problem is introduced, along with a variant that is, to the authors' knowledge, the first that can propose neologisms, i.e., names that have not appeared in the training corpus.
A Literature Study of Embeddings on Source Code
TLDR
In summary, word embedding has been successfully applied to different granularities of source code, and with access to countless open-source repositories, the authors see potential in applying other data-driven natural language processing techniques to source code in the future.
A Practical Approach to the Automatic Classification of Security-Relevant Commits
  • A. Sabetta, M. Bezzi
  • Computer Science
    2018 IEEE International Conference on Software Maintenance and Evolution (ICSME)
  • 2018
TLDR
An approach that uses machine-learning to analyze source code repositories and to automatically identify commits that are security-relevant (i.e., that are likely to fix a vulnerability) is proposed, requiring a significantly smaller amount of training data and employing a simpler architecture.
Detecting Copy Directions among Programs Using Extreme Learning Machines
TLDR
This work constructs feature space for describing features of every two programs with possible plagiarism relationship by employing extreme learning machine (ELM), and proposes a feedback framework to find a good feature space that can achieve both accuracy and efficiency.
Convolutional Neural Networks over Tree Structures for Programming Language Processing
TLDR
A novel tree-based convolutional neural network (TBCNN) is proposed for programming language processing, in which a convolution kernel is designed over programs' abstract syntax trees to capture structural information.
A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software
TLDR
The dataset has been successfully used to train classifiers that could automatically identify security-relevant commits in code repositories, and is released under an open-source license together with supporting scripts that allow researchers to automatically retrieve the actual content of the commits from the corresponding repositories.