Towards Learning (Dis)-Similarity of Source Code from Program Contrasts

Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, Saikat Chakraborty
Understanding the functional (dis)-similarity of source code is important for code modeling tasks such as software vulnerability detection and code clone detection. We present DISCO (DIS-similarity of COde), a novel self-supervised model that focuses on identifying (dis)similar functionalities of source code. Unlike existing work, our approach does not require a huge amount of randomly collected data. Rather, we design structure-guided code transformation algorithms to generate synthetic code…

Contrastive Code Representation Learning
ContraCode is a contrastive pre-training task that learns code functionality, not form; it improves code summarization and TypeScript type inference accuracy by 2 to 13 percentage points over competitive baselines.
A detection framework for semantic code clones and obfuscated code
Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks
Inspired by work on manually defined vulnerability patterns over various code representation graphs and by recent advances in graph neural networks, Devign is a general graph-neural-network-based model for graph-level classification that learns from a rich set of code semantic representations.
Unified Pre-training for Program Understanding and Generation
Analysis reveals that PLBART learns program syntax, style, and logical flow, which are crucial to program semantics; it thus excels even with limited annotations and outperforms or rivals state-of-the-art models.
CODIT: Code Editing With Tree-Based Neural Models
A novel tree-based neural network system models source code changes and learns code change patterns from the wild. The model is realized as a change suggestion engine, Codit, and evaluation shows its effectiveness in learning and suggesting patches.
VUDDY: A Scalable Approach for Vulnerable Code Clone Discovery
VUDDY outperformed four state-of-the-art code clone detection techniques in terms of both scalability and accuracy, and proved its effectiveness by detecting zero-day vulnerabilities in widely used software systems, such as Apache HTTPD and Ubuntu OS Distribution.
MISIM: An End-to-End Neural Code Similarity System
MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax, and provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters.
VulDeePecker: A Deep Learning-Based System for Vulnerability Detection
This study uses deep learning-based vulnerability detection to relieve human experts of the tedious and subjective task of manually defining features. Experimental results show that VulDeePecker achieves far fewer false negatives, with reasonable false positives, than other approaches.
VulPecker: an automated vulnerability detection system based on code similarity analysis
VulPecker (Vulnerability Pecker) is a system for automatically detecting whether a piece of software source code contains a given vulnerability. Experiments show that VulPecker detects 40 vulnerabilities that are not published in the National Vulnerability Database (NVD).
DOBF: A Deobfuscation Pre-Training Objective for Programming Languages
DOBF is a new pre-training objective that leverages the structural aspect of programming languages: it pre-trains a model to recover the original version of obfuscated source code. Models pre-trained with DOBF outperform existing approaches on multiple downstream tasks.