Positive and Unlabeled Learning for Detecting Software Functional Clones with Adversarial Training

@inproceedings{Wei2018PositiveAU,
  title={Positive and Unlabeled Learning for Detecting Software Functional Clones with Adversarial Training},
  author={Huihui Wei and Ming Li},
  booktitle={IJCAI},
  year={2018}
}
Software clone detection is an important problem for software maintenance and evolution, and it has attracted considerable attention. However, existing approaches ignore the fact that people label pairs of code fragments as clones only when they happen to discover them, while a huge number of undiscovered clone pairs and non-clone pairs are left unlabeled. In this paper, we argue that the clone detection task in the real world should be formalized as a Positive-Unlabeled (PU) learning…
Find Me if You Can: Deep Software Clone Detection by Exploiting the Contest between the Plagiarist and the Detector
TLDR
A novel clone detection approach, namely ACD, is proposed to mimic the adversarial process between the plagiarist and the detector, which makes it possible not only to build a strong clone detector but also to model the behavior of the plagiarists.
Neural Detection of Semantic Code Clones Via Tree-Based Convolution
TLDR
This work proposes a new approach that uses tree-based convolution to detect semantic clones, by capturing both the structural information of a code fragment from its AST and lexical information from code tokens, and addresses the limitation that source code has an unlimited vocabulary of tokens and models.
An Effective Semantic Code Clone Detection Framework Using Pairwise Feature Fusion
TLDR
This work proposes a novel detection framework using machine learning for automated detection of all four types of clones using AST and PDG features and finds that boosted tree algorithms like XGBoost are quite competitive in clone detection.
Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
TLDR
The first work to apply graph neural networks to code clone detection; it builds a graph representation of programs called the flow-augmented abstract syntax tree (FA-AST), and the approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks
TLDR
A prototype tool, HOLMES, is developed based on a novel approach to semantic code clone detection and empirically evaluated on popular code clone benchmarks, showing that HOLMES performs considerably better than the other state-of-the-art tool, TBCCD.
MVP: Detecting Vulnerabilities using Patch-Enhanced Vulnerability Signatures
TLDR
This paper proposes a novel approach to detect recurring vulnerabilities with low false positives and low false negatives and implements a tool named MVP, which significantly outperformed state-of-the-art clone-based and function matching-based recurring vulnerability detection approaches.
Modular Tree Network for Source Code Representation Learning
TLDR
This article proposes a modular tree network that dynamically composes different neural network units into tree structures based on the input AST, and can capture the semantic differences between types of AST substructures.
MixPUL: Consistency-based Augmentation for Positive and Unlabeled Learning
TLDR
A simple yet effective data augmentation method, coined MixPUL, is proposed based on consistency regularization; it provides a new perspective on using PU data and reduces the margin loss between positive and unlabeled pairs, which explicitly optimizes AUC and yields faster convergence.
Positive and Unlabeled Learning with Label Disambiguation
TLDR
A novel algorithm dubbed “Positive and Unlabeled learning with Label Disambiguation” (PULD) is proposed, which first regards all the unlabeled examples in PU learning as ambiguously labeled as positive and negative, and then employs a margin-based label disambiguation strategy, which enlarges the margin of classifier response between the most likely label and the less likely one, to find the unique ground-truth label of each unlabeled example.
Online Positive and Unlabeled Learning
TLDR
A novel positive and unlabeled learning algorithm in an online training mode, which trains a classifier solely on the positive and unlabeled data arriving in sequential order; it is shown that for any newly arriving datum, the model can be updated independently and incrementally by a gradient-based online learning method.

References

SHOWING 1-10 OF 27 REFERENCES
Deep learning code fragments for code clone detection
TLDR
This work introduces learning-based detection techniques where everything for representing terms and fragments in source code is mined from the repository; it compares this approach to a traditional structure-oriented technique and finds that it detects clones that were either undetected or suboptimally reported by the prominent tool Deckard.
Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code
TLDR
Experiments on software clone detection benchmarks indicate that the CDLH approach is effective and outperforms the state-of-the-art approaches in software functional clone detection.
Towards a Big Data Curated Benchmark of Inter-project Code Clones
TLDR
A Big Data clone detection benchmark is presented that consists of known true and false positive clones in a Big Data inter-project Java repository, and it is shown how the benchmark can be used to measure the recall and precision of clone detection techniques.
Language-Independent Clone Detection Applied to Plagiarism Detection
TLDR
This paper addresses the problem of clone detection applied to plagiarism detection in the context of source code assignments done by computer science students, and proposes an alignment method using the parallel principle at local resolution (character level) to compute similarities between documents.
SourcererCC: Scaling Code Clone Detection to Big-Code
TLDR
This paper presents a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-project repositories using a standard workstation, and evaluates the scalability, execution time, recall and precision, and compares it to four publicly available and state-of-the-art tools.
CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code
TLDR
A new clone detection technique, which consists of a transformation of the input source text and a token-by-token comparison, is proposed; it found clones effectively, and the metrics were able to identify the characteristics of the systems.
Explaining and Harnessing Adversarial Examples
TLDR
It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets.
NICAD: Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization
  • C. Roy, J. Cordy
  • Computer Science
  • 2008 16th IEEE International Conference on Program Comprehension
  • 2008
TLDR
A new language-specific parser-based but lightweight clone detection approach exploiting a novel application of a source transformation system that is capable of finding near-miss clones with high precision and recall, and with reasonable performance.
Adversarial Training Methods for Semi-Supervised Text Classification
TLDR
This work extends adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself.
Convex Formulation for Learning from Positive and Unlabeled Data
TLDR
This paper proposes a convex formulation for PU classification that can still cancel the bias, and proves that the estimators converge to the optimal solutions at the optimal parametric rate.