An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis

@article{Cosma2012AnAT,
  title={An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis},
  author={Georgina Cosma},
  journal={IEEE Transactions on Computers},
  year={2012},
  volume={61},
  pages={379--394}
}
  • G. Cosma
  • Published 1 March 2012
  • Computer Science
  • IEEE Transactions on Computers
Plagiarism is a growing problem in academia. Academics often use plagiarism detection tools to detect similar source-code files. Once similar files are detected, the academic proceeds with the investigation process which involves identifying the similar source-code fragments within them that could be used as evidence for proving plagiarism. This paper describes PlaGate, a novel tool that can be integrated with existing plagiarism detection tools to improve plagiarism detection performance. The… 
Style analysis for source code plagiarism detection
TLDR
The aim of this thesis is to enhance methods for plagiarism detection in source code using a style analysis approach that has been used to detect authorship.
Source-code Similarity Detection and Detection Tools Used in Academia
TLDR
This review gives an overview of definitions of plagiarism, plagiarism detection tools, comparison metrics, obfuscation methods, datasets used for comparison, and algorithm types and identifies interesting insights about metrics and datasets for quantitative tool comparison and categorisation of detection algorithms.
EPlag: A two layer source code plagiarism detection system
TLDR
This paper develops a source code plagiarism detection system and tries to improve on existing techniques by separating the suspected files from the non-plagiarized files, thus reducing the dataset for further comparison.
Searching source code fragments using incremental clustering
TLDR
An algorithm for source code parsing and processing is proposed as part of a complex system for plagiarism detection, along with methods for vector search among clusters and the use of conditional entropy to select the important vector elements used in the search algorithm.
A Language-Independent Library for Observing Source Code Plagiarism
TLDR
A library for observing two plagiarism-suspected codes that is integrable, functional, and helpful for teaching assistants and can enhance teaching assistants' accuracy and reduce the tasks' completion time.
Evaluating the robustness of source code plagiarism detection tools to pervasive plagiarism-hiding modifications
TLDR
The results of the performed evaluations indicate that currently available source code plagiarism detection tools are not robust against modifications which apply fine-grained transformations to the source code structure, and graph-based tools show potentially greater robustness to pervasive plagiarism-hiding modifications.
A Novel Approach for Detecting Logic Similarity in Plagiarised Source Code
TLDR
A novel approach to source code plagiarism detection is proposed that compares two programs for logic similarity and demonstrates that the approach is resilient to semantics-preserving transformations.
Version history based source code plagiarism detection in proprietary systems
TLDR
A novel approach is proposed that applies Mining Software Repositories (MSR) based techniques to the problem of plagiarism detection: it creates a programming style profile for each maintenance engineer by mining the version history and uses that profile to detect source code commits that are likely to be plagiarized.
STYLE ANALYSIS FOR SOURCE CODE PLAGIARISM DETECTION
TLDR
A number of publications which report style comparison to detect source code plagiarism are reviewed in order to determine research gaps and explore areas where this approach can be improved.
Academic Source Code Plagiarism Detection by Measuring Program Behavioral Similarity
TLDR
BPlag is presented, a behavioural approach to source code plagiarism detection designed to be both robust to pervasive plagiarism-hiding transformations and accurate in the detection of plagiarised source code.

References

SHOWING 1-10 OF 168 REFERENCES
PDetect: A Clustering Approach for Detecting Plagiarism in Source Code Datasets
TLDR
A clustering oriented approach for facing the problem of source code plagiarism, designed such that it may be easily adapted over any keyword-based programming language and it is quite beneficial when compared with earlier plagiarism detection approaches.
Plagiarism à la Mode: A Comparison of Automated Systems for Detecting Suspected Plagiarism
TLDR
A comparison is presented of five systems, two based on attribute counting and three using metrics based on structure, and it is found that the systems based on structural information consistently equal or better the performance of systems based on attribute-counting metrics.
Enhancing Computer-Aided Plagiarism Detection
TLDR
This work is dedicated to the development and the use of software instruments that help to reveal plagiarism, and building the taxonomy of existing plagiarism detection methods according to their speed and reliability characteristics.
MUDABlue: an automatic categorization system for open source repositories
Enriching reverse engineering with semantic clustering
TLDR
This paper analyzes how the semantics of the source code are spread over the source artifacts using latent semantic indexing, an information retrieval technique that clusters artifacts that use similar terms, and reveals the most relevant terms for the computed clusters.
Semantic driven program analysis
  • A. Marcus
  • Computer Science
    20th IEEE International Conference on Software Maintenance, 2004. Proceedings.
  • 2004
TLDR
The paper advocates for the use of latent semantic indexing as the underlying support for the semantic driven analysis of existing software systems to support program understanding and various software maintenance tasks, such as recovery of traceability links between documentation and source code.
Software for detecting suspected plagiarism: comparing structure and attribute-counting systems
TLDR
A comparison is presented of two systems based on attribute counting and a structure-metric system that consistently equals or betters the performance of systems based on attribute-counting metrics.
Automatic software clustering via Latent Semantic Analysis
TLDR
Applying Latent Semantic Analysis to the domain of source code and internal documentation for the support of software reuse is a new application of this method and a departure from the normal application domain of natural language.
Finding Plagiarisms among a Set of Programs with JPlag
TLDR
JPlag is a web service that finds pairs of similar programs among a given set of programs; its architecture and its comparison algorithm, which is based on a known one called Greedy String Tiling, are described.