Shared information and program plagiarism detection

@article{Chen2004SharedIA,
  title={Shared information and program plagiarism detection},
  author={Xin Chen and Brent Francia and Ming Li and Brian McKinnon and Amit Seker},
  journal={IEEE Transactions on Information Theory},
  year={2004},
  volume={50},
  pages={1545-1551}
}
A fundamental question in information theory and in computer science is how to measure similarity or the amount of shared information between two sequences. We have proposed a metric, based on Kolmogorov complexity, to answer this question and have proven it to be universal. We apply this metric in measuring the amount of shared information between two computer programs, to enable plagiarism detection. We have designed and implemented a practical system SID (Software Integrity Diagnosis system… 
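Kolmogorov complexity is not computable, so any practical system has to approximate it with a real compressor. The sketch below illustrates that general idea with the normalized compression distance, using Python's standard `zlib` as a stand-in compressor. It is not SID's implementation: the compressor choice, the function names `compressed_size` and `ncd`, and the sample inputs are all assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input, a crude stand-in for K(data)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample inputs: a small program, a copy with renamed
# identifiers, and an unrelated piece of prose.
original = b"""
def total_of_squares(limit):
    total = 0
    for value in range(limit):
        if value % 2 == 0:
            total += value * value
    return total

print(total_of_squares(100))
"""

renamed = b"""
def sum_sq(n):
    acc = 0
    for k in range(n):
        if k % 2 == 0:
            acc += k * k
    return acc

print(sum_sq(100))
"""

unrelated = (
    b"Shipping forecasts are issued four times a day and cover the sea "
    b"areas around the coast, giving wind direction, weather, and "
    b"visibility for each area in a fixed, well known order."
)

print("original vs renamed:  ", round(ncd(original, renamed), 3))
print("original vs unrelated:", round(ncd(original, unrelated), 3))
```

A lower score means more shared information. General-purpose compressors add fixed overhead on very short inputs, so scores for small files are only roughly comparable and are best read relative to one another.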

Automated Detecting and Tracing for Plagiarized Programs using Gumbel Distribution Model
TLDR
The paper proposes a source code clustering algorithm that uses a probability model based on an extreme value distribution, and introduces pseudo-plagiarism, a kind of virtual plagiarism forced by a very strong functional requirement in the specification.
Source Code Plagiarism Detection Using Biological String Similarity Algorithms
TLDR
Two new measures for determining the accuracy of a given technique are proposed and an approach to convert code files into strings which can be compared for similarity in order to detect plagiarism is described.
A Program Plagiarism Detection Model Based on Information Distance and Clustering
TLDR
A metric based on information distance is proposed to measure similarity between two programs, and clustering analysis based on shared near neighbors is applied to provide more useful and detailed information about program plagiarism.
Syntax tree fingerprinting: a foundation for source code similarity detection
TLDR
The aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in intra-project copy-pastes and in plagiarism cases.
Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together
TLDR
Submitted articles are divided into small pieces, which are scanned and compared against databases connected to the server over the internet; the pieces are split into two parts for comparison with the existing databases.
Source Code Plagiarism Detection in Academia with Information Retrieval: Dataset and the Observation
TLDR
The dataset is designed for evaluation with an Information Retrieval (IR) perspective, and it is clear that most IR-based techniques are less effective than a baseline technique which relies on Running-Karp-Rabin Greedy-String-Tiling, even though some of them are far more time-efficient.
Efficient clustering-based source code plagiarism detection using PIY
TLDR
This work presents an approach called program it yourself (PIY) which is empirically shown to outperform MOSS in detection accuracy, and is also capable of maintaining detection accuracy and reasonable runtimes even when using extremely large data repositories.
A Novel Approach for Measurement of Source Code Similarity
TLDR
A novel approach to measuring similarity between pieces of code is presented, which should lead to fairer evaluation of assignments, since the tool makes it easy to detect copied programming assignments, code, or functions.
Plagiarism Detection in Programming Exercises Using a Markov Model Approach
TLDR
A Markov model based source code plagiarism detection tool is proposed (called MPlag), which applies Markov models to students' coding process data recorded while writing a program to detect plagiarism.
PLAGIAT : A CODE PLAGIARISM DETECTION TOOL
TLDR
A new model for source code plagiarism detection is proposed, built on machine learning: the Naïve Bayes algorithm, k-Nearest Neighbor, and the AdaBoost meta-learning algorithm are deployed in combination.

References

YAP3: improved detection of similarities in computer program and other texts
  • M. Wise
  • Computer Science
    SIGCSE '96
  • 1996
TLDR
YAP3, the third version of YAP, is reviewed, focusing on its novel underlying algorithm, Running-Karp-Rabin Greedy-String-Tiling (or RKR-GST), whose development arose from the observation with YAP and other systems that students shuffle independent code segments.
The similarity metric
TLDR
A new "normalized information distance" is proposed, based on the noncomputable notion of Kolmogorov complexity; it is shown to be a metric and is called the similarity metric.
An algorithmic approach to the detection and prevention of plagiarism
TLDR
This paper discusses one possible quantification which works well when applied to student computer programs and shows how this problem can be reduced by quantifying papers in such a way that equivalent papers are given equal values.
Winnowing: local algorithms for document fingerprinting
TLDR
The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.
On the Performance of Data Compression Algorithms Based Upon String Matching
TLDR
It is shown that the resulting compression rate converges with probability one to a quantity computable as the infimum of an information theoretic functional over a set of auxiliary random variables; the quantity is strictly greater than the rate distortion function of the source except in some symmetric cases.
Language trees and zipping.
TLDR
A very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series based on data-compression techniques, featuring highly accurate results for language recognition, authorship attribution, and language classification.
A suboptimal lossy data compression based on approximate pattern matching
TLDR
For stationary mixing sequences, the problem investigated by Steinberg and Gutman is settled by showing that a lossy extension of the Wyner-Ziv (1989) scheme cannot be optimal, and the asymptotic behavior of the so-called approximate waiting time $N_l$ is established.
An Introduction to Kolmogorov Complexity and Its Applications
TLDR
The book presents a thorough treatment of the central ideas of Kolmogorov complexity and their applications, with a wide range of illustrative examples, and will be ideal for advanced undergraduate students, graduate students, and researchers in computer science, mathematics, cognitive sciences, philosophy, artificial intelligence, statistics, and physics.
A universal algorithm for sequential data compression
TLDR
The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
Computer Algorithms for Plagiarism Detection
TLDR
A survey of computer algorithms used for the detection of student plagiarism is presented, and ethical and administrative issues involving detected plagiarism are discussed.