Shared information and program plagiarism detection

@article{Chen2004SharedIA,
  title={Shared information and program plagiarism detection},
  author={Xin Chen and Brent Francia and Ming Li and Brian McKinnon and Amit Seker},
  journal={IEEE Transactions on Information Theory},
  year={2004},
  volume={50},
  pages={1545--1551}
}
A fundamental question in information theory and in computer science is how to measure similarity or the amount of shared information between two sequences. We have proposed a metric, based on Kolmogorov complexity, to answer this question and have proven it to be universal. We apply this metric in measuring the amount of shared information between two computer programs, to enable plagiarism detection. We have designed and implemented a practical system SID (Software Integrity Diagnosis system… 
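Kolmogorov complexity is not computable, so a practical system has to estimate it with a real compressor. The sketch below is an illustration only, not the SID implementation: it assumes a shared-information ratio of the form R(x, y) = (K(x) - K(x|y)) / K(xy), uses zlib as a stand-in compressor, and estimates K(x|y) as C(yx) - C(y). The function names `compressed_size` and `shared_information` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Compressed length in bytes; a crude stand-in for K(data)."""
    return len(zlib.compress(data, 9))


def shared_information(x: bytes, y: bytes) -> float:
    """Estimate R(x, y) = (K(x) - K(x|y)) / K(xy) using zlib sizes.

    Assumption: K(x|y) is approximated by C(yx) - C(y), i.e. the extra
    bytes needed to encode x once y has already been seen.
    """
    k_x_given_y = max(compressed_size(y + x) - compressed_size(y), 0)
    ratio = (compressed_size(x) - k_x_given_y) / compressed_size(x + y)
    # Compressor overhead can push the estimate slightly outside [0, 1].
    return min(max(ratio, 0.0), 1.0)


if __name__ == "__main__":
    original = b"def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
    renamed = b"def total(vals):\n    acc = 0\n    for v in vals:\n        acc += v\n    return acc\n"
    print(shared_information(original, renamed))
```

A general-purpose compressor such as zlib works on raw bytes and is unreliable on very short inputs, so scores for small programs like the ones above should be read as rough indications only.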

Automated Detecting and Tracing for Plagiarized Programs using Gumbel Distribution Model
TLDR
A source code clustering algorithm based on a probability model of the extreme value distribution is proposed, along with the notion of pseudo-plagiarism, a kind of virtual plagiarism forced by a very strong functional requirement in the specification.
Source Code Plagiarism Detection Using Biological String Similarity Algorithms
TLDR
Two new measures for determining the accuracy of a given technique are proposed and an approach to convert code files into strings which can be compared for similarity in order to detect plagiarism is described.
A Program Plagiarism Detection Model Based on Information Distance and Clustering
TLDR
A metric based on information distance is proposed to measure similarity between two programs, and clustering analysis based on shared near neighbors is applied to provide more useful and detailed information about program plagiarism.
Syntax tree fingerprinting: a foundation for source code similarity detection
TLDR
The aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.
Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together
TLDR
Submitted articles are divided into small pieces, which are scanned and compared against databases connected to the server over the internet; the pieces are further divided into two parts to be compared with existing databases.
Source Code Plagiarism Detection in Academia with Information Retrieval: Dataset and the Observation
TLDR
The dataset is designed for evaluation with an Information Retrieval (IR) perspective, and it is clear that most IR-based techniques are less effective than a baseline technique which relies on Running-Karp-Rabin Greedy-String-Tiling, even though some of them are far more time-efficient.
Program Plagiarism Detection Using Parse Tree Kernels
TLDR
A comparison with existing systems such as SID and JPlag shows that the proposed system can detect plagiarism more accurately due to its ability of handling structural information.
Efficient clustering-based source code plagiarism detection using PIY
TLDR
This work presents an approach called program it yourself (PIY) which is empirically shown to outperform MOSS in detection accuracy, and is also capable of maintaining detection accuracy and reasonable runtimes even when using extremely large data repositories.
A Novel Approach for Measurement of Source Code Similarity
TLDR
A novel approach for measuring code similarity is presented, which should lead to fairer evaluation of assignments, since the tool makes it easy to detect copying of programming assignments, code, or functions.
Plagiarism Detection in Programming Exercises Using a Markov Model Approach
TLDR
A Markov model based source code plagiarism detection tool is proposed (called MPlag), which applies Markov models to students' coding process data recorded while writing a program to detect plagiarism.

References

An algorithmic approach to the detection and prevention of plagiarism
TLDR
This paper discusses one possible quantification which works well when applied to student computer programs and shows how this problem can be reduced by quantifying papers in such a way that equivalent papers are given equal values.
Winnowing: local algorithms for document fingerprinting
TLDR
The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any finger-printing technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.
On the Performance of Data Compression Algorithms Based Upon String Matching
TLDR
It is shown that the resulting compression rate converges with probability one to a quantity computable as the infimum of an information theoretic functional over a set of auxiliary random variables; the quantity is strictly greater than the rate distortion function of the source except in some symmetric cases.
A suboptimal lossy data compression based on approximate pattern matching
TLDR
For stationary mixing sequences, the problem investigated by Steinberg and Gutman is settled by showing that a lossy extension of the Wyner-Ziv (1989) scheme cannot be optimal, and the asymptotic behavior of the so-called approximate waiting time N_l is established.
An Introduction to Kolmogorov Complexity and Its Applications
TLDR
The book presents a thorough treatment of the central ideas of Kolmogorov complexity and their applications, with a wide range of illustrative examples, and will be ideal for advanced undergraduate students, graduate students, and researchers in computer science, mathematics, cognitive sciences, philosophy, artificial intelligence, statistics, and physics.
A universal algorithm for sequential data compression
TLDR
The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
Algorithmic clustering of music
We present a method for hierarchical music clustering, based on compression of strings that represent the music pieces. The method uses no background knowledge about music whatsoever: it is
An information-based sequence distance and its application to whole mitochondrial genome phylogeny
TLDR
A sequence distance that works on unaligned sequences using the information theoretical concept of Kolmogorov complexity and a program to estimate this distance is presented.
Distance based indexing for string proximity search
TLDR
It is shown that several distance measures, such as the compression distance and weighted character edit distance, are almost metrics, and that VP trees can be modified to improve their performance on string data, providing tradeoffs between search time and space.
A compression algorithm for DNA sequences and its applications in genome comparison
TLDR
A theory of measuring the relatedness between two DNA sequences is presented together with strong experimental support, demonstrated by correctly constructing a 16S (18S) rRNA tree and a whole genome tree for several species of bacteria.