# Shared information and program plagiarism detection

@article{Chen2004SharedIA, title={Shared information and program plagiarism detection}, author={Xin Chen and Brent Francia and Ming Li and Brian McKinnon and Amit Seker}, journal={IEEE Transactions on Information Theory}, year={2004}, volume={50}, pages={1545-1551} }

A fundamental question in information theory and in computer science is how to measure similarity or the amount of shared information between two sequences. We have proposed a metric, based on Kolmogorov complexity, to answer this question and have proven it to be universal. We apply this metric in measuring the amount of shared information between two computer programs, to enable plagiarism detection. We have designed and implemented a practical system SID (Software Integrity Diagnosis system…
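Kolmogorov complexity is not computable, so a measure of this kind is usually approximated in practice by substituting a real compressor. The following is a minimal sketch of that idea, not SID's actual implementation: it assumes `zlib` as the compressor, works on raw bytes rather than program tokens, and estimates the conditional complexity K(x|y) as C(yx) - C(y). The function names `compressed_size` and `shared_information` are hypothetical, chosen for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (a crude stand-in for K)."""
    return len(zlib.compress(data, 9))


def shared_information(x: bytes, y: bytes) -> float:
    """Approximate (K(x) - K(x|y)) / K(xy) with a real compressor.

    Assumption: K(x|y) is estimated as C(yx) - C(y), i.e. the extra bytes
    needed to encode x once y has already been seen by the compressor.
    """
    c_x = compressed_size(x)
    c_y = compressed_size(y)
    c_xy = compressed_size(x + y)
    c_x_given_y = max(compressed_size(y + x) - c_y, 0)
    score = (c_x - c_x_given_y) / c_xy
    # Real compressors are imperfect, so clamp the estimate into [0, 1].
    return min(max(score, 0.0), 1.0)
```

A score near 1 suggests that most of `x` can be reconstructed from `y`, and a score near 0 suggests little overlap. Because zlib only looks back 32 KB, this sketch is meaningful only for small inputs; a practical system would also normalize the programs (for example by tokenizing them) before compressing.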

## 283 Citations

Automated Detecting and Tracing for Plagiarized Programs using Gumbel Distribution Model

- Computer Science
- 2009

A source code clustering algorithm using a probability model based on the extreme value distribution is proposed, together with the notion of pseudo-plagiarism, a kind of virtual plagiarism forced by a very strong functional requirement in the specification.

Source Code Plagiarism Detection Using Biological String Similarity Algorithms

- Computer Science
- J. Inf. Knowl. Manag.
- 2014

Two new measures for determining the accuracy of a given technique are proposed and an approach to convert code files into strings which can be compared for similarity in order to detect plagiarism is described.

A Program Plagiarism Detection Model Based on Information Distance and Clustering

- Computer Science
- The 2007 International Conference on Intelligent Pervasive Computing (IPC 2007)
- 2007

A metric based on information distance is proposed to measure similarity between two programs, and clustering analysis based on shared near neighbors is applied to provide more beneficial and detailed information about the program plagiarism.

Syntax tree fingerprinting: a foundation for source code similarity detection

- Computer Science
- 2009

The aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together

- Computer Science
- 2015

Submitted articles are divided into small pieces, which are scanned and compared against databases connected to the server over the internet; the pieces are further split into two parts for comparison with the existing databases.

Source Code Plagiarism Detection in Academia with Information Retrieval: Dataset and the Observation

- Computer Science
- Informatics Educ.
- 2019

The dataset is designed for evaluation from an Information Retrieval (IR) perspective, and it is clear that most IR-based techniques are less effective than a baseline technique which relies on Running-Karp-Rabin Greedy-String-Tiling, even though some of them are far more time-efficient.

Program Plagiarism Detection Using Parse Tree Kernels

- Computer Science
- PRICAI
- 2006

A comparison with existing systems such as SID and JPlag shows that the proposed system can detect plagiarism more accurately due to its ability to handle structural information.

Efficient clustering-based source code plagiarism detection using PIY

- Computer Science
- Knowledge and Information Systems
- 2014

This work presents an approach called program it yourself (PIY) which is empirically shown to outperform MOSS in detection accuracy, and is also capable of maintaining detection accuracy and reasonable runtimes even when using extremely large data repositories.

A Novel Approach for Measurement of Source Code Similarity

- Computer Science
- 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO)
- 2020

A novel approach for measuring code similarity is presented, which should lead to fairer evaluation of assignments, since with the help of this tool the authors can easily detect copying of programming assignments, code, or functions.

Plagiarism Detection in Programming Exercises Using a Markov Model Approach

- Computer Science
- 2013

A Markov model based source code plagiarism detection tool is proposed (called MPlag), which applies Markov models to students' coding process data recorded while writing a program to detect plagiarism.

## References


An algorithmic approach to the detection and prevention of plagiarism

- Computer Science
- SGCS
- 1976

This paper discusses one possible quantification which works well when applied to student computer programs and shows how this problem can be reduced by quantifying papers in such a way that equivalent papers are given equal values.

Winnowing: local algorithms for document fingerprinting

- Computer Science
- SIGMOD '03
- 2003

The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any finger-printing technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.

On the Performance of Data Compression Algorithms Based Upon String Matching

- Computer Science
- IEEE Trans. Inf. Theory
- 1998

It is shown that the resulting compression rate converges with probability one to a quantity computable as the infimum of an information theoretic functional over a set of auxiliary random variables; the quantity is strictly greater than the rate distortion function of the source except in some symmetric cases.

A suboptimal lossy data compression based on approximate pattern matching

- Computer Science
- IEEE Trans. Inf. Theory
- 1997

For stationary mixing sequences, the problem investigated by Steinberg and Gutman is settled by showing that a lossy extension of the Wyner-Ziv (1989) scheme cannot be optimal, and the asymptotic behavior of the so-called approximate waiting time N_l is established.

An Introduction to Kolmogorov Complexity and Its Applications

- Computer Science
- Texts and Monographs in Computer Science
- 1993

The book presents a thorough treatment of the central ideas of Kolmogorov complexity and their applications, with a wide range of illustrative applications, and will be ideal for advanced undergraduate students, graduate students, and researchers in computer science, mathematics, cognitive sciences, philosophy, artificial intelligence, statistics, and physics.

A universal algorithm for sequential data compression

- Computer Science
- IEEE Trans. Inf. Theory
- 1977

The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.

Algorithmic clustering of music

- Computer Science
- Proceedings of the Fourth International Conference on Web Delivering of Music, 2004. WEDELMUSIC 2004.
- 2004

We present a method for hierarchical music clustering, based on compression of strings that represent the music pieces. The method uses no background knowledge about music whatsoever: it is…

An information-based sequence distance and its application to whole mitochondrial genome phylogeny

- Biology
- Bioinform.
- 2001

A sequence distance that works on unaligned sequences using the information theoretical concept of Kolmogorov complexity and a program to estimate this distance is presented.

Distance based indexing for string proximity search

- Computer Science
- Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405)
- 2003

It is shown that several distance measures, such as the compression distance and weighted character edit distance, are almost metrics, and it is shown how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space.

A compression algorithm for DNA sequences and its applications in genome comparison

- Physics
- RECOMB '00
- 2000

A theory of measuring the relatedness between two DNA sequences is presented, along with strong experimental support for it, demonstrated by correctly constructing a 16S (18S) rRNA tree and a whole genome tree for several species of bacteria.