Strategies for retrieving plagiarized documents

Authors: Benno Stein, S. M. Eissen, Martin Potthast
For the identification of plagiarized passages in large document collections, we present retrieval strategies that rely on stochastic sampling and chunk indexes. Using the entire Wikipedia corpus, we compile n-gram indexes and compare them to a new kind of fingerprint index in a plagiarism analysis use case. Our index speeds up the analysis by a factor of 1.5 and is an order of magnitude smaller, while being equivalent in terms of precision and recall.
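To make the idea of a chunk index concrete, the following is a minimal sketch of a plain word n-gram index with optional random sampling of query chunks. It is not the fingerprint index described in the abstract, whose construction is not given here. All function names, the chunk length n=3, the truncated MD5 key, and the toy documents are assumptions chosen for illustration.

```python
import hashlib
import random
from collections import defaultdict


def word_ngrams(text, n=3):
    """Return the overlapping word n-grams (chunks) of a text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def chunk_hash(chunk):
    """Map a chunk to a short, stable key (truncated MD5, illustration only)."""
    return hashlib.md5(chunk.encode("utf-8")).hexdigest()[:8]


def build_chunk_index(docs, n=3):
    """Build an inverted index: chunk key -> ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for chunk in word_ngrams(text, n):
            index[chunk_hash(chunk)].add(doc_id)
    return index


def retrieve_candidates(index, query_text, n=3, sample_size=None, seed=0):
    """Count, per document, how many query chunks it shares.

    If sample_size is given, only a random sample of the query chunks
    is looked up (a simple stand-in for stochastic sampling).
    """
    chunks = word_ngrams(query_text, n)
    if sample_size is not None and sample_size < len(chunks):
        chunks = random.Random(seed).sample(chunks, sample_size)
    hits = defaultdict(int)
    for chunk in chunks:
        for doc_id in index.get(chunk_hash(chunk), ()):
            hits[doc_id] += 1
    return dict(hits)


# Hypothetical toy collection and suspicious text.
docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "an entirely different sentence about search engines",
}
index = build_chunk_index(docs)
suspicious = "yesterday the quick brown fox jumps over a fence"
candidates = retrieve_candidates(index, suspicious)
sampled = retrieve_candidates(index, suspicious, sample_size=3)
```

Documents with a high count of shared chunks are candidates for a detailed comparison. Sampling the query chunks trades some recall for fewer index lookups.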
An IR-Based Approach Utilizing Query Expansion for Plagiarism Detection in MEDLINE
Methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE are investigated, particularly when the original text has been modified through the replacement of words or phrases.
Information Retrieval Techniques for Corpus Filtering Applied to External Plagiarism Detection
A set of approaches for corpus filtering in the context of external document plagiarism detection is presented, including information retrieval methods and a document similarity measure based on a variant of tf-idf.
Plagiarism Detection: An Overview of Text Alignment Techniques
This thesis is mainly concerned with the detailed analysis phase, more specifically with the problem of text alignment and the other subtasks that follow from it.
An n-gram based Method for nearly Copy Detection in Plagiarism Systems
An n-gram-based method for identifying similar textual parts between two documents is introduced; in evaluation it obtained both high accuracy and proper efficiency.
Application of Information Retrieval Techniques to Document Filtered Set Generation for External Plagiarism Detection
This paper presents an approach to generate document filtered sets using information retrieval techniques in the context of external document plagiarism detection, although the techniques detailed are applicable to any sort of documents or queries.
Plagiarism Detection for Indonesian Texts
A novel document representation in the candidate document retrieval module and the hybrid of segmentation and similarity of hashing technique in the comparison module for Plagiarism Detection for Indonesian texts are proposed.
Corpus and Evaluation Measures for Automatic Plagiarism Detection
A new large-scale corpus of artificial plagiarism is presented, useful for the evaluation of intrinsic as well as external plagiarism detection.
Experiments to investigate the utility of nearest neighbour metrics based on linguistically informed features for detecting textual plagiarism
This paper reports two experiments related to plagiarism detection in which a model of distributional semantics and of sentence stylistics is used to compare, sentence by sentence, the likelihood of a text being partly plagiarised.
Plagiarism detection by similarity join
A plagiarism detection algorithm that allows us to quickly compare online news articles with a collection of personal news articles and detect plagiarized passages with the same quality as a human is presented.
Hybrid plagiarism detection method for French language
With the growth of the content found throughout the Web, any information can be plagiarized. Plagiarism is the process of using the ideas of another without naming the source. Consequently, …


Methods for Identifying Versioned and Plagiarized Documents
The identity measure and the best fingerprinting technique are both able to accurately identify coderivative documents, and it is demonstrated that the identity measure is clearly superior for fingerprinting parameters.
A Scalable System for Identifying Co-derivative Documents
Spex is presented, a novel hash-based algorithm for extracting duplicated chunks from a document collection and deco, a prototype system that makes use of spex, is described.
Indexing Shared Content in Information Retrieval Systems
This paper describes a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once, and shows how this representation model can be encoded in an inverted index.
Similarity Search in High Dimensions via Hashing
Experimental results indicate that the novel scheme for approximate similarity search based on hashing scales well even for a relatively large number of dimensions, and provides experimental evidence that the method gives improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition.
Fuzzy-Fingerprints for Text-Based Information Retrieval
This paper introduces a particular form of fuzzy-fingerprints: their construction, their interpretation, and their use in the field of information retrieval as well as the way of using them within a similarity search.
A Practical Minimal Perfect Hashing Method
A novel algorithm based on random graphs to construct minimal perfect hash functions h, which outputs h in expected time O(n) for a set of n keys, improves the space requirement to 55% of a previous minimal perfect hashing scheme.
SIGIR 2007 Proceedings Poster