Corpus ID: 16306957

Methods for Intrinsic Plagiarism Detection and Author Diarization

Mikhail P. Kuznetsov, Anastasia Motrenko, Rita Kuznetsova, Vadim V. Strijov
The paper investigates methods for intrinsic plagiarism detection and author diarization. We developed a plagiarism detection method based on constructing an author style function from features of text sentences and detecting outliers. We adapted the method for the diarization problem by segmenting author style statistics on text parts, which correspond to different authors. Both methods were tested on the PAN-2011 collection for the intrinsic plagiarism detection and implemented for the PAN… 

Intrinsic Detection of Plagiarism based on Writing Style Grouping

A hybrid approach is presented that constructs a style function from stylometric features and detects outliers for intrinsic plagiarism detection; the reported results outperform those of the best state-of-the-art methods.

Intrinsic Plagiarism Detection with Feature-Rich Imbalanced Dataset Learning

The main idea is to profile the style of the original author and mark as outliers the passages that differ significantly from the rest of the document, treating the imbalance of the training dataset as a major parameter of the intrinsic plagiarism detection problem.

Intrinsic Plagiarism Detection System Using Stylometric Features and DBSCAN

A simplified approach is proposed for building an intrinsic plagiarism detector that can detect plagiarism even when no reference corpus is available and that has an easy-to-use interactive interface.

Automatic Generation of Summary Obfuscation Corpus for Plagiarism Detection

This paper uses a Named Entity Recognizer to identify the entities within an original document, its associated summaries, and target documents and uses this information to create a summary obfuscation corpus for the task of plagiarism detection.

Academic Plagiarism Detection

The integration of heterogeneous analysis methods for textual and non-textual content features using machine learning is seen as the most promising area for future research contributions to improve the detection of academic plagiarism further.

Detecting a Change of Style using Text Statistics: Notebook for PAN at CLEF 2018

This paper addresses the style change detection problem of the PAN’18 author identification task, treating it as a supervised problem with the whole text as a training object, and proposes an approach based on three types of features: text statistics, hashing, and high-dimensional text vectors.

Style Breach Detection with Neural Sentence Embeddings

A method is presented for the style breach detection task that maps sentences into a high-dimensional vector space using a pre-trained encoder-decoder model, then constructs an author style function and detects outliers.

On the use of character n-grams as the only intrinsic evidence of plagiarism

It is demonstrated empirically that the low- and the high-frequency n-grams are not equally relevant for intrinsic plagiarism detection, but their performance depends on the way they are exploited.

Overview of the Author Identification Task at PAN-2017: Style Breach Detection and Author Clustering

This edition of PAN focuses on style breach detection and author clustering, two unsupervised authorship analysis tasks, and provides both benchmark data and an evaluation framework to compare different approaches.

A Model for Style Breach Detection at a Glance: Notebook for PAN at CLEF 2018

This year’s PAN Author Identification sub-task on style change detection asks a single question, whether or not a document has multiple authors; this document proposes a simple, straightforward, and fast approach.



Intrinsic Plagiarism Detection Using Character n-gram Profiles

A new method is presented that attempts to quantify the style variation within a document using character n-gram profiles and a style change function based on an appropriate dissimilarity measure originally proposed for author identification.

Intrinsic Plagiarism Detection using N-gram Classes

A novel language-independent intrinsic plagiarism detection method is introduced, based on a new text representation called n-gram classes; its performance is comparable to the best state-of-the-art methods.

Intrinsic Plagiarism Detection

It is shown that it is possible to identify potentially plagiarized passages by analyzing a single document with respect to variations in writing style, and new features for the quantification of style aspects are added.

Outlier-Based Approaches for Intrinsic and External Plagiarism Detection

This work incorporates text outlier detection methodologies to enhance both intrinsic and external plagiarism detection and shows that the approach is highly competitive with those of the leading research teams in plagiarism detection.

External and Intrinsic Plagiarism Detection Using Vector Space Models

This work presents a conceptually simple space partitioning approach to achieve search times sublinear in the number of reference documents, trading precision for speed.

An Evaluation Framework for Plagiarism Detection

Empirical evidence is given that the construction of tailored training corpora for plagiarism detection can be automated, and hence be done on a large scale.

Overview of the 6th International Competition on Plagiarism Detection

This paper overviews 18 plagiarism detectors that have been developed and evaluated within PAN'10, highlighting several important aspects of plagiarism detection, such as obfuscation, intrinsic vs. external plagiarism, and plagiarism case length.

Clustering by Authorship Within and Across Documents

An overview of the shared tasks on author clustering and author diarization at PAN 2016 is presented including evaluation datasets, measures, results, as well as a survey of a total of 10 submissions.

A comparison of extrinsic clustering evaluation metrics based on formal constraints

This article defines a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families, and proposes a modified version of Bcubed that avoids the problems found with other metrics.

Segmenting Time Series: A Survey and Novel Approach

This paper undertakes the first extensive review and empirical comparison of all proposed techniques for mining time series data and introduces a novel algorithm that is empirically shown to be superior to all others in the literature.