Using Clustering to Identify Outlier Chunks of Text - Notebook for PAN at CLEF 2011

@inproceedings{Akiva2011UsingCT,
  title={Using Clustering to Identify Outlier Chunks of Text - Notebook for PAN at CLEF 2011},
  author={Navot Akiva},
  booktitle={CLEF},
  year={2011}
}
Intrinsic plagiarism detection is a sub-task of authorship identification in which outlier chunks must be detected solely on the basis of stylistic differences from the main body of the text. We present a first attempt at utilizing words that appear infrequently in a text as stylistic markers for distinguishing outlier chunks in the text. In the first phase of our method we cluster chunks of text represented by usage of infrequent words. In the second phase, we use a training corpus to identify…
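As a rough illustration of the first phase only, the sketch below represents each chunk by the infrequent words it contains and then groups the chunks. Everything concrete here is an assumption made for illustration and is not taken from the paper: the fixed-size word chunking, the chunk-frequency thresholds (`min_chunks`, `max_chunks`), Jaccard similarity, average-linkage agglomerative clustering, and the rule that the smallest cluster is the outlier. The second phase, which relies on a training corpus, is not sketched because the abstract is truncated at that point.

```python
from collections import Counter
from itertools import combinations


def split_into_chunks(text, chunk_size):
    """Split text into consecutive chunks of chunk_size words (the last may be shorter)."""
    words = text.lower().split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def infrequent_vocabulary(chunks, min_chunks=2, max_chunks=None):
    """Words found in at least min_chunks and at most max_chunks chunks (assumed thresholds)."""
    if max_chunks is None:
        max_chunks = max(min_chunks, len(chunks) // 2)
    chunk_freq = Counter(word for chunk in chunks for word in set(chunk))
    return sorted(w for w, n in chunk_freq.items() if min_chunks <= n <= max_chunks)


def featurize(chunks, vocab):
    """Represent each chunk as the set of infrequent words it uses."""
    vocab = set(vocab)
    return [vocab & set(chunk) for chunk in chunks]


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as dissimilar (0.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_similarity(cluster_a, cluster_b, features):
    """Mean pairwise Jaccard similarity between the members of two clusters."""
    pairs = [(i, j) for i in cluster_a for j in cluster_b]
    return sum(jaccard(features[i], features[j]) for i, j in pairs) / len(pairs)


def cluster_chunks(features, n_clusters=2):
    """Average-linkage agglomerative clustering over chunk indices."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_similarity(clusters[p[0]], clusters[p[1]], features),
        )
        clusters[a].extend(clusters[b])
        del clusters[b]  # safe: combinations guarantees a < b
    return [sorted(c) for c in clusters]


def outlier_chunks(clusters):
    """Assumed rule: the smallest cluster holds the outlier chunks."""
    if len(clusters) < 2:
        return []
    return min(clusters, key=len)
```

A call sequence would be `chunks = split_into_chunks(text, chunk_size)`, then `features = featurize(chunks, infrequent_vocabulary(chunks))`, then `outlier_chunks(cluster_chunks(features))`, which returns the indices of the flagged chunks. Words that occur in only one chunk are excluded by default because they cannot link any two chunks together.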