A repetition based measure for verification of text collections and for text categorization

@inproceedings{Khmelev2003ARB,
  title={A repetition based measure for verification of text collections and for text categorization},
  author={Dmitry V. Khmelev and William John Teahan},
  booktitle={SIGIR},
  year={2003}
}
We suggest a way of locating duplicates and plagiarism in a text collection using an R-measure, which is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. The R-measure can be effectively computed using the suffix array data structure. Additionally, the computation procedure can be improved to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they…
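The definition in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "suffixes repeated in other documents" means, for each suffix of the text, the length of its longest prefix that occurs somewhere in another document, and it assumes the sum is normalized by l(l+1)/2 (for a text of length l) so that a text fully contained in another document scores 1. The function names are hypothetical, and a naively sorted list of suffix strings stands in for a real suffix array.

```python
from bisect import bisect_left


def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_suffix_index(docs):
    """Naive stand-in for a suffix array: every suffix of every document, sorted.

    This uses quadratic memory; a real suffix array stores only start offsets.
    """
    return sorted(d[i:] for d in docs for i in range(len(d)))


def longest_repeated_prefix(s, suffixes):
    """Length of the longest prefix of s that occurs in any indexed document.

    In sorted order, the suffix sharing the longest prefix with s is adjacent
    to the position where s would be inserted, so two neighbours suffice.
    """
    pos = bisect_left(suffixes, s)
    best = 0
    for j in (pos - 1, pos):
        if 0 <= j < len(suffixes):
            best = max(best, common_prefix_len(s, suffixes[j]))
    return best


def r_measure(text, others):
    """Assumed R-measure: normalized sum of repeated-prefix lengths over all suffixes."""
    l = len(text)
    if l == 0:
        return 0.0
    suffixes = build_suffix_index(others)
    total = sum(longest_repeated_prefix(text[i:], suffixes) for i in range(l))
    # Assumed normalization: l(l+1)/2 is the largest possible value of the sum.
    return 2.0 * total / (l * (l + 1))
```

Under these assumptions, `r_measure("abc", ["xxabcxx"])` reaches the maximum because every suffix of the text is fully repeated, while a text sharing no characters with the other documents scores 0. Values near 1 would then flag candidate duplicates.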
Semantic Scholar estimates that this publication has 91 citations based on the available data.