Near duplicate detection in an academic digital library

  title={Near duplicate detection in an academic digital library},
  author={Kyle Williams and C. Lee Giles},
  booktitle={ACM Symposium on Document Engineering},
The detection and potential removal of duplicates is desirable for a number of reasons, such as reducing unnecessary storage and computation and providing users with uncluttered search results. This paper describes an investigation into the application of two scalable, state-of-the-art duplicate detection algorithms, simhash and shingles, to detecting near-duplicate documents in the CiteSeerX digital library. We empirically explored the duplicate detection methods and evaluated their…
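For readers unfamiliar with the two techniques named in the abstract, the following is a minimal Python sketch of how shingle-based and simhash-based near-duplicate comparison are commonly formulated. It is not the implementation used in the paper: the function names, the 3-word shingle size, the 64-bit fingerprint width, and the use of MD5 as the feature hash are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text.

    k=3 is an illustrative assumption, not a value taken from the paper.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts yield a single shingle made of all their words.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets; defined as 1.0 if both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def simhash(text, bits=64, k=3):
    """Compute a simhash fingerprint over the k-word shingles of text.

    Each shingle is hashed (MD5 here, as an assumption); every bit position
    receives +1 if the hash bit is set and -1 otherwise. The fingerprint has
    a 1 wherever the running total is positive.
    """
    counts = [0] * bits
    for feature in shingles(text, k):
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")
```

In this formulation, two documents are treated as near duplicates when the Jaccard similarity of their shingle sets exceeds a chosen threshold, or when the Hamming distance between their simhash fingerprints falls below one. Suitable thresholds depend on the collection and are not specified here.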
