Corpus ID: 12340004

Efficient Algorithm for Near Duplicate Documents Detection

Gaudence Uwamahoro, Zhang Zuping
Identifying duplicate or near-duplicate documents in a collection is one of the major problems in information retrieval. Several methods to detect such documents have been proposed, but their relevance is still an issue. In this paper we propose an algorithm based on word position that reduces the candidate set to be searched and increases efficiency and effectiveness for partial document relevance. In our experiments the results show that during search process for the…
3 Citations

Classification of the Approaches of the Near Duplicate Document Detection and Elimination
Efficient Algorithm for Removing Duplicate Documents


A Sentence-Based Copy Detection Approach for Web Documents
  • 62
Detecting near-duplicates for web crawling
  • 552
Document similarity based on concept tree distance
  • 74
Inverted files for text search engines
  • 1,135
Novelty detection based on sentence level patterns
  • 59
Detection of near-duplicate user generated contents: the SMS spam collection
  • 22
Retrieval and novelty detection at the sentence level
  • 301