Partial duplicate detection for large book collections

Ismet Zeki Yalniz, Ethem F. Can, and R. Manmatha
A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words that appear only once in the book, kept in the order in which they occur in the text. These words are referred to as "unique words", and they constitute a small percentage of all the words in a typical book. Along with the order information, the set of unique words provides a compact…
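The representation described above can be illustrated with a minimal Python sketch. The function name `unique_word_sequence` and the tokenization rule (lowercased runs of ASCII letters) are assumptions made for illustration only; the abstract does not specify how the text is tokenized, and this sketch does not cover the matching stage.

```python
import re
from collections import Counter


def unique_word_sequence(text: str) -> list[str]:
    """Return the words that occur exactly once in `text`, in text order.

    Assumption: a "word" is a lowercased run of ASCII letters. The
    abstract does not state the tokenization, so this is illustrative.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Keep only tokens seen once; iterating over `tokens` preserves order.
    return [tok for tok in tokens if counts[tok] == 1]


if __name__ == "__main__":
    sample = "The cat saw the dog. The dog saw a bird."
    print(unique_word_sequence(sample))
```

Because a word that occurs once is kept whole or dropped whole, an OCR error in a frequent word changes the sequence only by adding the misrecognized form as a spurious unique word, which is one plausible reason such a representation tolerates noisy text.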
This paper has 24 citations.