Near-duplicate detection using GPU-based simhash scheme

Abstract

With the rapid growth of data, near-duplicate documents bearing high similarity are abundant. Elimination of near-duplicates can reduce storage cost and improve the quality of search indexes in data mining. A challenging problem is to find near-duplicate records in large-scale collections efficiently. There have already been several efforts on implementing…
DOI: 10.1109/SMARTCOMP.2014.7043862


8 Figures and Tables
