Identifying Duplicate and Contradictory Information in Wikipedia


In this paper, we identify sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages. This is accomplished with a MapReduce implementation of minhash to identify sentences with high Jaccard similarity, followed by a pass to generate sentence clusters. Based on manual…
DOI: 10.1145/2756406.2756947
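
The abstract names two steps: minhash to find sentence pairs with high Jaccard similarity, then a pass that groups them into clusters. The following is a minimal single-machine sketch of that idea, not the paper's MapReduce implementation. The word 3-gram shingling, the 64 hash functions, the 0.8 threshold, the all-pairs comparison, and every function name (`shingles`, `minhash_signature`, `estimated_jaccard`, `cluster_sentences`) are illustrative assumptions and do not come from the paper.

```python
import hashlib
from itertools import combinations


def shingles(sentence, k=3):
    """Word k-grams of a sentence (assumed tokenization, not the paper's)."""
    words = sentence.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash; each seed acts as one hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=64):
    """Keep the minimum hash value of the token set under each hash function."""
    return [min(_seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def cluster_sentences(sentences, threshold=0.8, num_hashes=64):
    """Group sentence indices whose estimated similarity meets the threshold.

    Uses all-pairs comparison plus union-find; a large-scale version would
    need banding or a distributed job instead of the quadratic loop.
    """
    sigs = [minhash_signature(shingles(s), num_hashes) for s in sentences]
    parent = list(range(len(sentences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(sentences)), 2):
        if estimated_jaccard(sigs[i], sigs[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(sentences)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster_sentences` on a list of sentences returns lists of indices, one list per cluster. Identical sentences always share a signature, so they always land in the same cluster; highly similar sentences do so with a probability that depends on the threshold and the number of hash functions.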


3 Figures and Tables

