Identifying Duplicate and Contradictory Information in Wikipedia

Abstract

In this paper, we identify sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages. This is accomplished with a MapReduce implementation of minhash to identify sentences with high Jaccard similarity, followed by a pass to generate sentence clusters. Based on manual…
DOI: 10.1145/2756406.2756947
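
To make the technique named in the abstract concrete, the following is a minimal single-machine sketch of minhash-based Jaccard estimation followed by a clustering pass. It is not the paper's MapReduce implementation. The word 3-gram shingling, the 128-value signature length, the linear hash family, the 0.5 threshold, the all-pairs comparison, and the union-find clustering are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import hashlib
import random

NUM_HASHES = 128        # assumed signature length, not taken from the paper
PRIME = (1 << 61) - 1   # Mersenne prime used as the modulus of the hash family


def shingles(sentence, k=3):
    """Return the set of word k-grams for a sentence (assumed tokenization)."""
    words = sentence.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stable_hash(token):
    """Deterministic 64-bit hash; the built-in hash() is salted per process."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(n=NUM_HASHES, seed=0):
    """Draw n (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(items, params):
    """For each hash function, keep the minimum hash value over the set."""
    hashed = [stable_hash(x) % PRIME for x in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def cluster(sentences, threshold=0.5, params=None):
    """Group sentence indices whose estimated similarity reaches the threshold."""
    params = params or make_hash_params()
    sigs = [minhash_signature(shingles(s), params) for s in sentences]
    parent = list(range(len(sentences)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: fine for a sketch, quadratic in the sentence count.
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if estimate_jaccard(sigs[i], sigs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(sentences)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster(list_of_sentences)` returns lists of sentence indices, one list per cluster. At Wikipedia scale the quadratic pair loop would not be practical; a banding scheme such as locality-sensitive hashing over the signatures is the usual way to limit candidate pairs, though whether the paper does this is not stated in the abstract.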

3 Figures and Tables