Using Trainable Duplicate Detection for Automated Public Data Refining

Abstract

Public institutions publish important data on the Web. These data are essential for public investigation and thus increase transparency. However, they are difficult to process because they contain numerous typos, ambiguities, and duplicates. In this paper, we propose an automated approach to cleaning these data so that the results of subsequent queries are reliable. We develop a duplicate detection method that learns feature weights from a small number of training samples and then predicts duplicates in the rest of the data. We evaluate our method on two real-world data sets.
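The abstract does not say which features or learning algorithm the method uses, so the following is only a minimal sketch of the general idea: compute similarity features for a pair of records, learn one weight per feature from a few labeled pairs, and then score unseen pairs. Everything in it is an assumption made for illustration, including the two features (character similarity and token overlap), the use of logistic regression, the function names, and the made-up institution names in the training data.

```python
import math
from difflib import SequenceMatcher

# Hypothetical labeled pairs (1 = duplicate, 0 = distinct); not data from the paper.
TRAIN_PAIRS = [
    ("Ministry of Finance", "ministry of finance"),
    ("Ministry of Finance", "Ministry of Finace"),
    ("City of Bratislava", "City of Bratislva"),
    ("Slovak Post", "Slovak Pots"),
    ("Ministry of Finance", "City of Bratislava"),
    ("Ministry of Health", "Slovak Post"),
    ("City of Kosice", "Ministry of Finance"),
    ("Slovak Post", "City of Bratislava"),
]
TRAIN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]


def features(a, b):
    """Return [bias, character similarity, token Jaccard overlap] for two strings."""
    a_norm, b_norm = a.lower().strip(), b.lower().strip()
    char_sim = SequenceMatcher(None, a_norm, b_norm).ratio()
    tokens_a, tokens_b = set(a_norm.split()), set(b_norm.split())
    union = tokens_a | tokens_b
    token_sim = len(tokens_a & tokens_b) / len(union) if union else 1.0
    return [1.0, char_sim, token_sim]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, a, b):
    """Estimated probability that a and b refer to the same entity."""
    return sigmoid(sum(w * x for w, x in zip(weights, features(a, b))))


def train(pairs, labels, epochs=2000, lr=0.5):
    """Fit feature weights with stochastic gradient ascent on the log-likelihood."""
    rows = [features(a, b) for a, b in pairs]
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            weights = [w + lr * (y - p) * xi for w, xi in zip(weights, x)]
    return weights


def is_duplicate(weights, a, b, threshold=0.5):
    """Classify a pair as duplicate when its score reaches the threshold."""
    return score(weights, a, b) >= threshold


if __name__ == "__main__":
    learned = train(TRAIN_PAIRS, TRAIN_LABELS)
    print("weights:", learned)
    print(is_duplicate(learned, "National Bank", "National Bnak"))
    print(is_duplicate(learned, "National Bank", "Ministry of Culture"))
```

In this sketch the learned weights indicate how much each similarity feature contributes to the duplicate decision. A real data set would likely need more features (for example, separate comparisons per record field) and a held-out set for choosing the threshold.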

Cite this paper

@inproceedings{Liptak2012UsingTD, title={Using Trainable Duplicate Detection for Automated Public Data Refining}, author={Martin Liptak}, year={2012} }