Using Trainable Duplicate Detection for Automated Public Data Refining

Abstract

Public institutions publish important data on the Web. These data are essential for public investigation and thus increase transparency. However, they are difficult to process because they contain numerous typos, ambiguities, and duplicates. In this paper we propose an automated approach to cleaning these data so that the results of subsequent queries are reliable. We develop a duplicate detection method that learns feature weights from a small number of training samples and then predicts duplicates in the rest of the data. We evaluate our method on two real-world data sets.
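The idea of learning feature weights from a few labeled pairs and then scoring the remaining pairs can be illustrated with a minimal sketch. The abstract does not specify the features or the learning algorithm, so everything here is an assumption made for illustration: the record fields `name` and `city`, the string-similarity features, and the use of plain logistic regression as a stand-in learner are all hypothetical and are not the method of the paper.

```python
import math
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def features(rec_a, rec_b):
    """Feature vector for a record pair (hypothetical fields: name, city)."""
    return [similarity(rec_a["name"], rec_b["name"]),
            similarity(rec_a["city"], rec_b["city"])]


def probability(weights, bias, x):
    """Estimated probability that the pair with features x is a duplicate."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, epochs=2000, lr=0.5):
    """Fit feature weights with stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = probability(weights, bias, x) - label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def is_duplicate(weights, bias, x, threshold=0.5):
    return probability(weights, bias, x) >= threshold


# A small labeled sample (made-up records): 1 = duplicate, 0 = distinct.
finance = {"name": "Ministry of Finance", "city": "Prague"}
finance_typo = {"name": "Ministry of Finanse", "city": "Prague"}
hospital = {"name": "City Hospital Brno", "city": "Brno"}
hospital_typo = {"name": "City Hospitl Brno", "city": "Brno"}
court = {"name": "Regional Court Ostrava", "city": "Ostrava"}

labeled_pairs = [(finance, finance_typo, 1), (hospital, hospital_typo, 1),
                 (finance, hospital, 0), (court, finance, 0)]
X_train = [features(a, b) for a, b, _ in labeled_pairs]
y_train = [label for _, _, label in labeled_pairs]

weights, bias = train(X_train, y_train)

# Score a pair that was not part of the training sample.
unseen = features(court, {"name": "Regional Court Ostrava", "city": "Ostrava"})
print(is_duplicate(weights, bias, unseen))
```

In this sketch the learned weights indicate how strongly each similarity feature contributes to the duplicate decision; a real data set would need more features and a labeled sample that reflects its typical errors.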
