Duplicate Detection of Records in Queries Using Clustering

Abstract

Detecting and eliminating duplicate data is one of the major problems in the broad area of data cleaning and data quality in data warehouses. The same logical real-world entity often has multiple representations in a data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses, which entails matching inexact duplicate records: records that refer to the same real-world entity while not being syntactically equivalent. This paper focuses on the efficient detection and elimination of duplicate data. The main objective of this work is to detect exact and inexact duplicates by using duplicate detection and elimination rules. This approach is used to improve the quality of the data.
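The distinction between exact and inexact duplicates can be illustrated with a minimal sketch. The abstract does not state the paper's detection rules or its clustering method, so everything here is an assumption made for illustration: the function names, the normalization step, the use of Python's `difflib` string similarity, and the `threshold` value are all hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase each field and collapse whitespace so trivial variants compare equal."""
    return tuple(" ".join(str(field).lower().split()) for field in record)


def similarity(a, b):
    """Average per-field string similarity in [0, 1], using difflib's ratio."""
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def cluster_duplicates(records, threshold=0.85):
    """Group record indices that are exact or inexact duplicates of each other.

    `threshold` is a hypothetical cut-off chosen for this sketch,
    not a value reported in the paper.
    """
    normalized = [normalize(r) for r in records]
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive pairwise comparison: quadratic in the number of records.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            exact = normalized[i] == normalized[j]
            if exact or similarity(normalized[i], normalized[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


records = [
    ("John Smith", "New York"),
    ("john  smith", "new york"),   # exact duplicate after normalization
    ("Jon Smith", "New York"),     # inexact duplicate (typographical error)
    ("Alice Brown", "Boston"),
]
print(cluster_duplicates(records))
```

With these sample records, the first three rows fall into one cluster and the last row stays on its own. The pairwise loop is only suitable for small inputs; a real data warehouse would need some way to limit which pairs are compared.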

2 Figures and Tables

Cite this paper

@inproceedings{Anitha2012DuplicateDO, title={Duplicate Detection of Records in Queries Using Clustering}, author={M. Anitha and Anand Srinivas and T. P. Shekhar and D. Sagar}, year={2012} }