Duplicate Record Detection: A Survey

@article{Elmagarmid2007DuplicateRD,
  title={Duplicate Record Detection: A Survey},
  author={A. Elmagarmid and Panagiotis G. Ipeirotis and V. Verykios},
  journal={IEEE Transactions on Knowledge and Data Engineering},
  year={2007},
  volume={19},
  pages={1-16}
}
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to…
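
To make the abstract's mention of similarity metrics concrete, here is a minimal sketch of one character-based metric, a normalized edit distance between two field values; the function names and the example are illustrative assumptions, not taken from the survey:

    # Minimal sketch: normalized edit distance (Levenshtein) between two field values.
    # Names and the example values are illustrative, not from the survey.
    def edit_distance(a: str, b: str) -> int:
        # Classic dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def field_similarity(a: str, b: str) -> float:
        # Normalize to [0, 1]; 1.0 means the strings are identical.
        if not a and not b:
            return 1.0
        return 1.0 - edit_distance(a.lower(), b.lower()) / max(len(a), len(b))

    print(field_similarity("Jon Smith", "John Smith"))  # 0.9, likely the same person
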
A Study of Progressive Techniques for Efficient Duplicate Detection
Databases contain very large datasets in which various duplicate records are present. Duplicate records occur when data entries are not stored in a uniform manner in the database; resolving the…
An Introduction to Duplicate Detection
This lecture closely examines the two main components needed to overcome the difficulties of automatically detecting duplicates: similarity measures, used to automatically identify duplicates when comparing two records, and algorithms developed to search for duplicates over very large volumes of data.
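
As a hedged illustration of how those two components fit together, the sketch below aggregates per-field similarities into a record-level match decision; the chosen fields, weights, and threshold are assumptions made for illustration only:

    # Illustrative sketch: combine per-field similarities into a record-level match decision.
    # The fields, weights, and 0.85 threshold are assumptions, not from the lecture.
    from difflib import SequenceMatcher

    WEIGHTS = {"name": 0.5, "address": 0.3, "city": 0.2}

    def field_similarity(a: str, b: str) -> float:
        # Cheap stand-in for any of the string similarity measures surveyed above.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def records_match(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
        score = sum(w * field_similarity(r1.get(f, ""), r2.get(f, "")) for f, w in WEIGHTS.items())
        return score >= threshold

    r1 = {"name": "Jon Smith",  "address": "12 Main St",     "city": "Springfield"}
    r2 = {"name": "John Smith", "address": "12 Main Street", "city": "Springfield"}
    print(records_match(r1, r2))  # True under these assumptions
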
Data Duplicate Detection
  • Nikita Medidar, M. Chavan
  • Computer Science
  • 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT)
  • 2018
This paper implements two methods of duplicate detection and compares them to traditional methods to exhibit their efficiency.
Framework for Evaluating Clustering Algorithms in Duplicate Detection
This work uses Stringer to evaluate the quality of the clusters obtained from several unconstrained clustering algorithms used in concert with approximate join techniques, and reveals that some clustering algorithms that have never been considered for duplicate detection perform extremely well in terms of both accuracy and scalability.
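
For readers unfamiliar with this clustering step, here is a minimal sketch (not Stringer itself) of the simplest option: grouping the pairwise matches produced by an approximate join into duplicate clusters via connected components:

    # Minimal sketch (not the Stringer framework): connected-components clustering
    # of the pairwise matches produced by an approximate join.
    from collections import defaultdict

    def cluster_matches(record_ids, match_pairs):
        # Build an undirected graph from matching pairs, then take connected components.
        graph = defaultdict(set)
        for a, b in match_pairs:
            graph[a].add(b)
            graph[b].add(a)
        seen, clusters = set(), []
        for rid in record_ids:
            if rid in seen:
                continue
            component, stack = [], [rid]
            seen.add(rid)
            while stack:
                node = stack.pop()
                component.append(node)
                for neigh in graph[node]:
                    if neigh not in seen:
                        seen.add(neigh)
                        stack.append(neigh)
            clusters.append(component)
        return clusters

    print(cluster_matches([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
    # [[1, 2, 3], [4, 5]] -- the pairwise matches collapse into two duplicate clusters
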
Progressive of Duplicate Detection Using Adaptive Window Technique
The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution, also known as duplicate detection or record linkage, is used as a part…
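
A hedged sketch of the adaptive-window idea: after sorting, start with a small window and grow it as long as duplicates keep turning up, falling back to the minimum otherwise. The sort key, threshold, and growth rule are assumptions for illustration, not the paper's exact algorithm:

    # Hedged sketch of an adaptive window over a sorted record list: grow the window
    # while duplicates keep appearing, otherwise fall back to the minimum size.
    # Sort key, threshold, and growth rule are illustrative assumptions.
    from difflib import SequenceMatcher

    def is_duplicate(a, b, threshold=0.9):
        return SequenceMatcher(None, a, b).ratio() >= threshold

    def adaptive_window_pairs(records, sort_key, min_window=2, max_window=6):
        ordered = sorted(records, key=sort_key)
        matches, window = [], min_window
        for i in range(len(ordered) - 1):
            found = False
            for j in range(i + 1, min(i + window, len(ordered))):
                if is_duplicate(ordered[i], ordered[j]):
                    matches.append((ordered[i], ordered[j]))
                    found = True
            # grow the window when duplicates were found, otherwise reset it
            window = min(window + 1, max_window) if found else min_window
        return matches

    names = ["smith john", "smith jon", "smithe john", "adams ada", "brown pat"]
    print(adaptive_window_pairs(names, sort_key=lambda s: s))
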
DuDe: The Duplicate Detection Toolkit
This paper presents the DuDe architecture and its workflow for duplicate detection, and shows that DuDe makes it easy to compare different algorithms and similarity measures, which is an important step towards a duplicate detection benchmark.
A Survey on Removal of Duplicate Records in Database
A thorough analysis of similarity metrics for identifying similar fields in records and a set of algorithms and duplicate detection tools for detecting and removing replicas from the database are presented.
An Efficient Method to Detect Duplicate Data in Databases
Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time:…
Duplicate Detection with Map Reduce and Deletion Procedure
This paper studies the progressive duplicate detection algorithm with the help of MapReduce to detect duplicate data and delete those duplicate records.
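
A rough, in-memory imitation of what a MapReduce-style duplicate detection pass looks like: the map step emits a blocking key per record, the shuffle groups records by key, and the reduce step compares pairs within each group. The blocking key and threshold are assumptions, and a real deployment would run this as an actual MapReduce or Spark job:

    # In-memory imitation of a MapReduce-style duplicate detection pass.
    # map: emit (blocking_key, record); shuffle: group by key; reduce: compare within a group.
    # Using the last-name prefix as the blocking key is an illustrative assumption.
    from collections import defaultdict
    from difflib import SequenceMatcher
    from itertools import combinations

    def map_phase(record):
        yield (record["name"].split()[-1][:4].lower(), record)

    def reduce_phase(group, threshold=0.85):
        # Pairwise comparison only within one blocking group.
        for r1, r2 in combinations(group, 2):
            if SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio() >= threshold:
                yield (r1["id"], r2["id"])

    records = [
        {"id": 1, "name": "John Smith"},
        {"id": 2, "name": "Jon Smith"},
        {"id": 3, "name": "Mary Jones"},
    ]

    groups = defaultdict(list)            # the shuffle step
    for rec in records:
        for key, value in map_phase(rec):
            groups[key].append(value)

    duplicates = [pair for group in groups.values() for pair in reduce_phase(group)]
    print(duplicates)                     # [(1, 2)] under these assumptions
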
A generalization of blocking and windowing algorithms for duplicate detection
This work presents a new algorithm called Sorted Blocks in several variants, which generalizes both blocking and windowing for duplicate detection, and shows that the new algorithm needs fewer comparisons to find the same number of duplicates.
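
To make concrete the two families that Sorted Blocks generalizes, here is a hedged side-by-side sketch of the candidate pairs that plain blocking and plain windowing each generate; both key functions are assumptions for illustration:

    # Sketch of the two candidate-generation families that Sorted Blocks generalizes.
    # Blocking: compare only records sharing a blocking key (disjoint partitions).
    # Windowing: sort by a key and compare records inside a sliding window.
    # The key functions below are illustrative assumptions.
    from collections import defaultdict
    from itertools import combinations

    def blocking_pairs(records, block_key):
        blocks = defaultdict(list)
        for r in records:
            blocks[block_key(r)].append(r)
        return [pair for block in blocks.values() for pair in combinations(block, 2)]

    def windowing_pairs(records, sort_key, window=3):
        ordered = sorted(records, key=sort_key)
        return [(ordered[i], ordered[j])
                for i in range(len(ordered))
                for j in range(i + 1, min(i + window, len(ordered)))]

    names = ["smith john", "smith jon", "smyth john", "adams ada"]
    print(blocking_pairs(names, block_key=lambda s: s[:4]))  # a 4-char key block misses smith/smyth
    print(windowing_pairs(names, sort_key=lambda s: s))      # the window still pairs smith with smyth
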

References

Showing 1-10 of 142 references
An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records
An efficient algorithm for recognizing clusters of approximately duplicate records, which typically reduces by over 75% the number of times that the expensive pair-wise record matching (Smith-Waterman or other) is applied, without impairing accuracy.
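
One way such savings can arise, sketched under loose assumptions rather than as the paper's exact algorithm: keep records in a union-find structure and skip the expensive comparator whenever two records are already known to share a cluster:

    # Hedged sketch: use union-find clusters to skip expensive pairwise comparisons.
    # This illustrates the general idea, not the paper's exact algorithm, and
    # SequenceMatcher stands in for a costly comparator such as Smith-Waterman.
    from difflib import SequenceMatcher

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    def expensive_match(a, b, threshold=0.9):
        return SequenceMatcher(None, a, b).ratio() >= threshold

    names = ["john smith", "jon smith", "john smyth", "ada adams"]
    comparisons = 0
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if find(names[i]) == find(names[j]):
                continue                    # already in the same duplicate cluster
            comparisons += 1
            if expensive_match(names[i], names[j]):
                union(names[i], names[j])

    print("expensive comparisons:", comparisons)  # 5 instead of all 6 pairs
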
Automating the approximate record-matching process
This paper addresses the problem of matching records that refer to the same entity by computing their similarity, deploying advanced data-mining techniques to deal with the high computational and inferential complexity of approximate record matching.
Approximate String Joins in a Database (Almost) for Free
This paper develops a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them, and demonstrates experimentally the benefits of the technique over the direct use of UDFs.
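
The core trick behind this line of work is to decompose strings into short overlapping q-grams so that ordinary equi-joins and count filters can approximate a string-similarity join. Below is a minimal Python sketch of the q-gram decomposition and overlap score; the paper itself expresses the computation in plain SQL over q-gram tables:

    # Minimal sketch of the q-gram idea behind approximate string joins:
    # decompose each string into overlapping q-grams, then treat a large q-gram
    # overlap as evidence of similarity. Padding and q=3 are common choices.
    def qgrams(s, q=3):
        padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)   # pad so the edges get grams too
        return {padded[i:i + q] for i in range(len(padded) - q + 1)}

    def qgram_overlap(a, b, q=3):
        ga, gb = qgrams(a, q), qgrams(b, q)
        return len(ga & gb) / max(len(ga), len(gb))

    print(qgrams("smith"))
    print(qgram_overlap("smith", "smyth"))   # moderate overlap, a likely match candidate
    print(qgram_overlap("smith", "adams"))   # no overlap
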
Searching with numbers
This work proposes a new approach to searching specification documents by first establishing correspondences between values and their names; the approach does not require this correspondence to be established accurately and achieves high precision in the answers on real datasets from a variety of domains.
Text joins in an RDBMS for web data integration
This paper adopts the widely used and established cosine similarity metric from the information retrieval field in order to identify potential string matches across web sources, and implements the join inside an RDBMS, using SQL queries, for scalability and robustness reasons.
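
A hedged sketch of the token-based cosine similarity with TF-IDF weights that this paper (and the WHIRL work below) builds on; the tokenization and toy corpus are assumptions for illustration only:

    # Sketch of token-based cosine similarity with TF-IDF weights.
    # The tokenization and the toy corpus are illustrative assumptions.
    import math
    from collections import Counter

    corpus = ["AT&T Corp", "AT&T Corporation", "International Business Machines", "IBM Corp"]

    def tokens(s):
        return s.lower().split()

    # Inverse document frequency over the toy corpus.
    doc_tokens = [set(tokens(d)) for d in corpus]
    idf = {t: math.log(len(corpus) / sum(t in dt for dt in doc_tokens))
           for d in doc_tokens for t in d}

    def tfidf_vector(s):
        counts = Counter(tokens(s))
        return {t: c * idf.get(t, 0.0) for t, c in counts.items()}

    def cosine(s1, s2):
        v1, v2 = tfidf_vector(s1), tfidf_vector(s2)
        dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
        n1 = math.sqrt(sum(w * w for w in v1.values()))
        n2 = math.sqrt(sum(w * w for w in v2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    print(cosine("AT&T Corp", "AT&T Corporation"))                 # nonzero: shared tokens
    print(cosine("AT&T Corp", "International Business Machines"))  # 0.0: nothing in common
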
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem
This paper develops a system for accomplishing this data cleansing task, demonstrates its use for cleansing lists of names of potential customers in a direct-marketing application, and reports on a successful implementation for a real-world database that conclusively validates results previously achieved for statistically generated data.
Learning to match and cluster large high-dimensional data sets for data integration
Techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain, are described.
Integration of heterogeneous databases without common domains using queries based on textual similarity
This paper rejects the assumption that global domains can be easily constructed and assumes instead that names are given in natural-language text; it proposes a logic called WHIRL that reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval.
Multi-Relational Record Linkage
Data cleaning and integration is typically the most expensive step in the KDD process. A key part, known as record linkage or de-duplication, is identifying which records in a database refer to the…
Data integration using similarity joins and a word-based information representation language
WHIRL is described: a “soft” database management system that supports “similarity joins” based on certain robust, general-purpose similarity metrics for text, which enable fragments of text to be used as keys.