Corpus ID: 15277561

A Probabilistic Deduplication, Record Linkage and Geocoding System

@inproceedings{Churches2005APD,
  title={A Probabilistic Deduplication, Record Linkage and Geocoding System},
  author={Tim Churches and Peter Christen},
  year={2005}
}
In many data mining projects in the health sector, information from multiple data sources needs to be cleaned, deduplicated and linked to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient. Most of the time, the linkage process is challenged by the lack of a common unique entity identifier. Additionally, personal information, such as names and addresses, is frequently recorded with typographical errors and can be… 
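The abstract describes comparing error-prone name and address fields when no shared identifier exists. As a minimal illustration of the probabilistic idea (not the system's actual implementation), the sketch below scores a record pair with Fellegi-Sunter-style log-likelihood agreement weights; the field names and the m- and u-probabilities are invented for the example.

```python
import math

# Illustrative m- and u-probabilities per field: the probability that a
# field agrees for true matches (m) versus non-matches (u). Real systems
# estimate these from data; the values and field names here are invented.
FIELD_PROBS = {
    "surname":    (0.95, 0.05),
    "given_name": (0.90, 0.10),
    "suburb":     (0.85, 0.20),
}

def match_score(rec_a, rec_b):
    """Sum Fellegi-Sunter-style log-likelihood-ratio weights over fields:
    positive contribution when a field agrees, negative when it disagrees."""
    score = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a.get(field, "").lower() == rec_b.get(field, "").lower():
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score

a = {"surname": "Smith", "given_name": "John", "suburb": "Newtown"}
b = {"surname": "Smith", "given_name": "Jon",  "suburb": "Newtown"}
print(round(match_score(a, b), 2))  # → 3.17 (agreements outweigh the typo)
```

In a full system, thresholds on such a score classify pairs as matches, possible matches, or non-matches; the possible matches go to clerical review.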


Geocoding Billions of Addresses: Toward a Spatial Record Linkage System with Big Data
An in-house TIGER/Line based hierarchical geocoder, Intelius Address Parser (IAP), is developed that provides geocoding precision on par with online geocoding APIs.
An evaluation framework for comparing geocoding systems
The evaluation framework developed in this research is proven successful in differentiating between key capabilities of geocoding systems that are important in the context of a large organization with significant investments in geocoding resources.
A Survey of Entity Resolution and Record Linkage Methodologies
This study surveys the literature for the methodologies proposed or developed for entity resolution and record linkage and provides a foundation for solving many problems in data warehousing.
An Approach to Geocoding based on Volunteered Spatial Data
The goal of the work summarized in this paper was to explore the suitability of freely available volunteered geographic information for the purpose of geocoding; the findings serve as a proof of concept for using volunteered spatial data as a reference dataset for geocoding services.
Multiple valued logic approach for matching patient records in multiple databases
An Approach of Standardization and Searching based on Hierarchical Bayesian Clustering (HBC) for Record Linkage System
  • Zin War Tun, N. Thein
  • Computer Science
    Fifth International Conference on Creating, Connecting and Collaborating through Computing (C5 '07)
  • 2007
This paper proposes a record linkage framework focused on standardization, and enhances searching by adopting a cluster-based method, Hierarchical Bayesian Clustering (HBC), which makes record-pair comparison more efficient while improving record linkage accuracy.
Improving Geocoding Match Rates with Spatially‐Varying Block Metrics
Describes the technical approach of a geocoding system that includes a nearby-matching step, along with a method for scoring candidates based on spatially‑varying neighborhoods; results indicate the approach is viable for improving match rates while maintaining acceptable spatial accuracy.
On the suitability of Volunteered Geographic Information for the purpose of geocoding
The goal of the work summarized in this paper was to explore the suitability of volunteered geographic information for the purpose of geocoding; no freely available geocoding service offering house-number-level precision had so far been implemented on the basis of volunteered geographic data.

References

Showing 1–10 of 34 references
TAILOR: a record linkage toolbox
The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.
A Probabilistic Geocoding System based on a National Address File
This paper describes a geocoding system that is based on a comprehensive high-quality geocoded national address database that uses a learning address parser based on hidden Markov models to separate free-form addresses into components, and a rule-based matching engine to determine the best set of candidate matches to a reference file.
Preparation of name and address data for record linkage using hidden Markov models
Lexicon-based tokenisation and HMMs provide a viable and effort-effective alternative to rule-based systems for pre-processing more complex variably formatted data such as addresses.
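The HMM-based segmentation of free-form addresses described above can be illustrated with a toy Viterbi decoder. Every state, transition probability and emission rule below is a hand-set placeholder, not a parameter of the cited system, which learns them from training data.

```python
# Toy HMM over three address-component states; all numbers are illustrative.
STATES = ["house_num", "street_name", "street_type"]

START = {"house_num": 0.8, "street_name": 0.15, "street_type": 0.05}
TRANS = {
    "house_num":   {"house_num": 0.05, "street_name": 0.9,  "street_type": 0.05},
    "street_name": {"house_num": 0.02, "street_name": 0.38, "street_type": 0.6},
    "street_type": {"house_num": 0.1,  "street_name": 0.8,  "street_type": 0.1},
}

STREET_TYPES = {"st", "street", "rd", "road", "ave", "avenue"}

def emit(state, token):
    """Emission probability from coarse token features (digit vs lexicon word)."""
    if state == "house_num":
        return 0.9 if token.isdigit() else 0.05
    if state == "street_type":
        return 0.9 if token.lower() in STREET_TYPES else 0.05
    return 0.6 if token.isalpha() else 0.2   # street_name: generic word

def viterbi(tokens):
    """Return the most likely state sequence for a list of address tokens."""
    v = [{s: START[s] * emit(s, tokens[0]) for s in STATES}]
    back = [{}]
    for t in range(1, len(tokens)):
        v.append({}); back.append({})
        for s in STATES:
            prev, p = max(((r, v[t - 1][r] * TRANS[r][s]) for r in STATES),
                          key=lambda x: x[1])
            v[t][s] = p * emit(s, tokens[t])
            back[t][s] = prev
    state = max(v[-1], key=v[-1].get)
    path = [state]
    for t in range(len(tokens) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))

print(viterbi("42 george st".split()))
# → ['house_num', 'street_name', 'street_type']
```

The real systems use richer lexicon-based token features and many more states (unit numbers, localities, postcodes), but the decoding step is the same dynamic program.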
Decision Models for Record Linkage
This paper reviews several existing decision models and then proposes an enhancement to cluster-based decision models, which achieves the same accuracy of existing models while significantly reducing the number of record pairs required for manual review.
An extensible Framework for Data Cleaning
The main novelty of the work is that the framework permits the following performance optimizations which are tailored for data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down and short-circuited computation.
Approximate String Comparison and its Effect on an Advanced Record Linkage System
Overall matching efficacy is further improved by a linear assignment algorithm that forces one-to-one matching.
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem
This paper develops a system for accomplishing this data cleansing task, demonstrates its use for cleansing lists of names of potential customers in a direct-marketing application, and reports on a successful implementation for a real-world database that validates results previously achieved on statistically generated data.
A Comparison of Fast Blocking Methods for Record Linkage
This work compares two new blocking methods, bigram indexing and canopy clustering with TF-IDF (term frequency/inverse document frequency), against two older methods, standard blocking and sorted-neighbourhood blocking, and shows that the new methods offer the potential for large speed-ups and better accuracy.
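Bigram indexing, one of the blocking methods compared above, can be sketched roughly as follows: a blocking-key value is reduced to its sorted bigrams, and every sub-list of a threshold-determined length becomes an index key, so values with small typographical differences still share a block. The threshold value and helper names below are illustrative, not taken from the paper.

```python
from itertools import combinations
from math import ceil

def bigrams(value):
    """Sorted, deduplicated character bigrams of a normalized string."""
    s = value.replace(" ", "").lower()
    return sorted({s[i:i + 2] for i in range(len(s) - 1)})

def bigram_keys(value, threshold=0.6):
    """Blocking keys: every sub-list of ceil(threshold * n) of the value's
    n sorted bigrams. Records sharing any key fall into a common block."""
    grams = bigrams(value)
    k = max(1, ceil(threshold * len(grams)))
    return {"".join(combo) for combo in combinations(grams, k)}

# Two variant spellings still end up in at least one common block:
shared = bigram_keys("baxter") & bigram_keys("baxtar")
print(sorted(shared))  # → ['axbaxt']
```

Lower thresholds tolerate more errors at the cost of more (and larger) blocks, which is the speed/recall trade-off the paper measures.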
USING THE EM ALGORITHM FOR WEIGHT COMPUTATION IN THE FELLEGI-SUNTER MODEL OF RECORD LINKAGE
This paper describes a method for estimating weights using the EM algorithm under less restrictive assumptions, which automatically incorporates a Bayesian adjustment based on file characteristics.
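A bare-bones version of EM for Fellegi-Sunter weight estimation might look like the following, assuming conditional independence between fields and binary agreement patterns; the starting values and synthetic data are arbitrary, and the paper's Bayesian adjustment is omitted.

```python
def em_fellegi_sunter(patterns, n_iter=50):
    """Estimate P(match) and per-field m-/u-probabilities from binary
    field-agreement patterns via EM, assuming conditional independence
    between fields (the classical Fellegi-Sunter setting)."""
    n, k = len(patterns), len(patterns[0])
    p, m, u = 0.1, [0.8] * k, [0.2] * k          # arbitrary starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each pattern is a true match
        g = []
        for pat in patterns:
            pm, pu = p, 1 - p
            for i, agree in enumerate(pat):
                pm *= m[i] if agree else 1 - m[i]
                pu *= u[i] if agree else 1 - u[i]
            g.append(pm / (pm + pu))
        # M-step: re-estimate the mixture weight and m-/u-probabilities
        total = sum(g)
        p = total / n
        for i in range(k):
            agree_g = [gi for gi, pat in zip(g, patterns) if pat[i]]
            m[i] = sum(agree_g) / total
            u[i] = (len(agree_g) - sum(agree_g)) / (n - total)
    return p, m, u

# Synthetic agreement patterns: a few clear matches among many non-matches
patterns = [(1, 1, 1)] * 10 + [(0, 0, 0)] * 90 + [(1, 0, 1)] * 5
p, m, u = em_fellegi_sunter(patterns)
# On these data the m-probabilities drift toward 1, the u-probabilities toward 0
```

The resulting m and u feed directly into the log-likelihood-ratio agreement and disagreement weights used to score candidate record pairs.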
The merge/purge problem for large databases
This paper details the sorted neighborhood method used by some to solve merge/purge, presents experimental results demonstrating that this approach can work well in practice but at great expense, and shows a means of improving the accuracy of the results based on a multi-pass approach.
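The sorted neighborhood method described above can be sketched as: sort the records on a blocking key, then slide a fixed-size window over the sort order and compare only records that fall inside the window. The key function, window size and sample records below are illustrative.

```python
def sorted_neighbourhood_pairs(records, key, window=2):
    """Sort records by a blocking key, then pair each record only with the
    next (window - 1) records in sort order instead of with all others."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.append((min(i, j), max(i, j)))
    return pairs

records = ["smith, john", "smyth, jon", "jones, ann", "smith, j", "brown, bob"]
# Illustrative key: first three characters, so similar surnames sort together
pairs = sorted_neighbourhood_pairs(records, key=lambda r: r[:3], window=2)
print(pairs)  # → [(2, 4), (0, 2), (0, 3), (1, 3)] -- 4 pairs instead of 10
```

The multi-pass variant the paper advocates repeats this with several different keys (e.g. surname-first, then address-first) and unions the candidate pairs, recovering matches a single key would miss.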