A note on using the F-measure for evaluating record linkage algorithms

@article{Hand2018ANO,
  title={A note on using the F-measure for evaluating record linkage algorithms},
  author={David J. Hand and Peter Christen},
  journal={Statistics and Computing},
  year={2018},
  volume={28},
  pages={539--547}
}
Record linkage is the process of identifying and linking records about the same entities from one or more databases. Record linkage can be viewed as a classification problem where the aim is to decide whether a pair of records is a match (i.e. two records refer to the same real-world entity) or a non-match (two records refer to two different entities). Various classification techniques—including supervised, unsupervised, semi-supervised and active learning based—have been employed for record… 
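As a rough illustration of the pairwise classification view described above, the sketch below scores a set of predicted matches against a set of known true matches using precision, recall, and the standard F-measure (their harmonic mean). The function name `pairwise_f_measure` and the record identifiers are hypothetical, chosen only for this example; this is not the paper's own code or its proposed alternative.

```python
def pairwise_f_measure(true_matches, predicted_matches):
    """Return (precision, recall, F-measure) for sets of record pairs.

    Each pair is a tuple of two record identifiers. Pairs that appear in
    neither set are treated as true non-matches and do not affect the result.
    """
    true_matches = set(true_matches)
    predicted_matches = set(predicted_matches)

    tp = len(true_matches & predicted_matches)   # correctly predicted matches
    fp = len(predicted_matches - true_matches)   # predicted matches that are wrong
    fn = len(true_matches - predicted_matches)   # true matches that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data: four true matches, three predicted matches.
truth = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
predicted = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]

precision, recall, f_measure = pairwise_f_measure(truth, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

Here two of the three predicted pairs are correct and two of the four true matches are found, so precision is 2/3, recall is 1/2, and the F-measure is 4/7.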
Using Metric Space Indexing for Complete and Efficient Record Linkage
TLDR
An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.
Large Scale Record Linkage in the Presence of Missing Data
TLDR
This work proposes a novel technique that can accurately link records even when QID values contain errors or variations, or are missing, and demonstrates that it can achieve high linkage quality even when the databases being linked contain substantial amounts of missing values and errors.
Active Learning Based Similarity Filtering for Efficient and Effective Record Linkage
TLDR
This paper presents a novel approach that, based on the expected number of true matches between two databases, applies active learning to remove compared record pairs that are likely non-matches before a computationally expensive classification or clustering algorithm is employed to classify record pairs.
Evaluation measure for group-based record linkage
TLDR
This work highlights the shortcomings of traditional evaluation measures and proposes a novel method to evaluate clustering quality in the context of group-based record linkage, consisting of seven categories which reflect how each record was predicted, providing more detailed information about the quality of the linkage result.
Maximum Entropy classification for record linkage
TLDR
This paper approaches record linkage as a classification problem and adapts the maximum entropy classification method from text mining to record linkage, in both the supervised and unsupervised settings of machine learning.
A scalable privacy-preserving framework for temporal record linkage
TLDR
A scalable framework for privacy-preserving temporal record linkage that can link different databases while protecting both the sensitive data in these databases and the privacy of the individuals whose records are being linked.
Informativeness-Based Active Learning for Entity Resolution
TLDR
This work proposes a novel active learning approach that does not require any prior knowledge about true matches and that is independent of the learning method used, and can outperform previous active learning approaches for entity resolution.
Developing a Temporal Bibliographic Data Set for Entity Resolution
TLDR
This paper describes the preparation of a temporal data set based on author profiles extracted from the Digital Bibliography and Library Project (DBLP) using the Microsoft Academic Graph to link temporal affiliation information for DBLP authors.
Robust Temporal Graph Clustering for Group Record Linkage
TLDR
A novel temporal clustering approach aimed at linking records of the same group (such as all births by the same mother) where temporal constraints need to be enforced and an iterative merging step which considers temporal constraints to obtain accurate clustering results are presented.
Temporal graph-based clustering for historical record linkage
TLDR
An existing clustering technique for record linkage is extended by incorporating temporal constraints that must hold between births by the same mother, and a novel greedy temporal clustering technique is proposed.

References

Iterative Automated Record Linkage Using Mixture Models
The goal of record linkage is to link quickly and accurately records that correspond to the same person or entity. Whereas certain patterns of agreements and disagreements on variables are more
Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection / Peter Christen
TLDR
This book helps researchers as well as students specializing in data quality or data matching aspects to familiarize themselves with recent research advances and to identify open research challenges in the area of data matching.
A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication
  • P. Christen
  • Computer Science
    IEEE Transactions on Knowledge and Data Engineering
  • 2012
TLDR
A survey of 12 variations of 6 indexing techniques for record linkage and deduplication aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality is presented.
Efficient Entity Resolution with Adaptive and Interactive Training Data Selection
TLDR
This work proposes an approach for training data selection for ER that exploits the cluster structure of the weight vectors calculated from compared record pairs, and adaptively selects an optimal number of informative training examples for manual labeling based on a user defined sampling error margin.
A method for calibrating false-match rates in record linkage
Abstract Specifying a record-linkage procedure requires both (1) a method for measuring closeness of agreement between records, typically a scalar weight, and (2) a rule for deciding when to classify
A taxonomy of privacy-preserving record linkage techniques
TLDR
This paper presents an overview of techniques that allow the linking of databases between organizations while at the same time preserving the privacy of these data, and presents a taxonomy of PPRL techniques to characterize these techniques along 15 dimensions.
Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering
  • Jared Murray
  • Computer Science, Mathematics
    J. Priv. Confidentiality
  • 2015
TLDR
The implications of indexing, blocking and filtering within the popular Fellegi-Sunter framework are reviewed, and a new model to account for particular forms of indexing and filtering is proposed.
Development and user experiences of an open source data cleaning, deduplication and record linkage system
TLDR
An overview of the Febrl (Freely Extensible Biomedical Record Linkage) system is provided, and the results of a recent survey of Febrl users are discussed.
Quality and Complexity Measures for Data Linkage and Deduplication
TLDR
An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented and it is shown that measures in the space of record pair comparisons can produce deceptive accuracy results.
A Theory for Record Linkage
Abstract A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical