A hierarchical Bayesian approach to record linkage and population size problems

@article{Tancredi2011AHB,
  title={A hierarchical Bayesian approach to record linkage and population size problems},
  author={Andrea Tancredi and Brunero Liseo},
  journal={The Annals of Applied Statistics},
  year={2011},
  volume={5},
  pages={1553--1585}
}
We propose and illustrate a hierarchical Bayesian approach for matching statistical records observed on different occasions. We show how this model can be profitably adopted both in record linkage problems and in capture--recapture setups, where the size of a finite population is the real object of interest. There are at least two important differences between the proposed model-based approach and the current practice in record linkage. First, the statistical model is built up on the actually… 

Practical Bayesian Inference for Record Linkage
TLDR
A new computational approach is proposed, providing both a fast algorithm for deriving point estimates of the linkage structure that properly account for one-to-one matching and a restricted MCMC algorithm that samples from an approximate posterior distribution.
Regression analysis with linked data: problems and possible solutions
TLDR
The record linkage process is framed into a formal statistical model which comprises both the matching variables and the other variables included at the inferential stage, and this feedback effect is both essential to eliminate potential biases that otherwise would characterize the resulting linked data inference and able to improve record linkage performances.
Scalable Bayesian Record Linkage
TLDR
A new computationally efficient and scalable approach is proposed, providing both a fast algorithm for generating a point estimate of the linkage structure that properly accounts for one-to-one matching and a restricted MCMC algorithm that samples from an approximate posterior distribution.
Accounting for matching uncertainty in T-stage capture-recapture models for population size estimation
In this paper we illustrate a Bayesian hierarchical modelling approach for estimating the size of a closed population by capture-recapture models when the number of recaptured individuals is unknown
A Bayesian Approach to Graphical Record Linkage and Deduplication
TLDR
An unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files, which lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm and overcomes many obstacles encountered by previous record linkage approaches.
SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication
TLDR
A novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files, to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records.
Bayesian Estimation of Bipartite Matchings for Record Linkage
TLDR
This paper argues that the independence assumption in the matching statuses of record pairs is unreasonable, proposes partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved, and demonstrates the advantages of these methods by merging two datafiles on casualties from the civil war of El Salvador.
Scaling Bayesian Probabilistic Record Linkage with Post-Hoc Blocking: An Application to the California Great Registers
TLDR
A new computational approach is proposed, providing both a fast algorithm for deriving point estimates of the linkage structure that properly account for one-to-one matching and a restricted MCMC algorithm that samples from an approximate posterior distribution.
Bayesian Parametric and Nonparametric Inference for Multiple Record Linkage
TLDR
This work proposes Bayesian parametric and nonparametric methodology for multiple files in which the fields are regarded as independent and estimates the posterior distribution of this linkage structure via a hybrid Markov chain Monte Carlo (MCMC) algorithm.

References

SHOWING 1-10 OF 70 REFERENCES
MODELLING ISSUES IN RECORD LINKAGE: A BAYESIAN PERSPECTIVE
TLDR
This paper uses standard MCMC algorithms to derive the marginal posterior distribution of a matrix-valued parameter which indicates the “configuration” of matches between the two lists and proposes a fully Bayesian approach.
Advances in Record Linkage Theory: Hierarchical Bayesian Record Linkage Theory
TLDR
Bayesian record linkage alternatives are developed that allow parameters to vary by file blocks (similar to geographical blocks in census applications), build 1-1 matching between files into the likelihood itself, and compute posterior distributions of parameters and linkage indicators.
A method for calibrating false-match rates in record linkage
TLDR
A general strategy for accurately estimating false-match rates for each possible cutoff weight, using a model where the distribution of observed weights is viewed as a mixture of weights for true matches and weights for false matches.
Bayesian estimation of population size via linkage of multivariate normal data sets
TLDR
The proposed methodology can be profitably adopted in record linkage and in capture-recapture problems where the size of a finite population is the main object of interest and the number of “recaptured” individuals is unknown.
Regression Analysis With Linked Data
Record linkage, or exact matching, can be used to join together two files that contain information on the same individuals but lack unique personal identification codes. The possibility of errors in
Iterative Automated Record Linkage Using Mixture Models
TLDR
A method is proposed and illustrated that uses marginal information in the database to select mixture models, identifies sets of records for clerks to review based on the models and marginal information, incorporates clerically reviewed data into estimates of model parameters, and classifies pairs as links, nonlinks, or in need of further clerical review.
Classical multilevel and Bayesian approaches to population size estimation using multiple lists
TLDR
This framework encompasses both the traditional log‐linear approach and various elements from the full Rasch model and explores extensions allowing for interactions between the Rasch and log‐linear portions of the models in both the classical and the Bayesian contexts.
Data quality and record linkage techniques
TLDR
This book helps practitioners gain a deeper understanding, at an applied level, of the issues involved in improving data quality through editing, imputation, and record linkage through the Fellegi-Holt edit-imputation model, the Little-Rubin multiple-imputation scheme, and the Fellegi-Sunter record linkage model.
Uncovering a Latent Multinomial: Analysis of Mark–Recapture Data with Misidentification
TLDR
This work presents a general framework for Bayesian analysis of categorical data arising from a latent multinomial distribution and illustrates the approach using two data sets with individual misidentification, one simulated, the other summarizing recapture data for salamanders based on natural marks.
Bayesian alignment using hierarchical models, with applications in protein bioinformatics
TLDR
This paper introduces hierarchical models for shape analysis tasks, in which the points in the configurations are either unlabelled or have at most a partial labelling constraining the matching, and in which some points may only appear in one of the configurations.