(Almost) All of Entity Resolution

@article{Binette2020AlmostAO,
  title={(Almost) All of Entity Resolution},
  author={Olivier Binette and Rebecca C. Steorts},
  journal={Science Advances},
  year={2020},
  volume={8 12},
  pages={eabi8021}
}
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, commonly known as structured entity resolution (record linkage…

d-blink: Distributed End-to-End Bayesian Entity Resolution

A principled model for scalable Bayesian ER, called “distributed Bayesian linkage” or d-blink, is proposed, which jointly performs blocking and ER without compromising posterior correctness.

Multifile Partitioning for Record Linkage and Duplicate Detection

A novel partition representation is used to propose a structured prior for partitions that can flexibly incorporate prior information about the data collection processes of the datafiles, and previous models for comparison data are extended to accommodate the multifile setting.

Regression with linked datasets subject to linkage error

An account of developments in methodology for dealing with linkage errors in regression analysis with linked datasets is given, with an emphasis on recent approaches and their connection to the so-called “Broken Sample” problem.

A Knowledge Graph Embeddings based Approach for Author Name Disambiguation using Literals

A novel framework, Literally Author Name Disambiguation (LAND), is presented, which utilizes Knowledge Graph Embeddings (KGEs) using multimodal literal information generated from these KGs and shows competitive performances on a challenging benchmark such as AMiner.

Big Data is not the New Oil: Common Misconceptions about Population Data

A diverse range of misconceptions relevant for anybody capturing, processing, linking, or analysing population data is discussed; these misconceptions are due to the social nature of data collections and are therefore missed by purely technical accounts of data processing.

Bayesian Causal Inference with Bipartite Record Linkage

It is shown that the hierarchical model can improve the accuracy of estimated treatment effects, as well as the record linkages, compared to the two-stage modeling option; the approach is illustrated using a causal study of the effects of debit card possession on household spending.

Regularization for Shuffled Data Problems via Exponential Family Priors on the Permutation Group

A flexible exponential family prior on the permutation group is proposed for this purpose; it can be used to integrate various structures such as sparse and locally constrained shuffling, and it compares favorably to competing methods.

Locality Sensitive Hashing with Temporal and Spatial Constraints for Efficient Population Record Linkage

A novel method to improve the scalability and robustness of min-hash LSH for linking large population databases by exploiting temporal and spatial information available in personal data, and by filtering record pairs based on block sizes and min-hash similarity is presented.

Random Partition Models for Microclustering Tasks

This work proposes a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties and establishes theoretical properties of the resulting class of priors, where they characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size.

On the reliability of multiple systems estimation for the quantification of modern slavery

The quantification of modern slavery has received increased attention recently as organizations have come together to produce global estimates, where multiple systems estimation (MSE) is often used

References

Efficient Interactive Training Selection for Large-Scale Entity Resolution

Experiments show that manual labeling efforts can be significantly reduced for training an ER classifier without compromising matching quality, and a noisy oracle where manual labeling might be incorrect is considered.

Crowdsourcing Algorithms for Entity Resolution

This paper considers the problem of designing optimal strategies for asking questions to humans that minimize the expected number of questions asked, and analyzes several strategies that were claimed to be "optimal" for this problem in recent work but can perform arbitrarily badly in theory.

ZeroER: Entity Resolution using Zero Labeled Examples

This paper investigates an important problem that vexes practitioners: is it possible to design an effective algorithm for ER that requires zero labeled examples, yet can achieve performance comparable to supervised approaches? It presents an approach dubbed ZeroER.

d-blink: Distributed End-to-End Bayesian Entity Resolution

A principled model for scalable Bayesian ER, called “distributed Bayesian linkage” or d-blink, is proposed, which jointly performs blocking and ER without compromising posterior correctness.

Comparative Analysis of Approximate Blocking Techniques for Entity Resolution

This work considers 17 state-of-the-art blocking methods and uses 6 popular real datasets to examine the robustness of their internal configurations and their relative balance between effectiveness and time efficiency, and investigates their scalability over a corpus of 7 established synthetic datasets.

Active Learning for Probabilistic Record Linkage

An active learning algorithm is proposed for PRL, which efficiently incorporates human judgement into the process and significantly improves PRL’s performance at the cost of manually labelling a small number of records.

A Solution to the Problem of Linking Multivariate Documents

Some aspects of classifying pairs of documents into one of two populations are considered, when their items are identifying information and each item can take on three distinct values: correct, incorrect, or missing.

The merge/purge problem for large databases

This paper details the sorted neighborhood method that is used by some to solve merge/purge, presents experimental results that demonstrate this approach may work well in practice but at great expense, and shows a means of improving the accuracy of the results based upon a multi-pass approach.

Performance Bounds for Graphical Record Linkage

This work critically assesses performance bounds using the Kullback-Leibler (KL) divergence under a Bayesian record linkage framework, and provides an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity.

Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records

A fast and scalable algorithm is developed to implement a canonical model of probabilistic record linkage that has many advantages over deterministic methods frequently used by social scientists.
...