Pay-as-you-go Configuration of Entity Resolution

@article{Maskat2016PayasyougoCO,
  title={Pay-as-you-go Configuration of Entity Resolution},
  author={Ruhaila Maskat and Norman W. Paton and Suzanne M. Embury},
  journal={Trans. Large Scale Data Knowl. Centered Syst.},
  year={2016},
  volume={29},
  pages={40-65}
}
Entity resolution, which seeks to identify records that represent the same entity, is an important step in many data integration and data cleaning applications. However, entity resolution is challenging both in terms of scalability (all-against-all comparisons are computationally impractical) and result quality (syntactic evidence on record equivalence is often equivocal). As a result, end-to-end entity resolution proposals involve several stages, including blocking to efficiently identify…
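
As a rough illustration of the scalability point in the abstract, the sketch below contrasts all-against-all pair generation with a simple key-based blocking step. It is not the method of the paper: the toy records, the function names, and the blocking key (first letter of the last name token) are all assumptions chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Every unordered pair of record ids: n * (n - 1) / 2 comparisons."""
    return set(combinations(sorted(records), 2))


def blocked_pairs(records, key):
    """Candidate pairs restricted to records that share a blocking key."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def last_name_initial(name):
    """Hypothetical blocking key: first letter of the last token, lowercased."""
    return name.split()[-1][0].lower()


# Hypothetical toy records (id -> name), not taken from the paper.
records = {
    1: "John Smith",
    2: "jon smith",
    3: "Mary Jones",
    4: "M. Jones",
    5: "Ann Lee",
}

if __name__ == "__main__":
    print(len(all_pairs(records)))                           # 10 comparisons
    print(sorted(blocked_pairs(records, last_name_initial)))  # [(1, 2), (3, 4)]
```

With five records, the exhaustive approach yields 10 comparisons, while this particular key leaves only 2 candidate pairs. A poorly chosen key would instead separate true matches into different blocks, which is one reason the configuration of such stages matters.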

Benchmarking Filtering Techniques for Entity Resolution

TLDR
This work performs the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets, and considers a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets.

A Survey on Blocking Technology of Entity Resolution

TLDR
This paper summarizes and analyzes all classic blocking methods with emphasis on different blocking construction and optimization techniques, and finds that traditional blocking ER methods which depend on the fixed schema may not work in the context of highly heterogeneous information spaces.

How to reduce the search space of Entity Resolution: with Blocking or Nearest Neighbor search?

TLDR
This work performs the first systematic experimental study that investigates the relative performance of the main methods per type over 10 real-world datasets, demonstrating the superiority of blocking workflows and string similarity joins.

Blocking and Filtering Techniques for Entity Resolution

TLDR
A large number of relevant works are reviewed under two different but related frameworks, Blocking and Filtering, and the most promising directions for future work in the field are identified.

Entity Resolution: Past, Present and Yet-to-Come

TLDR
This tutorial presents the ER generations by discussing past, present, and yet-to-come mechanisms, and outlines the corresponding ER workflow along with the state-of-the-art methods per workflow step.

Blocking and Filtering Techniques for Entity Resolution: A Survey

TLDR
This survey organized the bulk of works in the field into Blocking, Filtering and hybrid techniques, facilitating their understanding and use, and provided in-depth coverage of each category, further classifying the corresponding works into novel sub-categories.

A Survey of Blocking and Filtering Techniques for Entity Resolution

TLDR
This survey organized the bulk of works in the field into Blocking, Filtering and hybrid techniques, facilitating their understanding and use, and provided in-depth coverage of each category, further classifying the corresponding works into novel sub-categories.

Cost-effective Variational Active Entity Resolution

TLDR
An entity resolution method is proposed that builds on the robustness conferred by deep autoencoders to reduce human-involvement costs, and it unveils a transferability property of the resulting model that can further reduce the cost of applying the approach to new datasets by means of transfer learning.

Multi-Source Spatial Entity Linkage

TLDR
A novel algorithm, referred to as SkyEx, is proposed that uses Pareto optimality to separate the pairs considered a match (positive class) from the rest (negative class). It provides the best trade-off between precision and recall and, consequently, the best F-measure compared to the existing baselines.

Feedback Driven Improvement of Data Preparation Pipelines

References

Pay-As-You-Go Entity Resolution

TLDR
This paper investigates how to maximize the progress of ER with a limited amount of work using “hints,” which give information on records that are likely to refer to the same real-world entity.

Crowdsourcing Algorithms for Entity Resolution

TLDR
This paper considers the problem of designing optimal strategies for asking questions to humans that minimize the expected number of questions asked, and analyzes several strategies that can be claimed as "optimal" for this problem in a recent work but can perform arbitrarily badly in theory.

Scalable entity matching computation with materialization

TLDR
A scalable EM algorithm that employs a pre-materialized structure that can identify the EM results with sub-linear cost and efficiently adapt to new rules by selectively accessing records using the materialized structure.

Evaluation of entity resolution approaches on real-world match problems

TLDR
It is found that some challenging resolution tasks such as matching product entities from online shops are not sufficiently solved with conventional approaches based on the similarity of attribute values.

Learning Blocking Schemes for Record Linkage

TLDR
This paper presents a machine learning approach to automatically learn effective blocking schemes and validate the approach with experiments that show the learned blocking schemes outperform the ad-hoc blocking schemes of non-experts and perform comparably to those manually built by a domain expert.

Large-scale linked data integration using probabilistic reasoning and crowdsourcing

TLDR
The ZenCrowd system uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud.

Framework for Evaluating Clustering Algorithms in Duplicate Detection

TLDR
This work uses Stringer to evaluate the quality of the clusters obtained from several unconstrained clustering algorithms used in concert with approximate join techniques, and reveals that some clustering algorithms that have never been considered for duplicate detection perform extremely well in terms of both accuracy and scalability.

Question Selection for Crowd Entity Resolution

TLDR
A probabilistic framework for ER is proposed that can be used to estimate how much ER accuracy is gained by asking each question, and to select the question with the highest expected accuracy.

Frameworks for entity matching: A comparison

CrowdER: Crowdsourcing Entity Resolution

TLDR
This work proposes a hybrid human-machine approach in which machines are used to do an initial, coarse pass over all the data, and people are used to verify only the most likely matching pairs, and develops a novel two-tiered heuristic approach for creating batched tasks.