Entity Matching with Active Monotone Classification

  • Yufei Tao
  • Published 27 May 2018
  • Computer Science
  • Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) in X × Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the correct verdict could not be determined without investigation efforts dedicated specifically to the two entities x and y involved. It is therefore important to design an algorithm that asks humans to look at only some pairs, and renders the…

Entity Matching with Quality and Error Guarantees
This article describes an algorithm that achieves the purpose of entity matching using the methodology of active monotone classification, and ensures an asymptotically optimal tradeoff between the number of pairs inspected and the number of mistakes made.
New Algorithms for Monotone Classification
It is proved that $\Omega(n)$ probes are necessary to find an optimal classifier even in one-dimensional space ($d=1$), and given an arbitrary $\epsilon > 0$, it is shown how to obtain a monotone classifier whose error is worse than the optimum by at most a $1 + \epsilon$ factor, while probing $\tilde{O}(w/\epsilon^2)$ labels.
Technical Perspective: Entity Matching with Quality and Error Guarantees
The challenge of entity matching is that of identifying when different data items refer to the same real-life entity when they come from different data sources that mention overlapping sets of entities.
Crowdsourced Collective Entity Resolution with Relational Match Propagation
This paper proposes a novel approach called crowdsourced collective ER, which iteratively asks human workers to label picked entity pairs and propagates the labeling information to their neighbors in distance and achieves superior accuracy with much less labeling.
Active Learning Based Similarity Filtering for Efficient and Effective Record Linkage
This paper presents a novel approach that, based on the expected number of true matches between two databases, applies active learning to remove compared record pairs that are likely non-matches before a computationally expensive classification or clustering algorithm is employed to classify record pairs.
Pattern Masking for Dictionary Matching
It is shown, through a reduction from the well-known $k$-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet.
Unsupervised Graph-based Entity Resolution for Accurate and Efficient Family Pedigree Search
A prototype application for automated family pedigree search that is based on unsupervised graph-based entity resolution techniques combined with approximate query matching and ranking methods to efficiently and accurately extract and visualise family pedigrees from searched birth or death certificates is presented.
Which conference is that? A case study in computer science
This article proposes a technique for the entity resolution of conferences based on the analysis of different semantic parts of their names, and presents the results of an investigation on a dataset of 42395 distinct computer science conference names excerpted from the DBLP computer science repository.
Effective Explanations for Entity Resolution Models
This paper proposes the CERTA approach, which builds on a probabilistic framework that aims at computing the explanations evaluating the outcomes produced by using perturbed copies of the input records, and produces both saliency explanations, which associate each attribute with a saliency score, and counterfactual explanations, which provide examples of values that can flip the prediction.


Active Sampling for Entity Matching with Guarantees
The main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sublinear label complexity (under certain distributional assumptions).
Efficient Entity Resolution with Adaptive and Interactive Training Data Selection
This work proposes an approach for training data selection for ER that exploits the cluster structure of the weight vectors calculated from compared record pairs, and adaptively selects an optimal number of informative training examples for manual labeling based on a user defined sampling error margin.
Evaluation of entity resolution approaches on real-world match problems
It is found that some challenging resolution tasks such as matching product entities from online shops are not sufficiently solved with conventional approaches based on the similarity of attribute values.
Frameworks for entity matching: A comparison
Active Learning Literature Survey
This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.
On active learning of record matching packages
This work considers the problem of learning a record matching package (classifier) in an active learning setting, and presents new algorithms for this problem that overcome limitations.
Agnostic active learning
Theory of Disagreement-Based Active Learning
Recent advances in the understanding of the theoretical benefits of active learning are described, along with their implications for the design of effective active learning algorithms.
A Practical and Effective Sampling Selection Strategy for Large Scale Deduplication
This article proposes a two-stage sampling selection strategy (T3S) that selects a reduced set of pairs to tune the deduplication process in large datasets and shows that T3S is able to reduce the labeling effort substantially while achieving a competitive or superior matching quality when compared with state-of-the-art deduplication methods in large datasets.
Interactive deduplication using active learning
This work presents the design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning and investigates various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.