Human-in-the-loop Data Integration

@article{Li2017HumanintheloopDI,
  title={Human-in-the-loop Data Integration},
  author={Guoliang Li},
  journal={Proc. VLDB Endow.},
  year={2017},
  volume={10},
  pages={2006--2017}
}
  • Published 1 August 2017
Data integration aims to integrate data from different sources and provide users with a unified view. However, data integration cannot be completely addressed by purely automated methods. We propose a hybrid human-machine data integration framework that harnesses human ability to address this problem, and apply it initially to the problem of entity matching. The framework first uses rule-based algorithms to identify possible matching pairs and then utilizes the crowd to refine these candidate…
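The two-stage pipeline sketched in the abstract (a cheap machine pass that generates candidate pairs, followed by crowd verification of only those candidates) can be illustrated roughly as follows. The string-similarity rule, the threshold, and the `ask_crowd` oracle are illustrative assumptions for this sketch, not the paper's actual algorithms:

```python
from difflib import SequenceMatcher

def candidate_pairs(records_a, records_b, threshold=0.6):
    """Machine pass: a cheap rule (string similarity on names)
    keeps only plausible matches as candidates."""
    candidates = []
    for a in records_a:
        for b in records_b:
            sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
            if sim >= threshold:
                candidates.append((a, b, sim))
    return candidates

def ask_crowd(a, b):
    """Stub for a crowdsourcing call; a real system would post a
    'do these refer to the same entity?' task to human workers."""
    return a["name"].lower() == b["name"].lower()  # placeholder oracle

def hybrid_match(records_a, records_b):
    """Crowd pass: only the machine-selected candidates are verified."""
    return [(a, b) for (a, b, _) in candidate_pairs(records_a, records_b)
            if ask_crowd(a, b)]

A = [{"name": "iPhone 14 Pro"}, {"name": "Galaxy S23"}]
B = [{"name": "iphone 14 pro"}, {"name": "Pixel 7"}]
print(hybrid_match(A, B))
```

In a real deployment `ask_crowd` would post tasks to a crowdsourcing platform; the point of the design is that the expensive human step runs only on the machine-pruned candidate set, not on all record pairs.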

Citations

Hike: A Hybrid Human-Machine Method for Entity Alignment in Large-Scale Knowledge Bases
TLDR
A novel hybrid human-machine framework for large-scale KB integration is proposed; the alignment problem is proved NP-hard, and greedy algorithms with an approximation ratio of 1 − 1/e are proposed to address it.
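The 1 − 1/e ratio quoted above is the classic guarantee of greedy selection for monotone submodular objectives such as maximum coverage. A minimal, generic sketch of that greedy pattern (the task sets below are made-up toy data, not Hike's actual formulation):

```python
def greedy_max_coverage(candidate_sets, k):
    """Greedy maximum coverage: repeatedly pick the set covering the most
    still-uncovered elements. For monotone submodular objectives this
    achieves the classic (1 - 1/e) approximation of the optimum."""
    covered, chosen = set(), []
    remaining = dict(candidate_sets)
    for _ in range(k):
        if not remaining:
            break
        # pick the set with the largest marginal gain
        name, members = max(remaining.items(),
                            key=lambda kv: len(kv[1] - covered))
        if not (members - covered):
            break  # nothing new can be covered
        chosen.append(name)
        covered |= members
        del remaining[name]
    return chosen, covered

# e.g. choosing 2 crowd tasks that together resolve the most entity pairs
tasks = {"t1": {1, 2, 3}, "t2": {3, 4}, "t3": {4, 5, 6, 7}}
chosen, covered = greedy_max_coverage(tasks, 2)
print(chosen)  # greedy picks t3 (4 new elements), then t1 (3 new)
```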
Human-In-The-Loop Document Layout Analysis
TLDR
The learning system from reinforcement learning is revisited and a sample-based agent update strategy is designed, which improves the agent's ability to accept new samples and thereby reduces costs.
PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema Matching
TLDR
PoWareMatch uses a deep learning mechanism to calibrate and filter human matching decisions according to the quality of a match; these decisions are then combined with algorithmic matching to generate better match results.
A partial-order-based framework for cost-effective crowdsourced entity resolution
TLDR
A cost-effective crowdsourced entity resolution framework is proposed that significantly reduces monetary cost while keeping quality high; error-tolerant techniques are developed to tolerate the errors introduced by the partial order and the crowd.
Toward a System Building Agenda for Data Integration (and Data Science)
TLDR
This paper argues that the data management community should devote far more effort to building data integration (DI) systems in order to truly advance the field, and proposes an agenda for building a new kind of DI system that addresses the limitations of current solutions.
NOAH: Creating Data Integration Pipelines over Continuously Extracted Web Data
TLDR
NOAH, an ongoing research project aiming to develop a system for semi-automatically creating end-to-end Web data processing pipelines, is presented; it is based on a novel hybrid human-machine learning approach in which the same types of questions can be posed interchangeably to human crowd workers and to automatic responders based on machine-learning models.
i-HUMO: An Interactive Human and Machine Cooperation Framework for Entity Resolution with Quality Guarantees
TLDR
i-HUMO is a major improvement over HUMO in that it is interactive: its process of human workload selection is optimized based on real-time risk analysis on human-labeled results as well as pre-specified machine metrics.
Entity resolution on-demand
TLDR
BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data, is proposed; it inherently supports top-k and stop-and-resume execution.
Effective and Efficient Data Cleaning for Entity Matching
TLDR
The proposed domain-independent cleaning framework aims to save human users' time by guiding them in cleaning the EM inputs in an attribute order that is as conducive to maximizing EM accuracy as possible, within a given constraint on the time they spend on cleaning.

References

Showing 1–10 of 86 references
Deco: declarative crowdsourcing
TLDR
This work presents Deco, a database system for declarative crowdsourcing, and describes Deco's data model, query language, and query processor, which uses a novel push-pull hybrid execution model to respect the Deco semantics while coping with the unique combination of latency, monetary cost, and uncertainty introduced by the crowdsourcing environment.
Source-aware Entity Matching: A Compositional Approach
TLDR
This work proposes viewing entity matching as a composition of basic steps into a "match execution plan", and analyzes formal properties of the plan space, and shows how to find a good match plan.
Human-in-the-Loop Challenges for Entity Matching: A Midterm Report
TLDR
This paper shows how the challenges of EM forced us to revise the authors' solution architecture, from a typical RDBMS-style architecture to a very human-centric one, in which human users are first-class objects driving the EM process, using tools at pain-point places.
Leveraging transitive relations for crowdsourced joins
TLDR
This paper proposes a hybrid transitive-relations-and-crowdsourcing labeling framework that aims to crowdsource the minimum number of pairs needed to label all candidate pairs; it proves the optimal labeling order and devises a parallel labeling algorithm to efficiently crowdsource the pairs in that order.
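The transitivity idea above (if A matches B and B matches C, then A matches C can be deduced without asking the crowd) can be sketched with a union-find structure. The pair ordering and the toy oracle here are illustrative assumptions, and negative deductions via transitivity are omitted for brevity:

```python
class UnionFind:
    """Minimal union-find with path halving, used to track match clusters."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def label_with_transitivity(candidate_pairs, crowd_says_match):
    """Ask the crowd only when a pair's answer cannot be deduced:
    pairs already in the same cluster are matches by transitivity."""
    uf, labels, asked = UnionFind(), {}, 0
    for a, b in candidate_pairs:
        if uf.find(a) == uf.find(b):
            labels[(a, b)] = True   # deduced, no crowd question needed
        else:
            asked += 1
            ans = crowd_says_match(a, b)
            labels[(a, b)] = ans
            if ans:
                uf.union(a, b)
    return labels, asked

# toy oracle: items match if they share the same letter prefix
oracle = lambda a, b: a[0] == b[0]
pairs = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"), ("a1", "b1")]
labels, asked = label_with_transitivity(pairs, oracle)
print(asked)  # 3 questions instead of 4: ("a1", "a3") is deduced
```

The savings grow with cluster size: for a cluster of n matching records, the crowd needs only n − 1 positive answers instead of all n(n−1)/2 pairs.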
CrowdER: Crowdsourcing Entity Resolution
TLDR
This work proposes a hybrid human-machine approach in which machines are used to do an initial, coarse pass over all the data and people are used to verify only the most likely matching pairs, and develops a novel two-tiered heuristic approach for creating batched tasks.
Magellan: Toward Building Entity Matching Management Systems
TLDR
Magellan is a new kind of EM system that provides how-to guides that tell users what to do in each EM scenario, step by step, and provides tools to help users execute these steps.
CrowdDB: answering queries with crowdsourcing
TLDR
The design of CrowdDB is described; a major change is that the traditional closed-world assumption for query processing does not hold for human input. Important avenues for future work in the development of crowdsourced query processing systems are outlined.
Reducing Uncertainty of Schema Matching via Crowdsourcing
TLDR
This work develops two novel approaches, namely "Single CCQ" and "Multiple CCQ", which adaptively select, publish, and manage the questions, and proposes frameworks and efficient algorithms to dynamically manage the CCQs in order to maximize uncertainty reduction within a limited budget of questions.
BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution
TLDR
It is demonstrated how "loose" schema information can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm.
Cost-Effective Crowdsourced Entity Resolution: A Partial-Order Approach
TLDR
A cost-effective crowdsourced entity resolution framework is proposed that significantly reduces monetary cost while keeping quality high; error-tolerant techniques are developed to tolerate the errors introduced by the partial order and the crowd.