Corpus ID: 20710423

Toward a System Building Agenda for Data Integration (and Data Science)

@article{Doan2017TowardAS,
  title={Toward a System Building Agenda for Data Integration (and Data Science)},
  author={AnHai Doan and A. Ardalan and Jeffrey R. Ballard and Sanjib Das and Yash Govind and Pradap Konda and Han Li and Erik Paulson and Paul Suganthan G. C. and Haojun Zhang},
  journal={IEEE Data Eng. Bull.},
  year={2017},
  volume={41},
  pages={35--46}
}
In this paper we argue that the data management community should devote far more effort to building data integration (DI) systems, in order to truly advance the field. Toward this goal, we make three contributions. First, we draw on our recent industrial experience to discuss the limitations of current DI systems. Second, we propose an agenda to build a new kind of DI system to address these limitations. These systems guide users through the DI workflow, step by step. They provide tools to… 


Entity Matching Meets Data Science: A Progress Report from the Magellan Project

It is argued why EM can be viewed as a special class of DS problems, and thus can benefit from system building ideas in DS, and how these ideas have been adapted to build pymatcher and cloudmatcher, EM tools for power users and lay users.

Magellan

It is argued why EM can be viewed as a special class of DS problems and thus can benefit from system building ideas in DS, and how these ideas have been adapted to build PyMatcher and CloudMatcher, sophisticated on-premise tools for power users and self-service cloud tools for lay users.

Big Data Semantics

It is argued that methods, principles, and perspectives developed by the Data Semantics community can significantly contribute to address Big Data challenges.

NOAH: Creating Data Integration Pipelines over Continuously Extracted Web Data

NOAH, an ongoing research project aiming to develop a system for semi-automatically creating end-to-end Web data processing pipelines, is presented. It is based on a novel hybrid human-machine learning approach in which the same types of questions can be posed interchangeably to human crowd workers and to automatic responders based on machine-learning models.

AlphaClean: Automatic Generation of Data Cleaning Pipelines

A framework called AlphaClean is proposed that rethinks parameter tuning for data cleaning pipelines. It is significantly more robust to straggling data cleaning methods and to redundancy in the data cleaning library, and can incorporate state-of-the-art cleaning systems such as HoloClean as cleaning operators.

Data from Multiple Web Sources: Crawling, Integrating, Preprocessing, and Designing Applications

This chapter addresses, in an integrated way, three issues in dealing with data coming from the Web, including handling missing and duplicate data, normalization, and data veracity.

Effective and Efficient Data Cleaning for Entity Matching

The proposed domain-independent cleaning framework aims to save human users' time, by guiding them in cleaning the EM inputs in an attribute order that is as conducive to maximizing EM accuracy as possible within a given constraint on the time they spend on cleaning.

Privacy Policy Question Answering Assistant: A Query-Guided Extractive Summarization Approach

This work proposes an automated privacy policy question answering assistant that extracts a summary in response to the input user query by paraphrasing to bring the style and language of the user’s question closer to the language of privacy policies.

Smurf: Self-Service String Matching Using Random Forests

Smurf, a self-service SM solution that reduces the labeling effort by 43-76% while achieving comparable F1 accuracy, is developed, together with a novel solution for efficiently executing, over two sets of strings, the random forest that Smurf learns via active learning with the lay user.

Special Topics in Multimedia, IoT and Web Technologies

This chapter presents the core of the proposal, including the extensions proposed to the NCM model, and discusses the viability of applying the model.

Technical Perspective: Toward Building Entity Matching Management Systems

Magellan is a new kind of EM system that provides how-to guides that tell users what to do in each EM scenario, step by step, and provides tools to help users execute these steps.

Magellan: Toward Building Entity Matching Management Systems over Data Science Stacks

This paper discusses the limitations of current EM systems, presents Magellan, a new kind of EM system that addresses these limitations, and proposes demonstration scenarios that show the promise of the Magellan approach.

Principles of Data Integration

The Power Behind the Throne: Information Integration in the Age of Data-Driven Discovery

    L. Haas
    SIGMOD Conference, 2015
The environment the lab is building is described as an integration hub for data, people and applications that allows users to import, explore and create data and knowledge, inspired by the work of others, while it captures the patterns of decision-making and the provenance of decisions.

The Data Civilizer System

Initial positive experiences with the preliminary Data Civilizer system are described, showing that it shortens the time and effort required to find, prepare, and analyze data.

Human-in-the-Loop Challenges for Entity Matching: A Midterm Report

This paper shows how the challenges of EM forced the authors to revise their solution architecture, from a typical RDBMS-style architecture to a very human-centric one, in which human users are first-class objects driving the EM process, using tools at pain-point places.

BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration

It is hoped that as more software packages are added to BIGGORILLA, it will become a one-stop resource for both researchers and industry practitioners, and will enable the community to advance the state of the art at a faster pace.

Smoke: Fine-grained Lineage at Interactive Speed

Smoke is introduced, an in-memory database engine in which neither lineage capture overhead nor lineage query performance needs to be compromised; it can meet the latency requirements of interactive visualizations and outperforms hand-written implementations of data profiling primitives.

Big data integration

    X. Dong and D. Srivastava
    2013 IEEE 29th International Conference on Data Engineering (ICDE), 2013
This seminar explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

From databases to dataspaces: a new abstraction for information management

This paper proposes dataspaces and their support systems as a new agenda for data management, which encompasses much of the work going on in data management today, while posing additional research objectives.