Corpus ID: 49417541

Making Open Data Transparent: Data Discovery on Open Data

Authors: Renée J. Miller, Fatemeh Nargesian, Erkang Zhu, Christina Christodoulakis, Ken Q. Pu, Periklis Andritsos
Journal: IEEE Data Eng. Bull.
Open Data plays a major role in open government initiatives. Governments around the world are adopting Open Data Principles promising to make their Open Data complete, primary, and timely. These properties make this data tremendously valuable. Open Data poses interesting new challenges for data integration research, and we take a look at one of those challenges, data discovery. How can we find new data sets within this ever-expanding sea of Open Data? How do we make this sea transparent?

Big Data Analytics Framework for Predictive Analytics using Public Data with Privacy Preserving

This work presents BDAP, a big data analytics framework for public data, and applies predictive analytics for community need that takes the spatial and temporal location of the data into account while addressing data issues such as missing values, privacy preservation, and predictive modeling.

Data Lake Management: Challenges and Opportunities

This tutorial considers how data lakes are introducing new problems including dataset discovery and how they are changing the requirements for classic problems including data extraction, data cleaning, data integration, data versioning, and metadata management.

PRIVEE: A Visual Analytic Workflow for Proactive Privacy Risk Inspection of Open Data

This work develops a visual analytic solution that enables data defenders to gain awareness about the disclosure risks in local, joinable data neighborhoods and demonstrates how PRIVEE can help emulate the attack strategies and diagnose disclosure risks through two case studies with data privacy experts.

Data Source Selection in Big Data Context

This paper proposes a novel methodology for the selectability of data sources that considers both the presence and the absence of users' preferences, and shows its capability to find the subset of relevant and reliable sources with the lowest cost.

Effective and Scalable Data Discovery with NextiaJD

NextiaJD proposes a ranking of candidate pairs according to their join quality, which is based on a novel similarity measure that considers both containment and cardinality proportions between candidate attributes.

Scalable Data Discovery Using Profiles

This work defines a novel notion of join quality that relies on a metric considering both the containment and cardinality proportions between candidate attributes, and implements this approach in a system called NextiaJD, and presents extensive experiments to show the predictive performance and computational efficiency of this method.

Open dataset discovery using context-enhanced similarity search

Experimental evaluation shows that context-enhanced similarity retrieval methods increase the findability of relevant datasets, thus improving retrieval recall, which is critical in dataset discovery scenarios.

Data Curation with Deep Learning

This vision paper explores how some of the fundamental innovations in deep learning could be leveraged to improve existing data curation solutions and to help build new ones and identifies interesting research opportunities.

Towards Scalable Data Discovery

This work defines a novel notion of join quality that relies on a metric considering both the containment and cardinality proportion between join candidate attributes, and is able to scale up to larger volumes of data.

Modular framework for similarity-based dataset discovery using external knowledge

This paper proposes a modular framework for rapid experimentation with methods for similarity-based dataset discovery that has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery.
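
Several of the citing works listed above describe join quality in terms of containment and cardinality proportion between candidate attributes. The following is a minimal sketch of what those two quantities could look like for a pair of columns. It is an illustration under stated assumptions, not the measure defined in any of those papers: the function names are hypothetical, cardinality proportion is assumed to be the ratio of the smaller to the larger distinct-value count, and the lexicographic ranking is a stand-in for whatever combination a real system uses.

```python
def containment(a, b):
    """Fraction of the distinct values of `a` that also appear in `b`."""
    sa, sb = set(a), set(b)
    if not sa:
        return 0.0
    return len(sa & sb) / len(sa)


def cardinality_proportion(a, b):
    """Ratio of the smaller distinct-value count to the larger one.

    Assumed form for illustration; the cited works may define it differently.
    """
    na, nb = len(set(a)), len(set(b))
    if max(na, nb) == 0:
        return 0.0
    return min(na, nb) / max(na, nb)


def rank_join_candidates(query, candidates):
    """Rank candidate columns against a query column, best first.

    Hypothetical ranking: sort by containment, then break ties by
    cardinality proportion.
    """
    scored = [
        (name, containment(query, col), cardinality_proportion(query, col))
        for name, col in candidates.items()
    ]
    return sorted(scored, key=lambda t: (t[1], t[2]), reverse=True)


if __name__ == "__main__":
    query = ["a", "b", "c", "d"]
    candidates = {
        "x": ["a", "b", "c", "d", "e", "f", "g", "h"],
        "y": ["a", "b"],
        "z": ["a", "b", "c", "d"],
    }
    for name, c, k in rank_join_candidates(query, candidates):
        print(name, c, k)
```

In this toy input, a column that fully contains the query values but has many extra distinct values ranks below one with the same containment and a closer distinct-value count, which is the intuition behind using both quantities.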



DataHub: Collaborative Data Science & Dataset Version Management at Scale

This work proposes a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and a platform, DATAHUB, that gives users the ability to perform collaborative data analysis building on this version control system.

Interactive Navigation of Open Data Linkages

The Toronto Open Data Search system offers users a highly interactive experience making unrelated (and unlinked) dynamic collections of datasets appear as a richly connected cloud of data that can be navigated and combined easily in real time.

LabBook: Metadata-driven social collaborative data analysis

The key insight is to collect and use more metadata about all elements of the analytic ecosystem by means of an architecture and user experience that reduce the cost of contributing such metadata.

Aurum: A Data Discovery System

This paper introduces a two-step process that scales to large datasets and requires only one pass over the data, avoiding overloading the source systems, and introduces a resource-efficient sampling signature (RESS) method that works by using only a small sample of the data.

A systematic review of open government data initiatives

Table Union Search on Open Data

This work defines the table union search problem and presents a probabilistic solution for finding tables that are unionable with a query table within massive repositories, and proposes a data-driven approach that automatically determines the best model to use for each pair of attributes.

Building Data Civilizer Pipelines with an Advanced Workflow Engine

The complete data preparation system, Data Civilizer, is presented, focusing on a new workflow engine, a superior system for entity matching and consolidation, and new cleaning tools.

The Data Civilizer System

Initial positive experiences are described that show the preliminary DATA CIVILIZER system shortens the time and effort required to find, prepare, and analyze data.

Goods: Organizing Google's Datasets

Goods is a project to rethink how structured datasets are organized at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them.

Data Integration for the Relational Web

Octopus is a system that combines search, extraction, data cleaning, and integration, enabling users to create new data sets from those found on the Web; it offers the user a set of best-effort operators that automate the most labor-intensive tasks.