Corpus ID: 49417541

Making Open Data Transparent: Data Discovery on Open Data

@article{Miller2018MakingOD,
  title={Making Open Data Transparent: Data Discovery on Open Data},
  author={Ren{\'e}e J. Miller and Fatemeh Nargesian and Erkang Zhu and Christina Christodoulakis and Ken Q. Pu and Periklis Andritsos},
  journal={IEEE Data Eng. Bull.},
  year={2018},
  volume={41},
  pages={59--70}
}
Open Data plays a major role in open government initiatives. Governments around the world are adopting Open Data Principles promising to make their Open Data complete, primary, and timely. These properties make this data tremendously valuable. Open Data poses interesting new challenges for data integration research and we take a look at one of those challenges, data discovery. How can we find new data sets within this ever-expanding sea of Open Data? How do we make this sea transparent?
Data Lake Management: Challenges and Opportunities
TLDR
This tutorial considers how data lakes are introducing new problems including dataset discovery and how they are changing the requirements for classic problems including data extraction, data cleaning, data integration, data versioning, and metadata management.
Data Source Selection in Big Data Context
TLDR
This paper proposes a novel methodology for the selectability of data sources, considering both the presence and the absence of users' preferences, and shows its capability to find the subset of relevant and reliable sources with the lowest cost.
Effective and Scalable Data Discovery with NextiaJD
TLDR
NextiaJD proposes a ranking of candidate pairs according to their join quality, which is based on a novel similarity measure that considers both containment and cardinality proportions between candidate attributes.
Scalable Data Discovery Using Profiles
TLDR
This work defines a novel notion of join quality that relies on a metric considering both the containment and cardinality proportions between candidate attributes, implements this approach in a system called NextiaJD, and presents extensive experiments to show the predictive performance and computational efficiency of this method.
Data Curation with Deep Learning
TLDR
This vision paper explores how some of the fundamental innovations in deep learning could be leveraged to improve existing data curation solutions and to help build new ones, and identifies interesting research opportunities.
Towards Scalable Data Discovery
TLDR
This work defines a novel notion of join quality that relies on a metric considering both the containment and cardinality proportion between join candidate attributes, and is able to scale up to larger volumes of data.
Evaluation Framework for Search Methods Focused on Dataset Findability in Open Data Catalogs
TLDR
A framework for evaluating findability of datasets, regardless of retrieval models used, is proposed and a proof of concept specification and evaluation on several similarity-based retrieval models and several dataset discovery scenarios within a catalog are presented.
Data Curation with Deep Learning [Vision]
TLDR
A thorough overview of the current deep learning landscape is provided, along with an exploration of how some of the fundamental innovations in deep learning could be leveraged to improve existing data curation solutions and to help build new ones.
Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks
TLDR
A compact graph-based representation that allows the specification of a rich set of relationships inherent in the relational world is described, and a method is proposed for deriving sentences from such a graph that effectively "describe" the similarity across elements in the two datasets.
Data Lake Organization
TLDR
Through a formal user study, it is shown that navigation can help users discover relevant tables that cannot be found by keyword search and that data lake organizations take into account the data lake distribution and outperform an existing hand-curated taxonomy and a common baseline organization.

References

SHOWING 1-10 OF 51 REFERENCES
DataHub: Collaborative Data Science & Dataset Version Management at Scale
TLDR
This work proposes a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and a platform, DATAHUB, that gives users the ability to perform collaborative data analysis building on this version control system.
Interactive Navigation of Open Data Linkages
TLDR
The Toronto Open Data Search system offers users a highly interactive experience making unrelated (and unlinked) dynamic collections of datasets appear as a richly connected cloud of data that can be navigated and combined easily in real time.
LabBook: Metadata-driven social collaborative data analysis
TLDR
The key insight is to collect and use more metadata about all elements of the analytic ecosystem by means of an architecture and user experience that reduce the cost of contributing such metadata.
Aurum: A Data Discovery System
TLDR
This paper introduces a two-step process which scales to large datasets and requires only one pass over the data, avoiding overloading the source systems, and introduces a resource-efficient sampling signature (RESS) method which works by only using a small sample of the data.
A systematic review of open government data initiatives
TLDR
The open government data life-cycle is described, with a focus on the publishing and consuming processes required within open government data initiatives; guidelines for publishing data and an integrated overview are also provided.
Table Union Search on Open Data
TLDR
This work defines the table union search problem and presents a probabilistic solution for finding tables that are unionable with a query table within massive repositories, and proposes a data-driven approach that automatically determines the best model to use for each pair of attributes.
Building Data Civilizer Pipelines with an Advanced Workflow Engine
TLDR
The complete data preparation system, Data Civilizer, is presented, focusing on a new workflow engine, a superior system for entity matching and consolidation, and new cleaning tools.
The Data Civilizer System
TLDR
Initial positive experiences are described that show the preliminary DATA CIVILIZER system shortens the time and effort required to find, prepare, and analyze data.
Goods: Organizing Google's Datasets
TLDR
Goods is a project to rethink how structured datasets are organized at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them.
Data Integration for the Relational Web
TLDR
Octopus is a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web by offering a set of best-effort operators that automate the most labor-intensive tasks.