Interactive Navigation of Open Data Linkages

Erkang Zhu, Ken Q. Pu, Fatemeh Nargesian, Renée J. Miller
Proc. VLDB Endow.
We developed Toronto Open Data Search to support the ad hoc, interactive discovery of connections, or linkages, between datasets. It can be used to navigate the open data cloud efficiently. Our system consists of three parts: a user interface provided by a Web application; a scalable backend infrastructure that supports navigational queries; and a dynamic repository of open data tables. Our system uses LSH Ensemble, an efficient index structure, to compute linkages (attributes in two…

RONIN: Data Lake Exploration

This demonstration presents RONIN, a tool that enables user exploration of a data lake by seamlessly integrating the two common modalities of discovery: dataset search and navigation of a hierarchical structure.

Data Lake Organization

Through a formal user study, it is shown that navigation can help users discover relevant tables that cannot be found by keyword search and that data lake organizations take into account the data lake distribution and outperform an existing hand-curated taxonomy and a common baseline organization.

Turning Open Government Data Portals into Interactive Databases

This thesis presents Governor, a web application developed to make open data tables more accessible to end users in several ways. It provides a set of features that summarize the provenance of integrated tables, allowing users and their collaborators to easily trace values in integrated tables back to the original tables in the open government data portal (OGDP).

Optimizing Organizations for Navigating Data Lakes

It is shown that navigation can help users discover relevant tables that cannot be found by keyword search. In this study, 42% of users preferred navigation and 58% preferred keyword search, suggesting that the two are complementary and both useful modalities for data discovery in data lakes.

Organizing Data Lakes for Navigation

A new probabilistic model of how users interact with an organization is presented and an approximate algorithm for the data lake organization problem is proposed that can help users find relevant tables that cannot be found by keyword search.

Open Data Integration

A new paradigm for thinking about integration is introduced in which the focus is on data discovery, specifically highly efficient, internet-scale discovery that is driven by data analysis needs.

KTabulator: Interactive Ad hoc Table Creation using Knowledge Graphs

KTabulator is an interactive system to effectively extract, build, or extend ad hoc tables from large corpora, by leveraging their computerized structures in the form of knowledge graphs.

A Semantic Data Lake Model for Analytic Query-Driven Discovery

A semantic model for a Data Lake is introduced that aims to support data discovery and integration in data analytics scenarios and is suited to identifying the sources and the required transformation steps according to the analytical request.

Auctus: A Dataset Search Engine for Data Discovery and Augmentation

This demo presents the ongoing efforts to develop a dataset search engine tailored for data augmentation, named Auctus, which automatically discovers datasets on the Web and, different from existing dataset search engines, infers consistent metadata for indexing and supports join and union search queries.

Search and Join Algorithms for Tables in Data Lakes

This thesis describes two problems in managing data lakes and proposes a technique that generates transformations, without human input, for joining tables with different formats on the join columns, which makes data lakes more searchable and usable and allows data scientists to work efficiently.



Discovering Linkage Points over Web Data

The basic schema-matching step is replaced with a more complex instance-based schema analysis and linkage discovery, and it is shown that even attributes with different meanings can sometimes be useful in aligning data.

Finding related tables

This work considers the problem of finding related tables in a large corpus of heterogeneous tables and proposes a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union.

A Large Public Corpus of Web Tables containing Time and Context Metadata

A large public corpus of Web tables which contains over 233 million tables and has been extracted from the July 2015 version of the CommonCrawl is presented to provide a common ground for evaluating Web table systems.

Mining of Massive Datasets

This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets, and explains the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing.

LSH Ensemble: Internet-Scale Domain Search

It is proved that there exists an optimal partitioning for any data distribution, and that for datasets following a power-law distribution, as observed in Open Data and Web data corpora, it can be approximated using equi-depth partitioning.
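
To make the term concrete, the following is a minimal sketch of equi-depth partitioning over domain sizes: each partition holds roughly the same number of domains, so size ranges are narrow where domains are dense and wide in the tail of a power-law distribution. The function names and the sample sizes are hypothetical, and the sketch covers only the partitioning step, not the per-partition index that LSH Ensemble builds.

```python
def equi_depth_partitions(domain_sizes, num_partitions):
    """Split domain sizes into partitions with (nearly) equal item counts.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    ordered = sorted(domain_sizes)
    n = len(ordered)
    partitions = []
    for i in range(num_partitions):
        # Integer arithmetic spreads any remainder evenly across partitions.
        start = i * n // num_partitions
        end = (i + 1) * n // num_partitions
        if start < end:
            partitions.append(ordered[start:end])
    return partitions


def partition_bounds(partitions):
    """Return the (smallest, largest) domain size covered by each partition."""
    return [(p[0], p[-1]) for p in partitions]


if __name__ == "__main__":
    # Made-up, heavily skewed domain sizes (many small, few very large).
    sizes = [1, 1, 2, 2, 3, 5, 8, 20, 100, 1000, 5000, 100000]
    for low, high in partition_bounds(equi_depth_partitions(sizes, 4)):
        print(f"sizes {low}..{high}")
```

With skewed sizes like these, the first partitions span only a few values while the last one spans several orders of magnitude, which is the behavior equi-depth is chosen for.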

Approximate nearest neighbors: towards removing the curse of dimensionality

Two algorithms for the approximate nearest neighbor problem in high-dimensional spaces are presented, which require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d.

On the resemblance and containment of documents

A. Broder
Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), 1997
The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
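
The sampling idea can be sketched with min-wise hashing, one common way to realize it: each set is summarized independently by the minimum hash value under several hash functions, and the fraction of matching minima estimates the resemblance (Jaccard similarity) of two sets. This is an illustrative sketch, not necessarily the exact scheme in the paper; the function names, the use of salted `blake2b` as the hash family, and the default of 256 hash functions are all assumptions.

```python
import hashlib


def _hash(item, seed):
    # Assumption: a salted blake2b digest stands in for a random hash function.
    digest = hashlib.blake2b(
        str(item).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=256):
    """Summarize one set on its own: keep the minimum hash per hash function."""
    items = set(items)
    if not items:
        raise ValueError("signature is undefined for an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_resemblance(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_resemblance(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = set(range(0, 300))
    doc_b = set(range(100, 400))
    estimate = estimate_resemblance(
        minhash_signature(doc_a), minhash_signature(doc_b)
    )
    print("exact:", exact_resemblance(doc_a, doc_b))
    print("estimate:", estimate)
```

Because each signature depends only on its own set, signatures can be computed once per document and compared later without revisiting the original data.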