Organizing Data Lakes for Navigation

  • Fatemeh Nargesian, Ken Q. Pu, Erkang Zhu, Bahar Ghadiri Bashardoost, Renée J. Miller
  • Published 29 May 2020
  • Computer Science
  • Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
We consider the problem of creating an effective navigation structure over a data lake. We define an organization as a navigation graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We propose the data lake organization problem as the problem of finding an organization that allows a user to most effectively navigate a data lake. We present a new probabilistic model of how users interact with an organization and… 
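The organization described in the abstract can be illustrated with a minimal sketch: nodes are sets of attributes, and an edge links a node to another node whose attribute set is a proper subset of its own. The node names, attribute names, and the choice to keep only direct subset edges (skipping pairs that have another node in between) are assumptions made for illustration, not the paper's construction.

```python
# Hypothetical toy organization: each node is a named set of attributes.
# All names here are illustrative and do not come from the paper.
nodes = {
    "root": frozenset({"city.name", "city.population",
                       "transit.route", "transit.stop"}),
    "city": frozenset({"city.name", "city.population"}),
    "transit": frozenset({"transit.route", "transit.stop"}),
    "city_name": frozenset({"city.name"}),
}


def subset_edges(nodes):
    """Return sorted (parent, child) pairs where the child's attribute set
    is a proper subset of the parent's and no other node lies strictly
    between the two (an assumption of this sketch)."""
    edges = []
    for parent, p_attrs in nodes.items():
        for child, c_attrs in nodes.items():
            if c_attrs < p_attrs:  # proper-subset test on frozensets
                between = any(c_attrs < m < p_attrs for m in nodes.values())
                if not between:
                    edges.append((parent, child))
    return sorted(edges)


def children(edges, node):
    """List the nodes a user could step to from `node`."""
    return [c for p, c in edges if p == node]


edges = subset_edges(nodes)
print(edges)
print(children(edges, "root"))
```

Navigation in this sketch starts at `root` and repeatedly picks a child, narrowing the set of attributes at each step. The probabilistic user model mentioned in the abstract is not modeled here.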


RONIN: Data Lake Exploration

This demonstration presents RONIN, a tool that lets users explore a data lake by seamlessly integrating the two common modalities of discovery: data set search and navigation of a hierarchical structure.

Towards Schema Inference for Data Lakes

This paper makes use of approximate indexes that can be used for data discovery to inform the inference of a schema for a data lake, consisting of entity types and the relationships between them, and identifies candidate entity types by clustering similar data sets from the data lake.

Data lake concept and systems: a survey

This survey reviews the development, definition, and architectures of data lakes and classifies existing data lake systems by the functions they provide, making it a useful technical reference for designing, implementing, and applying data lakes.

Towards Learned Metadata Extraction for Data Lakes

This paper reports the results of a study applying Sato, a recent approach based on deep learning, to a real-world data set. It proposes a new direction based on weak supervision and presents results from an initial prototype that generates labeled training data with little manual effort, with the goal of improving the performance of learned semantic type extraction approaches on new, unseen data sets.

Fast Dataset Search with Earth Mover's Distance

This paper proposes a Dual-Bound Filtering (DBF) framework to accelerate spatial dataset search that uses the Earth Mover's Distance (EMD) to measure the similarity between datasets.

Scalable Data Discovery Using Profiles

This work defines a novel notion of join quality that relies on a metric considering both the containment and cardinality proportions between candidate attributes, and implements this approach in a system called NextiaJD, and presents extensive experiments to show the predictive performance and computational efficiency of this method.

WebLens: Towards Interactive Large-scale Structured Data Profiling

WebLens-trained models significantly outperform 20 people at constructing metadata profiles for 10 objects from different domains, and they significantly simplify access to large-scale structured datasets for both data scientists and end users.

A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science

This paper will showcase how KGLac facilitates data discovery and enrichment while developing an ML pipeline to evaluate potential gender salary bias in IT jobs, and harness a broad range of Machine Learning (ML) approaches with GLac to enable automatic graph learning for advanced and semantic data discovery.

Automatic Tag Recommendation for the UN Humanitarian Data Exchange

An approach for automatic tag recommendation for dataset repositories is developed, and the integration of the model is demonstrated in the Humanitarian Data Exchange, a real-world dataset repository in the social and humanitarian domains.

Software Foundations for Data Interoperability and Large Scale Graph Data Analytics: 4th International Workshop, SFDI 2020, and 2nd International Workshop, LSGDA 2020, held in Conjunction with VLDB 2020, Tokyo, Japan, September 4, 2020, Proceedings

Three attribute diversified community models are introduced in which attribute diversification takes different roles for presenting objective, query requirement, and constraint in order to find communities that are both structure and attribute cohesive.



Data Lake Organization

Through a formal user study, it is shown that navigation can help users discover relevant tables that cannot be found by keyword search and that data lake organizations take into account the data lake distribution and outperform an existing hand-curated taxonomy and a common baseline organization.

Interactive Navigation of Open Data Linkages

The Toronto Open Data Search system offers users a highly interactive experience making unrelated (and unlinked) dynamic collections of datasets appear as a richly connected cloud of data that can be navigated and combined easily in real time.

Facet discovery for structured web search: a query-log mining approach

This paper models user faceted-search behavior using the intersection of web query logs with existing structured data and presents an automated solution that elicits user preferences on attributes and values, employing disambiguation techniques that range from simple keyword matching to more sophisticated probabilistic models.

Table Union Search on Open Data

This work defines the table union search problem and presents a probabilistic solution for finding tables that are unionable with a query table within massive repositories, and proposes a data-driven approach that automatically determines the best model to use for each pair of attributes.

JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes

The new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets and outperforms the state-of-the-art overlap set similarity search techniques on data lakes.

The Data Civilizer System

Initial positive experiences are described that show the preliminary DATA CIVILIZER system shortens the time and effort required to find, prepare, and analyze data.

Aurum: A Data Discovery System

This paper introduces a two-step process that scales to large datasets and requires only one pass over the data, avoiding overloading the source systems, and introduces a resource-efficient sampling signature (RESS) method that works using only a small sample of the data.

Being Bayesian About Network Structure. A Bayesian Approach to Structure Discovery in Bayesian Networks

This paper shows how to efficiently compute a sum over the exponential number of networks that are consistent with a fixed order over network variables, and uses this result as the basis for an algorithm that approximates the Bayesian posterior of a feature.

Recovering Semantics of Tables on the Web

This paper describes a system that attempts to recover the semantics of tables by enriching each table with additional annotations, leveraging a database of class labels and relationships automatically extracted from the Web.

Answering Table Queries on the Web using Column Keywords

The design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns is presented and a novel query segmentation model for matching keywords to table columns is defined.