Organizing Data Lakes for Navigation

@inproceedings{Nargesian2020OrganizingDL,
  title={Organizing Data Lakes for Navigation},
  author={Fatemeh Nargesian and Ken Q. Pu and Erkang Zhu and Bahar Ghadiri Bashardoost and Ren{\'e}e J. Miller},
  booktitle={Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
  year={2020}
}
  • Fatemeh Nargesian, Ken Q. Pu, Erkang Zhu, Bahar Ghadiri Bashardoost, Renée J. Miller
  • Published 29 May 2020
  • Computer Science
  • Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
We consider the problem of creating an effective navigation structure over a data lake. We define an organization as a navigation graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We propose the data lake organization problem as the problem of finding an organization that allows a user to most effectively navigate a data lake. We present a new probabilistic model of how users interact with an organization and…
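The definition of an organization in the abstract (nodes as sets of attributes, edges as subset relationships) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `subset_edges`, the node and attribute names, the parent-to-child edge direction, and the choice to keep only direct subset links by skipping any pair that has another node strictly in between.

```python
from typing import Dict, FrozenSet, List, Tuple


def subset_edges(nodes: Dict[str, FrozenSet[str]]) -> List[Tuple[str, str]]:
    """Return (parent, child) edges where the child's attribute set is a
    proper subset of the parent's.

    Assumption: only direct links are kept, so a pair is skipped when some
    other node's attribute set lies strictly between the two.
    """
    edges = []
    for parent, p_attrs in nodes.items():
        for child, c_attrs in nodes.items():
            if not c_attrs < p_attrs:
                continue  # not a proper subset, so no edge
            # Skip indirect links that pass through an intermediate node.
            has_middle = any(c_attrs < m < p_attrs for m in nodes.values())
            if not has_middle:
                edges.append((parent, child))
    return sorted(edges)


# Hypothetical attribute names, chosen only for the example.
example_nodes = {
    "all": frozenset({"city.name", "city.population", "transit.route"}),
    "city": frozenset({"city.name", "city.population"}),
    "names": frozenset({"city.name"}),
    "transit": frozenset({"transit.route"}),
}

for parent, child in subset_edges(example_nodes):
    print(f"{parent} -> {child}")
```

A user navigating such a graph would start at the broadest node and follow edges toward narrower attribute sets. The sketch covers only the graph structure; it does not model how users choose among edges or how an organization is scored.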
RONIN: Data Lake Exploration
TLDR
This demonstration presents RONIN, a tool that enables user exploration of a data lake by seamlessly integrating the two common modalities of discovery: data set search and navigation of a hierarchical structure.
Data lake concept and systems: a survey
TLDR
This survey reviews the development, definition, and architectures of data lakes and classifies existing data lake systems by the functions they provide, which makes it a useful technical reference for designing, implementing, and applying data lakes.
Towards Learned Metadata Extraction for Data Lakes
TLDR
This paper reports the results of a study applying Sato, a recent approach based on deep learning, to a real-world data set. It proposes a new direction that uses weak supervision, and presents results from an initial prototype built to generate labeled training data with little manual effort, with the aim of improving the performance of learned semantic type extraction approaches on new, unseen data sets.
Scalable Data Discovery Using Profiles
TLDR
This work defines a novel notion of join quality that relies on a metric considering both the containment and cardinality proportions between candidate attributes, and implements this approach in a system called NextiaJD, and presents extensive experiments to show the predictive performance and computational efficiency of this method.
A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science
TLDR
This paper showcases how KGLac facilitates data discovery and enrichment while developing an ML pipeline to evaluate potential gender salary bias in IT jobs, and harnesses a broad range of Machine Learning (ML) approaches with KGLac to enable automatic graph learning for advanced and semantic data discovery.
Automatic Tag Recommendation for the UN Humanitarian Data Exchange
TLDR
An approach for automatic tag recommendation for dataset repositories is developed, and the integration of the model is demonstrated in the Humanitarian Data Exchange, a real-world dataset repository in the social and humanitarian domains.
Software Foundations for Data Interoperability and Large Scale Graph Data Analytics: 4th International Workshop, SFDI 2020, and 2nd International Workshop, LSGDA 2020, held in Conjunction with VLDB 2020, Tokyo, Japan, September 4, 2020, Proceedings
TLDR
Three attribute diversified community models are introduced in which attribute diversification takes different roles for presenting objective, query requirement, and constraint in order to find communities that are both structure and attribute cohesive.
MATE: Multi-Attribute Table Extraction
TLDR
This paper introduces MATE, a table discovery system that leverages a novel hash-based index that enables n-ary join discovery through a space-efficient super key, and designs a filtering layer that uses a novel hash function, Xash, which allows the system to efficiently prune tables with non-joinable rows.
Effective and Scalable Data Discovery with NextiaJD
TLDR
NextiaJD proposes a ranking of candidate pairs according to their join quality, which is based on a novel similarity measure that considers both containment and cardinality proportions between candidate attributes.
WebLens: Towards Web-scale Data Integration, Training the Models
  • R. Khan, M. Gubanov
  • Computer Science
  • 2020 IEEE International Conference on Big Data (Big Data)
  • 2020
TLDR
WebLens, a scalable data integration system, first trains Deep Learning models to find and match semantically similar tables, then derives mediated schemas for these subsets to enable uniform access to all relevant data.

References

Data Lake Organization
TLDR
Through a formal user study, it is shown that navigation can help users discover relevant tables that cannot be found by keyword search and that data lake organizations take into account the data lake distribution and outperform an existing hand-curated taxonomy and a common baseline organization.
Interactive Navigation of Open Data Linkages
TLDR
The Toronto Open Data Search system offers users a highly interactive experience, making unrelated (and unlinked) dynamic collections of datasets appear as a richly connected cloud of data that can be navigated and combined easily in real time.
Facet discovery for structured web search: a query-log mining approach
TLDR
This paper models the user's faceted-search behavior using the intersection of web query logs with existing structured data and presents an automated solution that elicits user preferences on attributes and values, employing different disambiguation techniques ranging from simple keyword matching to more sophisticated probabilistic models.
Table Union Search on Open Data
TLDR
This work defines the table union search problem and presents a probabilistic solution for finding tables that are unionable with a query table within massive repositories, and proposes a data-driven approach that automatically determines the best model to use for each pair of attributes.
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
TLDR
The new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets and outperforms the state-of-the-art overlap set similarity search techniques on data lakes.
The Data Civilizer System
TLDR
Initial positive experiences are described that show the preliminary DATA CIVILIZER system shortens the time and effort required to find, prepare, and analyze data.
Finding related tables
TLDR
This work considers the problem of finding related tables in a large corpus of heterogeneous tables and proposes a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union.
Aurum: A Data Discovery System
TLDR
This paper introduces a two-step process which scales to large datasets and requires only one pass over the data, avoiding overloading the source systems, and introduces a resource-efficient sampling signature (RESS) method which works by using only a small sample of the data.
Being Bayesian About Network Structure. A Bayesian Approach to Structure Discovery in Bayesian Networks
TLDR
This paper shows how to efficiently compute a sum over the exponential number of networks that are consistent with a fixed order over network variables, and uses this result as the basis for an algorithm that approximates the Bayesian posterior of a feature.
Recovering Semantics of Tables on the Web
TLDR
This paper describes a system that attempts to recover the semantics of tables by enriching each table with additional annotations, leveraging a database of class labels and relationships automatically extracted from the Web.