Data Lake Management: Challenges and Opportunities

@article{Nargesian2019DataLM,
  title={Data Lake Management: Challenges and Opportunities},
  author={Fatemeh Nargesian and Erkang Zhu and Ren{\'e}e J. Miller and Ken Q. Pu and Patricia C. Arocena},
  journal={Proc. VLDB Endow.},
  year={2019},
  volume={12},
  pages={1986--1989}
}
The ubiquity of data lakes has created fascinating new challenges for data management research. In this tutorial, we review the state of the art in data management for data lakes. We consider how data lakes are introducing new problems, including dataset discovery, and how they are changing the requirements for classic problems, including data extraction, data cleaning, data integration, data versioning, and metadata management.
Loch Prospector: Metadata Visualization for Lakes of Open Data
TLDR
This work proposes Loch Prospector, a visualization that helps data management researchers explore and understand the most crucial structural aspects of Open Data, in particular metadata attributes, together with the associated task abstraction for their work.
Data lake concept and systems: a survey
TLDR
This survey reviews the development, definition, and architectures of data lakes and classifies existing data lake systems by the functions they provide, making it a useful technical reference for designing, implementing, and applying data lakes.
Data Management in the Data Lake: A Systematic Mapping
TLDR
The study reveals the data management steps that need to be followed in a decision process and the requirements to be respected, namely curation, quality evaluation, privacy preservation, and prediction.
Enterprise Data Lake Management in Business Intelligence and Analytics
The data lake has recently emerged as a scalable architecture for storing, integrating, and analyzing massive data volumes characterized by diverse data types, structures, and sources. While the data…
A Zone Reference Model for Enterprise-Grade Data Lake Management
TLDR
This work assesses existing zone models against requirements derived from multiple representative data analytics use cases in a real-world industry case and develops a detailed zone reference model for enterprise-grade data lake management.
Finding Related Tables in Data Lakes for Interactive Data Science
TLDR
This work develops search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables.
Architecting an Enterprise Data Lake, A Covid19 Case Study
TLDR
This paper presents a comprehensive data lake architecture that addresses the challenges posed by the rate of data growth, and conducts experiments with publicly available COVID19 datasets to validate the design and applicability of the proposed architecture for business analytics.
Towards Learned Metadata Extraction for Data Lakes
TLDR
This paper reports the results of a study applying Sato, a recent approach based on deep learning, to a real-world data set. It proposes a new direction based on weak supervision and presents results from an initial prototype that generates labeled training data with low manual effort, to improve the performance of learned semantic type extraction approaches on new, unseen data sets.
Spatial Data Lake for Smart Cities: From Design to Implementation
In this paper, we propose a methodology for designing a data lake dedicated to spatial data, along with an implementation of this framework. Inspired by previous proposals on general data lake…
A critical review of the data pipeline: how wastewater system operation flows from data to intelligence.
TLDR
This review paper examines the state of the art in the transformation of raw data into actionable insight, specifically for water resource recovery facility (WRRF) operation.

References

Constance: An Intelligent Data Lake System
TLDR
Constance is a data lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. It discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities.
CLAMS: Bringing Quality to Data Lakes
TLDR
This work presents CLAMS, a system to discover and enforce expressive integrity constraints from large amounts of lake data with very limited schema information (e.g., represented as RDF triples), and introduces a scale-out solution to efficiently detect errors in the raw data.
Optimizing Organizations for Navigating Data Lakes
TLDR
This work presents a new probabilistic model of how users interact with an organization and defines the likelihood of a user finding an attribute through the organization, using attribute values and, when available, metadata.
Draining the Data Swamp: A Similarity-based Approach
TLDR
This work argues that frameworks for specifying file similarity must be combined with human-in-the-loop interaction to aid automated organization. As an initial step, it classifies several dimensions along which items may be considered similar: the data, its origin, and its current characteristics.
Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets
TLDR
DATAMARAN is a tool that extracts structure from semi-structured log datasets with no human supervision. It can achieve 95% extraction accuracy on automatically collected log datasets from GitHub, a substantial 66% increase in accuracy compared to unsupervised schemes from prior work.
Aurum: A Data Discovery System
TLDR
This paper introduces a two-step process that scales to large datasets and requires only one pass over the data, avoiding overloading the source systems. It also introduces a resource-efficient sampling signature (RESS) method that uses only a small sample of the data.
Big data integration
  • D. Srivastava
  • Computer Science
  • 2013 IEEE 29th International Conference on Data Engineering (ICDE)
  • 2013
TLDR
This seminar explores the progress the data integration community has made on schema mapping, record linkage, and data fusion in addressing the novel challenges of big data integration, and identifies a range of open problems for the community.
Making Open Data Transparent: Data Discovery on Open Data
TLDR
Open Data poses interesting new challenges for data integration research. One of those challenges is data discovery: how can new data sets be found within this ever-expanding sea of Open Data?
DataHub: Collaborative Data Science & Dataset Version Management at Scale
TLDR
This work proposes a dataset version control system that gives users the ability to create, branch, merge, difference, and search large, divergent collections of datasets, and a platform, DATAHUB, that gives users the ability to perform collaborative data analysis building on this version control system.
The Data Civilizer System
TLDR
Initial positive experiences are described, showing that the preliminary DATA CIVILIZER system shortens the time and effort required to find, prepare, and analyze data.