Data Lake Management: Challenges and Opportunities

Fatemeh Nargesian, Erkang Zhu, Renée J. Miller, Ken Q. Pu, Patricia C. Arocena. Proc. VLDB Endow.
The ubiquity of data lakes has created fascinating new challenges for data management research. In this tutorial, we review the state of the art in data management for data lakes. We consider how data lakes are introducing new problems, including dataset discovery, and how they are changing the requirements for classic problems, including data extraction, data cleaning, data integration, data versioning, and metadata management.


Loch Prospector: Metadata Visualization for Lakes of Open Data

Loch Prospector is a visualization proposed to help data management researchers explore and understand the most crucial structural aspects of Open Data, in particular metadata attributes, along with the associated task abstraction for their work.

Data lake concept and systems: a survey

This survey reviews the development, definition, and architectures of data lakes and classifies existing data lake systems by the functions they provide, making it a useful technical reference for designing, implementing, and applying data lakes.

Data Management in the Data Lake: A Systematic Mapping

The study identifies the data management steps that need to be followed in a decision process and the requirements to be respected, namely curation, quality evaluation, privacy preservation, and prediction.

Enterprise Data Lake Management in Business Intelligence and Analytics

Concrete analytics projects of a global industrial enterprise are used to identify existing practical challenges, derive requirements for enterprise data lakes, and reveal research gaps in analytics practice.

A Zone Reference Model for Enterprise-Grade Data Lake Management

This work assesses existing zone models using requirements derived from multiple representative data analytics use cases of a real-world industry case and develops a zone reference model for enterprise-grade data lake management in a detailed manner.

Finding Related Tables in Data Lakes for Interactive Data Science

This work develops search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables.

Architecting an Enterprise Data Lake, A Covid19 Case Study

This paper aims to provide a comprehensive data lake architecture that addresses the challenges posed by the rapid growth of data, and conducts experiments with publicly available COVID-19 datasets to validate the design and applicability of the proposed architecture for business analytics.

Demand-Driven Data Provisioning in Data Lakes: BARENTS — A Tailorable Data Preparation Zone

A tailorable data preparation zone for data lakes, called BARENTS, is introduced. It enables users to model in an ontology how to derive information from data and to assign that information to use cases; the data is then automatically processed and made available to the appropriate use cases.

Towards Learned Metadata Extraction for Data Lakes

This paper reports the results of a study applying Sato, a recent approach based on deep learning, to a real-world data set. It proposes a new direction based on weak supervision and presents results from an initial prototype that generates labeled training data with little manual effort, with the goal of improving the performance of learned semantic type extraction approaches on new, unseen data sets.

Spatial Data Lake for Smart Cities: From Design to Implementation

A methodology for designing a data lake dedicated to spatial data is proposed, together with an implementation of this framework that offers uniform management of the spatial and thematic information embedded in the elements of the data lake.



Constance: An Intelligent Data Lake System

Constance is a data lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. It discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities.

CLAMS: Bringing Quality to Data Lakes

CLAMS is a system that discovers and enforces expressive integrity constraints from large amounts of lake data with very limited schema information (e.g., represented as RDF triples), and introduces a scale-out solution to efficiently detect errors in the raw data.

Optimizing Organizations for Navigating Data Lakes

Navigation is shown to help users discover relevant tables that cannot be found by keyword search. In this study, 42% of users preferred navigation and 58% preferred keyword search, suggesting that the two are complementary and both useful modalities for data discovery in data lakes.

Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets

DATAMARAN is a tool that extracts structure from semi-structured log datasets with no human supervision, and can achieve 95% extraction accuracy on automatically collected log datasets from GitHub, a substantial 66% increase in accuracy compared to unsupervised schemes from prior work.

Aurum: A Data Discovery System

This paper introduces a two-step process that scales to large datasets and requires only one pass over the data, avoiding overloading the source systems, and a resource-efficient sampling signature (RESS) method that works using only a small sample of the data.

Big data integration

  • Xin Dong, D. Srivastava
  • Computer Science
  • 2013 IEEE 29th International Conference on Data Engineering (ICDE)
  • 2013
This seminar explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

DataHub: Collaborative Data Science & Dataset Version Management at Scale

This work proposes a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and a platform, DATAHUB, that gives users the ability to perform collaborative data analysis building on this version control system.

The Data Civilizer System

Initial positive experiences are described that show the preliminary DATA CIVILIZER system shortens the time and effort required to find, prepare, and analyze data.

Open Data Integration

A new paradigm for thinking about integration is introduced where the focus is on data discovery, but highly efficient internet-scale discovery that is driven by data analysis needs.

Goods: Organizing Google's Datasets

Goods is a project to rethink how structured datasets are organized at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them.