Lineage tracing for general data warehouse transformations

Yingwei Cui and Jennifer Widom. The VLDB Journal.
Abstract. Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information. During the integration process, source data typically undergoes a series of transformations, which may vary from simple algebraic operations or aggregations to complex “data cleansing” procedures. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items… 

Data Lineage and Meta Data Analysis in Data Warehouse Environments

This thesis proposes a system to compute the lineage, targeted at business users without technical knowledge about IT systems, and develops an algorithm that is capable of extracting components generically, based on their semantic meaning and relation to each other.

Metadata to Support Transformations and Data & Metadata Lineage in a Warehousing Environment

This work proposes integrating metadata captured during transformation processes using the CWM metadata standard in order to enable data and metadata lineage and presents a tool specially developed for performing this task.

Tracing Data Lineage Using Schema Transformation Pathways

This chapter proposes a new approach for tracing data lineage which is based on schema transformation pathways and shows how the individual transformation steps in a transformation pathway can be used to trace the derivation of the integrated data in a step-wise fashion.

Investigating a heterogeneous data integration approach for data warehousing

This thesis investigates how AutoMed metadata can be used to express the schemas present in a data warehouse environment and to represent data warehouse processes such as data transformation, data cleansing, data integration, and data summarization, and discusses how the approach can be used for handling schema evolution in such a materialised data integration scenario.

Data Provenance in ETL Scenarios Extract – Transform – Load Processes

The term ETL is used in a wider sense to refer to any process of exchanging and transforming data between data stores, and is the core functionality of a recently emerging type of web application, where information is extracted from various web sites and is appropriately transformed and integrated before being presented to the final user.

Data Lineage Tracing in Data Warehousing Environments

This paper extends the DLT approach to using full schema transformation pathways and discusses the problem of lineage data ambiguities, finding derivations of integrated data in integrated database systems.

Tracing Data Lineage Using Automed Schema Transformation Pathways

AutoMed is a database transformation and integration system supporting both virtual and materialized integration of schemas expressed in a variety of modelling languages, in which a set of primitive schema transformations operates on HDM schemas.

Using AutoMed for Data Warehousing

This paper discusses how the AutoMed approach can be used for data warehousing processes, especially for Data Lineage Tracing (DLT) in a heterogeneous data warehousing environment.

Lineage Tracing in Mediator-Based Information Integration Systems

This paper studies the lineage tracing problem in mediator-based systems and proposes a solution by collecting “enough” data and metadata during query processing so that tracing is possible in some situations, e.g., when a source becomes unavailable.

Tracing Lineage Beyond Relational Operators

A novel technique is proposed that enables the tracing of lineage of data generated by an arbitrary function and does not require any high-level description of the function or even the source code and can help identify limitations in the function itself.



Lineage tracing in data warehouses

This thesis presents formal definitions of data lineage for data warehouses defined as relational materialized views over relational sources, and for warehouses defined using graphs of general data transformations, along with algorithms for lineage tracing and a suite of optimization techniques.

Tracing the lineage of view data in a warehousing environment

The lineage problem is formally defined, lineage tracing algorithms for relational views with aggregation are developed, and mechanisms for performing consistent lineage tracing in a multisource data warehousing environment are proposed.

Efficient resumption of interrupted warehouse loads

This work develops a resumption algorithm called DR that imposes no overhead and relies only on the high-level properties of the transformations of the data and shows that DR can lead to a ten-fold reduction in resumption time by performing experiments using commercial software.

Transforming Heterogeneous Data with Database Middleware: Beyond Integration

This paper looks at database middleware systems as transformation engines, and discusses when and how data is transformed to provide users with the information they need.

Data extraction and transformation for the data warehouse

The data warehouse must replace old legacy applications for effective information processing, and it is necessary to understand the root causes of the difficulty in getting information in the first place.

An Interactive Framework for Data Cleaning

An interactive framework for data cleaning that tightly integrates transformation and discrepancy detection is presented, and a set of transforms is chosen that can be used for transformations within data records as well as for higher-order transformations.

Supporting fine-grained data lineage in a database visualization environment

This paper proposes a novel method to support fine-grained data lineage that lazily computes the lineage using a limited amount of information about the processing operators and the base data, and introduces the notions of weak inversion and verification.

An overview of data warehousing and OLAP technology

An overview of data warehousing and OLAP technologies, with an emphasis on their new requirements, is provided, based on a tutorial presented at the VLDB Conference, 1996.

Managing Derived Data in the Gaea Scientific DBMS

This paper presents a framework for capturing and managing scientific data derivation histories as implemented in the Gaea scientific database management system and proposes to extend current semantic modeling and object-oriented technology with special constructs: concepts, processes, and tasks.

SchemaSQL - A Language for Interoperability in Relational Multi-Database Systems

A principled extension of SQL, called SchemaSQL, is provided that offers uniform manipulation of data and meta-data in relational multi-database systems and thereby facilitates interoperability and data/meta-data management in such systems.