Why and Where: A Characterization of Data Provenance

  • Peter Buneman, Sanjeev Khanna, Wang Chiew Tan
  • International Conference on Database Theory
With the proliferation of database views and curated databases, the issue of data provenance - where a piece of data came from and the process by which it arrived in the database - is becoming increasingly important, especially in scientific databases where understanding provenance is crucial to the accuracy and currency of data. In this paper we describe an approach to computing provenance when the data of interest has been created by a database query. We adopt a syntactic approach and present… 

A Copy-and-Paste Model for Provenance in Curated Databases

This paper introduces a query language that can express relational queries and an update language that extends the copy-paste language with bulk operations, and shows that the languages are equivalent in expressive power but not equivalent in provenance behavior.

Research Problems in Data Provenance

  • W. Tan
  • Computer Science
    IEEE Data Eng. Bull.
  • 2004
This article motivates the problem of supporting data provenance in scientific database applications and describes DBNotes, a prototype developed at UC Santa Cruz that can be used to “eagerly” trace the provenance and flow of relational data.

Provenance in Spatial Queries

This paper deals with the computation of How-, Why-, and Where-provenance in spatial database queries and presents an evaluation of how the formalism and methods proposed to deal with general-purpose database queries behave when dealing with spatial data.

A Declarative Query Language for Data Provenance (Research Track)

This paper introduces ProvQL, a novel high-level structured query language for seeking information related to data provenance. It treats provenance information as a first-class citizen and allows formulating queries about the sources that contributed to data generation and the operations involved.

Conceptual modeling of data with provenance

This work contributes a conceptual model for data and provenance, and evaluates how well it addresses opportunities to make provenance easy to manage and query, and defines a benchmark suite with which to study performance of this model.

Language-integrated provenance

Provenance, or information about the origin or derivation of data, is important for assessing the trustworthiness of data and identifying and correcting mistakes. Most prior implementations of data

A survey of data provenance techniques

The main aspect of the taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it; the taxonomy can help those building scientific and business metadata-management systems understand existing provenance system designs.

Tracing where and who provenance in Linked Data: A calculus

Provenance in a Modifiable Data Set

  • Jing Zhang, H. Jagadish
  • Computer Science
    In Search of Elegance in the Theory and Practice of Computation
  • 2013
This chapter addresses two key problems: provenance in the context of data modifications, and how the data sources used for deriving results of interest are modified over time.

Provenance in Databases: Past, Current, and Future

  • W. Tan
  • Computer Science
    IEEE Data Eng. Bull.
  • 2007
An overview of research in provenance in databases is provided and some future research directions are discussed, based on the tutorial presented at SIGMOD 2007.



Supporting fine-grained data lineage in a database visualization environment

This paper proposes a novel method to support fine-grained data lineage that lazily computes the lineage using a limited amount of information about the processing operators and the base data, and introduces the notions of weak inversion and verification.

A query language and optimization techniques for unstructured data

Here a simple language UnQL is proposed for querying data organized as a rooted, edge-labeled graph and it is shown that known optimization techniques for operators on flat relations apply to the "horizontal" dimension of UnQL.

The Lorel query language for semistructured data

The main novelties of the Lorel language are the extensive use of coercion to relieve the user from the strict typing of OQL, which is inappropriate for semistructured data; and powerful path expressions, which permit a flexible form of declarative navigational access and are particularly suitable when the details of the structure are not known to the user.

Practical lineage tracing in data warehouses

  • Yingwei Cui, J. Widom
  • Computer Science
    Proceedings of 16th International Conference on Data Engineering
  • 2000
A lineage tracing package for relational views with aggregation is implemented in the WHIPS data warehousing system prototype at Stanford, and a number of schemes for storing auxiliary views that enable consistent and efficient lineage tracing in a multi-source data warehouse are proposed.

View maintenance in a warehousing environment

This work introduces a new algorithm, ECA (for "Eager Compensating Algorithm"), that eliminates the anomalies of previous incremental view maintenance algorithms by issuing extra "compensating" queries.

Object exchange across heterogeneous information sources

An object-based information exchange model and a corresponding query language are defined; both are well suited to integrating diverse information sources and are used to integrate heterogeneous bibliographic information sources.

Data on the Web: From Relations to Semistructured Data and XML

Topics explained include a syntax for data, typing semistructured data, the Lore system, and database products supporting XML.

Foundations of Databases

This book discusses Languages, Computability, and Complexity, and the Relational Model, which aims to clarify the role of Semantic Data Models in the development of Query Language Design.

On conjunctive queries containing inequalities

Algorithms for containment and equivalence of such “inequality queries” are given, under the assumption that the data domains are dense and totally ordered.