Putting Lipstick on Pig: Enabling Database-style Workflow Provenance

Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, Julia Stoyanovich, Val Tannen
Proc. VLDB Endow.

Workflow provenance typically assumes that each module is a "black-box", so that each output depends on all inputs (coarse-grained dependencies). Furthermore, it does not model the internal state of a module, which can change between repeated executions. In practice, however, an output may depend on only a small subset of the inputs (fine-grained dependencies) as well as on the internal state of the module. We present a novel provenance framework that marries database-style and workflow-style… 

OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance from Database Logs

OneProvenance addresses the unique challenges of log-based extraction by identifying query execution dependencies through efficient log analysis, extracting provenance through novel event transformations that account for query dependencies, and introducing effective filtering optimizations.

Capturing and Querying Structural Provenance in Spark with Pebble

Pebble is demonstrated, a system for capturing and querying a new type of provenance on nested data in Spark called structural provenance, which captures access and modification of top-level as well as nested data items, and allows querying the provenance of nested items based on tree-pattern-matching.

Fine-Grained Provenance for Matching & ETL

PROVision is proposed, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects and extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation.

Ariadne: managing fine-grained provenance on data streams

A novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network to reduce the computational and storage overhead of provenance generation and retrieval is introduced.

Towards Integrating Workflow and Database Provenance

This paper addresses the mismatch between the different kinds of provenance by using a temporal model which explicitly represents the database states as updates are applied, and discusses how reproducibility is obtained for workflows that manipulate databases, and how different queries that straddle the two provenance traces can be evaluated.

CF-PROV: A Content-Rich and Fine-Grained Scientific Workflow Provenance Model

This model provides normative transformations and documentation declarations for the multi-field SWFs, reducing the programming overhead and increasing the versatility, and the experiments on the model compression ratio and model generation time in multiple scientific fields demonstrate the versatility and rationality of the CF-PROV.

Composition and Substitution in Provenance and Workflows

A model and operations are proposed that indicate that a basic adjustment to provenance models is needed if they are properly to accommodate such an operational approach to composition and substitution.

Logical Provenance in Data-Oriented Workflows (Long Version)

A new general definition of provenance for general transformations is given, introducing the notions of correctness, precision, and minimality, and a simple logical-provenance specification language consisting of attribute mappings and filters is described.

Workflow Provenance for Big Data: From Modelling to Reporting

This work proposes a programming model for automated workflow logging, applies it to bioinformatics research for evaluation, collects workflow logs from the executions of various scientific pipelines, and focuses on some fundamental provenance questions inspired by recent literature.

Scalable Provenance Storage and Querying Using Pig Latin for Big Data Workflows

This paper leverages Pig Latin, a high-level platform for creating programs that run on Apache Hadoop, and OPQL, a graph-level provenance query language, to build a scalable provenance storage and querying system for big data workflows.

A Graph Model of Data and Workflow Provenance

A previously developed dataflow language, which supports both database-style querying and workflow-style batch processing steps, is extended to produce a workflow-style provenance graph that can be explicitly queried, and an executable definition of the graph semantics of dataflow expressions is given.

Layering in Provenance Systems

A provenance collection structure facilitating the integration of provenance across multiple levels of abstraction is designed, including a workflow engine, a web browser, and an initial runtime Python provenance tracking wrapper that sits atop provenance-aware network storage that builds upon a Provenance-Aware Storage System (PASS).

Provenance for Generalized Map and Reduce Workflows

It is shown how data provenance can be captured for map and reduce functions transparently and used to support backward tracing and forward tracing, and properties that are guaranteed to hold when provenance is applied recursively are identified.

Karma2: Provenance Management for Data-Driven Workflows

This work addresses the challenge to record uniform and usable provenance metadata that meets the domain needs while minimizing the modification burden on the service authors and the performance overhead on the workflow engine and the services.

Fine-grained and efficient lineage querying of collection-based workflow provenance

This paper provides an approach to provenance querying that avoids joins over provenance logs by using information about the workflow definition to inform the construction of queries that directly target relevant lineage results, and provides fine-grained provenance querying, even for workflows that create and consume collections.

On the expressiveness of implicit provenance in query and update languages

The model is also relevant to annotation propagation schemes in which annotations on the input to a query or update have to be transferred to the output or vice versa, and it is shown that provenance separates the expressive power of query and update languages.

Ibis: A Provenance Manager for Multi-Layer Systems

The central contribution of the work is a formal model of multi-granularity data provenance relationships, and a corresponding query language, and the simplicity and power of the query language are illustrated via several real-world-inspired examples.

Annotated XML: queries and provenance

A formal framework for capturing the provenance of data appearing in XQuery views of XML is presented; it decorates unordered XML with annotations from commutative semirings and shows that these annotations suffice for a large positive fragment of XQuery applied to this data.

Provenance and scientific workflows: challenges and opportunities

This tutorial provides an overview of research issues in provenance for scientific workflows, with a focus on recent literature and technology in this area, aimed at a general database research audience and at people who work with scientific data and workflows.

Provenance for aggregate queries

This work proposes a new approach to capture provenance by annotating with provenance information not just tuples but also the individual values within tuples, using provenance to describe how the values are computed.