Putting Lipstick on Pig: Enabling Database-style Workflow Provenance
@article{Amsterdamer2011PuttingLO,
  title={Putting Lipstick on Pig: Enabling Database-style Workflow Provenance},
  author={Yael Amsterdamer and Susan B. Davidson and Daniel Deutch and Tova Milo and Julia Stoyanovich and Val Tannen},
  journal={Proc. VLDB Endow.},
  year={2011},
  volume={5},
  pages={346-357}
}
Workflow provenance typically assumes that each module is a "black-box", so that each output depends on all inputs (coarse-grained dependencies). Furthermore, it does not model the internal state of a module, which can change between repeated executions. In practice, however, an output may depend on only a small subset of the inputs (fine-grained dependencies) as well as on the internal state of the module. We present a novel provenance framework that marries database-style and workflow-style…
158 Citations
OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance from Database Logs
- 2022
Computer Science
ArXiv
OneProvenance addresses the unique challenges of log-based extraction by identifying query execution dependencies through efficient log analysis, extracting provenance through novel event transformations that account for query dependencies, and introducing effective filtering optimizations.
Capturing and Querying Structural Provenance in Spark with Pebble
- 2019
Computer Science
SIGMOD Conference
Pebble is demonstrated, a system for capturing and querying a new type of provenance on nested data in Spark called structural provenance, which captures access and modification of top-level as well as nested data items, and allows querying the provenance of nested items based on tree-pattern-matching.
Fine-Grained Provenance for Matching & ETL
- 2019
Computer Science
2019 IEEE 35th International Conference on Data Engineering (ICDE)
PROVision is proposed, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects; it extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation.
Ariadne: managing fine-grained provenance on data streams
- 2013
Computer Science
DEBS '13
A novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network to reduce the computational and storage overhead of provenance generation and retrieval is introduced.
Towards Integrating Workflow and Database Provenance
- 2012
Computer Science
IPAW
This paper addresses the mismatch between the different kinds of provenance by using a temporal model which explicitly represents the database states as updates are applied, and discusses how reproducibility is obtained for workflows that manipulate databases, and how different queries that straddle the two provenance traces can be evaluated.
CF-PROV: A Content-Rich and Fine-Grained Scientific Workflow Provenance Model
- 2019
Computer Science
IEEE Access
This model provides normative transformations and documentation declarations for the multi-field SWFs, reducing the programming overhead and increasing the versatility, and the experiments on the model compression ratio and model generation time in multiple scientific fields demonstrate the versatility and rationality of the CF-PROV.
Composition and Substitution in Provenance and Workflows
- 2016
Computer Science
TaPP
A model and operations are proposed that indicate that a basic adjustment to provenance models is needed if they are properly to accommodate such an operational approach to composition and substitution.
Logical Provenance in Data-Oriented Workflows (Long Version)
- 2012
Computer Science
A new general definition of provenance for general transformations is given, introducing the notions of correctness, precision, and minimality, and a simple logical-provenance specification language consisting of attribute mappings and filters is described.
Workflow Provenance for Big Data: From Modelling to Reporting
- 2019
Computer Science
This work proposes a programming model for automated workflow logging, implements it for evaluation in bioinformatics research, collects workflow logs from the executions of various scientific pipelines, and focuses on some fundamental provenance questions inspired by recent literature.
Scalable Provenance Storage and Querying Using Pig Latin for Big Data Workflows
- 2017
Computer Science
2017 IEEE International Conference on Services Computing (SCC)
This paper leverages Pig Latin, a high-level platform for creating programs that run on Apache Hadoop, and OPQL, a graph-level provenance query language, to build a scalable provenance storage and querying system for big data workflows.
30 References
A Graph Model of Data and Workflow Provenance
- 2010
Computer Science
TaPP
A previously-developed dataflow language is extended which supports both database-style querying and workflow-style batch processing steps to produce a workflow-style provenance graph that can be explicitly queried and gives an executable definition of the graph semantics of dataflow expressions.
Layering in Provenance Systems
- 2009
Computer Science
USENIX Annual Technical Conference
A provenance collection structure facilitating the integration of provenance across multiple levels of abstraction is designed, including a workflow engine, a web browser, and an initial runtime Python provenance tracking wrapper that sits atop provenance-aware network storage that builds upon a Provenance-Aware Storage System (PASS).
Provenance for Generalized Map and Reduce Workflows
- 2011
Computer Science
CIDR
It is shown how data provenance can be captured for map and reduce functions transparently and used to support backward tracing and forward tracing, and properties that are guaranteed to hold when provenance is applied recursively are identified.
Karma2: Provenance Management for Data-Driven Workflows
- 2008
Computer Science
Int. J. Web Serv. Res.
This work addresses the challenge to record uniform and usable provenance metadata that meets the domain needs while minimizing the modification burden on the service authors and the performance overhead on the workflow engine and the services.
Fine-grained and efficient lineage querying of collection-based workflow provenance
- 2010
Computer Science
EDBT '10
This paper provides an approach to provenance querying that avoids joins over provenance logs by using information about the workflow definition to inform the construction of queries that directly target relevant lineage results, and provides fine-grained provenance querying, even for workflows that create and consume collections.
On the expressiveness of implicit provenance in query and update languages
- 2008
Computer Science, Linguistics
TODS
The model is also relevant to annotation propagation schemes in which annotations on the input to a query or update have to be transferred to the output or vice versa, and it is shown that provenance separates the expressive power of query and update languages.
Ibis: A Provenance Manager for Multi-Layer Systems
- 2011
Computer Science
CIDR
The central contribution of the work is a formal model of multi-granularity data provenance relationships, and a corresponding query language, and the simplicity and power of the query language are illustrated via several real-world-inspired examples.
Annotated XML: queries and provenance
- 2008
Computer Science
PODS
A formal framework for capturing the provenance of data appearing in XQuery views of XML is presented; it decorates unordered XML with annotations from commutative semirings and shows that these annotations suffice for a large positive fragment of XQuery applied to this data.
Provenance and scientific workflows: challenges and opportunities
- 2008
Computer Science
SIGMOD Conference
This tutorial provides an overview of research issues in provenance for scientific workflows, with a focus on recent literature and technology in this area, aimed at a general database research audience and at people who work with scientific data and workflows.
Provenance for aggregate queries
- 2011
Computer Science
PODS
This work proposes a new approach to capture provenance by annotating with provenance information not just tuples but also the individual values within tuples, using provenance to describe the computation of those values.