Provenance Summaries for Answers and Non-Answers

@article{Lee2018ProvenanceSF,
  title={Provenance Summaries for Answers and Non-Answers},
  author={Seokki Lee and Bertram Lud{\"a}scher and Boris Glavic},
  journal={Proc. VLDB Endow.},
  year={2018},
  volume={11},
  pages={1954-1957}
}
Explaining why an answer is (not) in the result of a query has proven to be of immense importance for many applications. However, why-not provenance, and to a lesser degree also why-provenance, can be very large, even for small input datasets. The resulting scalability and usability issues have limited the applicability of provenance. We present PUG, a system for why and why-not provenance that applies a range of novel techniques to overcome these challenges. Specifically, PUG limits…
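To make the why/why-not distinction concrete, the following is a minimal sketch in Python (a toy illustration, not PUG's implementation), assuming a hypothetical binary relation hop and the 2-hop query Q(x, z) :- hop(x, y), hop(y, z): why-provenance enumerates the successful derivations (witnesses) of an existing answer, while why-not provenance enumerates the failed derivations of a missing one.

# Minimal illustrative sketch (not PUG itself); the relation `hop` and its data are hypothetical.
# Query: Q(x, z) :- hop(x, y), hop(y, z).

hop = {("a", "b"), ("b", "c"), ("b", "d")}        # toy input relation
domain = sorted({v for t in hop for v in t})      # active domain: a, b, c, d

def why(x, z):
    """Why-provenance: witnesses (pairs of input tuples) that derive Q(x, z)."""
    return [((x, y), (y, z)) for y in domain
            if (x, y) in hop and (y, z) in hop]

def why_not(x, z):
    """Why-not provenance: for a missing answer Q(x, z), list every failed
    derivation, i.e., for each intermediate y, the required tuples that are absent."""
    failures = []
    for y in domain:
        missing = [t for t in ((x, y), (y, z)) if t not in hop]
        if missing:
            failures.append({"via": y, "missing": missing})
    return failures

print(why("a", "c"))      # one witness: (('a', 'b'), ('b', 'c'))
print(why_not("a", "a"))  # four failed derivations, one per candidate intermediate node

Even in this toy setting, the why-not side grows with the active domain rather than with the data that actually contributes to an answer, which is the kind of blow-up the abstract refers to.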

Citations

Approximate summaries for why and why-not provenance
TLDR
This work develops techniques for efficiently computing provenance summaries that balance informativeness, conciseness, and completeness, and is the first to both scale to large datasets and generate comprehensive and meaningful summaries.
Provenance-based Data Skipping
TLDR
PBDS is a novel approach that generates provenance sketches to concisely encode what data is relevant for a query; once a provenance sketch has been captured, it is used to speed up subsequent queries.
Provenance-based Data Skipping (TechReport)
TLDR
Provenance-based data skipping (PBDS) is developed: a novel approach that generates provenance sketches to concisely encode what data is relevant for a query; a captured sketch is then used to speed up subsequent queries.
Provenance for Large-scale Datalog (arXiv preprint [cs.PL], 11 Jul 2019)
TLDR
A novel bottom-up Datalog evaluation strategy for debugging that relies on a new provenance lattice that includes proof annotations, and a new fixed-point semantics for semi-naïve evaluation, which has a runtime overhead of 1.27× on average while being more flexible than existing state-of-the-art techniques.
Provenance for Large-scale Datalog
TLDR
A novel bottom-up Datalog evaluation strategy for debugging that relies on a new provenance lattice that includes proof annotations, and a new fixed-point semantics for semi-naïve evaluation, and is shown to be more flexible than existing state-of-the-art techniques.
Debugging Large-scale Datalog
TLDR
A novel bottom-up Datalog evaluation strategy for debugging that relies on a new provenance lattice that includes proof annotations and a new fixed-point semantics for semi-naïve evaluation, which has a runtime overhead of 1.31× on average while being more flexible than existing state-of-the-art techniques.
Provenance in Temporal Interaction Networks
TLDR
This work investigates several quantity selection policies that apply to different application scenarios and proposes space- and time-efficient meta-data propagation mechanisms for continuously tracking provenance at vertices.
Contribution Maximization in Probabilistic Datalog
TLDR
An optimized algorithm is proposed which injects a refined variant of the classic Magic Sets technique, integrated with a sampling method, into IM algorithms, achieving a significant saving of space and execution time.
ML Based Lineage in Databases
TLDR
A novel approach for approximating lineage tracking is presented, using a Machine Learning (ML) and Natural Language Processing (NLP) technique, namely word embedding; an alternative lineage tracking mechanism is also designed that keeps track of and queries lineage at the column (“gene”) level to better distinguish between the provenance features and the textual characteristics of a tuple.
U4U: Taming Uncertainty with Uncertainty-Annotated Databases (Division: CISE/IIS/III)
TLDR
If ignored, data uncertainty can result in hard-to-trace errors in analytical results, which in turn can have severe real-world implications like unfounded scientific discoveries, financial damages, or even effects on people’s physical well-being.

References

Integrating Approximate Summarization with Provenance Capture
TLDR
An (approximate) summarization technique that generates compact representations of why and why-not provenance using patterns as a summarized representation of sets of elements from the provenance, i.e., successful or failed derivations.
A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries
TLDR
This work presents the first practical approach for answering why and why-not provenance questions for queries with negation (first-order queries) and significantly outperforms an earlier approach which instantiates the full provenance to compute explanations.
Provenance for Natural Language Queries
TLDR
This work develops a novel method for transforming provenance information to NL, by leveraging the original NL query structure, and presents two solutions for its effective presentation as NL text: one based on provenance factorization, with novel desiderata relevant to the NL case, and one that is based on summarization.
Selective Provenance for Datalog Programs Using Top-K Queries
TLDR
A novel top-k query language for querying Datalog provenance, supporting selection criteria based on tree patterns and ranking based on the rules and database facts used in derivation, and an efficient novel algorithm based on instrumenting the Datalog program so that it generates only relevant provenance.
Explaining missing answers to SPJUA queries
TLDR
The algorithms used to generate a correct, finite, and, when possible, minimal set of explanations in queries that include selection, projection, join, union, aggregation and grouping (SPJUA) are described.
The Complexity of Causality and Responsibility for Query Answers and non-Answers
TLDR
This paper adapts Halpern, Pearl, and Chockler's recent definitions of causality and responsibility to define the causes of answers and non-answers to queries, and their degree of responsibility, and demonstrates a dichotomy between PTIME and NP-complete cases.
Provenance semirings
We show that relational algebra calculations for incomplete databases, probabilistic databases, bag semantics and why-provenance are particular cases of the same general algorithms involving semirings…
Interpretable and Informative Explanations of Outcomes
In this paper, we solve the following data summarization problem: given a multi-dimensional data set augmented with a binary attribute, how can we construct an interpretable and informative summary…
Provenance for natural language queries. PVLDB, 2017.
Explaining Missing Answers to SPJUA Queries. PVLDB, 2010.