Corpus ID: 6296617

PROBER: Ad-Hoc Debugging of Extraction and Integration Pipelines

A. Sarma, A. Jain, P. Bohannon
Complex information extraction (IE) pipelines assembled by plumbing together off-the-shelf operators, specially customized operators, and operators re-used from other text processing pipelines are becoming an integral component of most text processing frameworks. A critical task faced by the IE pipeline user is to run a post-mortem analysis on the output. Due to the diverse nature of extraction operators (often implemented by independent groups), it is time consuming and error-prone to describe…
Newt: an architecture for lineage-based replay and debugging in DISC systems
Newt, a scalable architecture for capturing fine-grain lineage from DISC systems and using this information to analyze and debug analytics, is presented; it can accurately replay selected outputs, which can reduce the time to recreate errors during debugging.
Scalable lineage capture for debugging DISC analytics
Newt, a scalable architecture for capturing and using record-level data lineage to discover and resolve errors in analytics, is presented. While active collection can be expensive, it is found to incur modest runtime overheads for real-world analytics and to enable novel lineage-based debugging techniques.


Declarative Information Extraction Using Datalog with Embedded Extraction Predicates
This paper argues that developing information extraction programs using Datalog with embedded procedural extraction predicates is a good way to proceed, and shows how optimizing such programs raises challenges specific to text data that cannot be accommodated in the current relational optimization framework.
Toward best-effort information extraction
iFlex, an IE approach that relaxes the precise IE requirement to enable best-effort IE, is proposed, in which a developer uses a declarative language to quickly write an initial approximate IE program P with a possible-worlds semantics.
Understanding provenance black boxes
A model of provenance answers that follow a “roll up”, “drill down” strategy is developed, and it is shown how this information can be captured by workflow management systems, and that the structures and information needed are a negligible addition to standard provenance stores.
I4E: interactive investigation of iterative information extraction
This paper develops an approach for interactive post-extraction investigation for IIE systems and formalizes three important phases of this investigation: explaining the IIE result, diagnosing the influential and problematic components, and repairing the output from an information extraction system.
A quality-aware optimizer for information extraction
This article shows how to use Receiver Operating Characteristic (ROC) curves to estimate the extraction quality in a statistically robust way and how to use ROC analysis to select the extraction parameters in a principled manner, and presents analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation.
Join Optimization of Information Extraction Output: Quality Matters!
A principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations is developed, arguing that the output quality is affected by the configuration of the IE systems used to process documents, the document retrieval strategies used to retrieve documents, and the actual join algorithm used.
Optimizing SQL Queries over Text Databases
This work studies a family of select-project-join SQL queries over text databases, and characterizes query processing strategies on their efficiency and, critically, on their result quality as well, to optimize the execution of SQL queries over text databases in a principled, cost-based manner.
Provenance in Databases: Past, Current, and Future
  • W. Tan
  • Computer Science
  • IEEE Data Eng. Bull.
  • 2007
An overview of research in provenance in databases is provided and some future research directions are discussed, based on the tutorial presented at SIGMOD 2007.
On the provenance of non-answers to queries over extracted data
This work focuses on providing provenance-style explanations for non-answers and develops a mechanism for providing this new type of provenance, and suggests that this approach can provide effective provenance information that can help a user resolve their doubts over non-answers to a query.
Exploring a Few Good Tuples from Text Databases
  • A. Jain, D. Srivastava
  • Computer Science
  • 2009 IEEE 25th International Conference on Data Engineering
  • 2009
The access model for information extraction is formalized, and efficient query processing algorithms for good(k, $\ell$) queries are investigated, which do not rely on any prior knowledge about the extraction task or the database.