Open Information Extraction from the Web

@inproceedings{Banko2008OpenIE,
  title={Open Information Extraction from the Web},
  author={Michele Banko and Michael J. Cafarella and Stephen Soderland and Matthew Broadhead and Oren Etzioni},
  booktitle={CACM},
  year={2008}
}
Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). [...] The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare…
TextRunner: Open Information Extraction on the Web
TLDR
The TextRunner system demonstrates a new kind of information extraction, called Open Information Extraction (OIE), in which the system makes a single, data-driven pass over the entire corpus and extracts a large set of relational tuples, without requiring any human input.
Open Information Extraction Using Wikipedia
TLDR
WOE is presented, an open IE system which improves dramatically on TextRunner's precision and recall and is a novel form of self-supervised learning for open extractors -- using heuristic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data.
Open Language Learning for Information Extraction
Open Information Extraction (IE) systems extract relational tuples from text, without requiring a pre-specified vocabulary, by identifying relation phrases and associated arguments in arbitrary
RDR-based open IE for the web document
TLDR
The key advantages of this approach are that it can handle the freer writing style that occurs in Web documents and can correct errors introduced by natural language pre-processing tools, whereas systems like TEXTRUNNER depend on the quality of the entity-tagging preprocessing in the training data.
Unsupervised Relation Extraction with General Domain Knowledge
TLDR
Evaluation results on the ACE 2007 English Relation Detection and Categorization (RDC) task show that the proposed model outperforms competitive unsupervised approaches by a wide margin and is able to produce clusters shaped by both the data and the rules.
Semi-supervised Bootstrapping of Relation Triples from the Web, Query Languages over these Noisy Triples, their Semantics, and Query Execution Systems
Information Extraction (IE) is the process of retrieving structured information from unstructured text. IE has traditionally relied on extended human interposition to extract a small set of predefined
The Tradeoffs Between Open and Traditional Relation Extraction
TLDR
A new model for Open IE called O-CRF is presented and it is shown that it achieves increased precision and nearly double the recall of the model employed by TEXTRUNNER, the previous state-of-the-art Open IE system.
Prioritization of Domain-Specific Web Information Extraction
TLDR
A novel prioritization approach where candidate pages from the corpus are ordered according to their expected contribution to the extraction results and those with higher estimated potential are extracted earlier, and it is demonstrated that EP significantly outperforms the naive approach and is more flexible than the classifier approach.
Redundancy in web-scaled information extraction: probabilistic model and experimental results
TLDR
A probabilistic model of the KnowItAll hypothesis, coupled with the redundancy of the Web, can power effective IE for arbitrary target concepts without hand-labeled data, and it is proved formally that under the assumptions of the model, "Probably Approximately Correct" IE can be attained from only unlabeled data.
Navigating Extracted Data with Schema Discovery
TLDR
TGen, an algorithm for schema discovery, is proposed, which automatically derives a high-quality relational schema for the extracted data, which is useful for exploring unfamiliar data or for composing queries over extracted data.

References

SHOWING 1-10 OF 149 REFERENCES
The Tradeoffs Between Open and Traditional Relation Extraction
TLDR
A new model for Open IE called O-CRF is presented and it is shown that it achieves increased precision and nearly double the recall of the model employed by TEXTRUNNER, the previous state-of-the-art Open IE system.
URES: an Unsupervised Web Relation Extraction System
TLDR
URES (Unsupervised Relation Extraction System) extracts relations from the Web in a totally unsupervised way, demonstrates that using a simple noun phrase tagger is sufficient as a base for accurate patterns, and compares the approach with KnowItAll's fixed generic patterns.
On-Demand Information Extraction
TLDR
On-demand Information Extraction (ODIE) aims to completely eliminate the customization effort; experimental results are reported in which the system created useful tables for many topics, demonstrating the feasibility of this approach.
Relational Web Search
TLDR
The extraction graph is a textual approximation to an entity-relationship graph, which is automatically extracted from Web pages, and TextRunner is a search engine that utilizes this representation to answer complex relational queries that are difficult to answer using today’s search engines or Web Information Extraction systems.
KnowItNow: Fast, Scalable Information Extraction from the Web
TLDR
A novel architecture for IE that obviates queries to commercial search engines is introduced, embodied in a system called KnowItNow that performs high-precision IE in minutes instead of days, and the tradeoff between recall and speed is quantified.
Scaling Textual Inference to the Web
TLDR
The Holmes system utilizes textual inference (TI) over tuples extracted from text to scale TI to a corpus of 117 million Web pages, and its runtime is linear in the size of its input corpus.
Unsupervised named-entity extraction from the Web: An experimental study
TLDR
An overview of KnowItAll's novel architecture and design principles is presented, emphasizing its distinctive ability to extract information without any hand-labeled training examples, and three distinct ways to address this challenge are presented and evaluated.
Unsupervised Resolution of Objects and Relations on the Web
TLDR
A scalable, fully-implemented system for SR that runs in O(KN log N) time in the number of extractions N and the maximum number of synonyms per word, K, and introduces a probabilistic relational model for predicting whether two strings are co-referential based on the similarity of the assertions containing them.
Information extraction from unstructured web text
TLDR
This thesis describes an extensible model for information extraction that takes advantage of the unique characteristics of Web text and leverages existent search engine technology in order to ensure the quality of the extracted information.
Information extraction from the web: techniques and applications
TLDR
This thesis examines methods for using extracted information in improving a particular kind of language processing tool, a parser, and investigates a method for resolving synonyms in extracted information.