Open Information Extraction from the Web
TLDR
Introduces Open IE (OIE), a new extraction paradigm in which the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input.
Web-scale information extraction in KnowItAll: (preliminary results)
TLDR
Introduces KnowItAll, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner.
WebTables: exploring the power of tables on the web
TLDR
The WEBTABLES system develops new techniques for keyword search over a corpus of tables, and shows that they can achieve substantially higher relevance than solutions based on a traditional search engine.
Data Integration for the Relational Web
TLDR
Octopus is a system that combines search, extraction, data cleaning, and integration, enabling users to create new data sets from those found on the Web. It offers the user a set of best-effort operators that automate the most labor-intensive tasks.
TextRunner: Open Information Extraction on the Web
TLDR
The TextRunner system demonstrates a new kind of information extraction, called Open Information Extraction (OIE), in which the system makes a single, data-driven pass over the entire corpus and extracts a large set of relational tuples, without requiring any human input.
Uncovering the Relational Web
TLDR
This paper gives an in-depth study of the Web's HTML table corpus, and describes a system for performing relation recovery that achieves precision and recall that are comparable to other domain-independent information extraction systems.
Automatic web spreadsheet data extraction
TLDR
A system is presented that automatically extracts relational data from spreadsheets, enabling relational spreadsheet integration and offering a novel view of how users organize their data in spreadsheets.
KnowItNow: Fast, Scalable Information Extraction from the Web
TLDR
A novel architecture for IE that obviates queries to commercial search engines is introduced, embodied in a system called KnowItNow that performs high-precision IE in minutes instead of days. The tradeoff between recall and speed is also quantified.
Automatic Optimization for MapReduce Programs
TLDR
Manimal automatically analyzes MapReduce programs and applies appropriate data-aware optimizations, requiring no additional help from the programmer, and is shown to yield speedups of up to 1,121% on previously written MapReduce programs.
...