Unsupervised named-entity extraction from the Web: An experimental study

Abstract

The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL’s novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 facts, but suggested a challenge: How can we improve KNOWITALL’s recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall. List Extraction locates lists of class instances, learns a “wrapper” for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL’s domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on named-entity extraction, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall, while maintaining high precision, and discovered over 10,000 cities missing from the Tipster Gazetteer.
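
The List Extraction step named in the abstract (locate a list, learn a “wrapper”, extract the list's elements) can be illustrated with a minimal sketch. This is not KNOWITALL's implementation: the function names `learn_wrapper` and `extract_list`, the heuristic of trimming the shared context to the innermost HTML tag, and the idea of passing known instances in as seeds are all assumptions made for illustration.

```python
import os
import re
from typing import List, Optional, Tuple


def learn_wrapper(page: str, seeds: List[str]) -> Optional[Tuple[str, str]]:
    """Learn a (prefix, suffix) pair that surrounds every seed on the page.

    Hypothetical helper: the wrapper is the text shared by the left and
    right contexts of all seeds, trimmed to the innermost tag.
    """
    lefts, rights = [], []
    for seed in seeds:
        start = page.find(seed)  # first occurrence only (simplification)
        if start == -1:
            return None  # every seed must appear on the page
        # Reverse the left context so commonprefix yields a common suffix.
        lefts.append(page[:start][::-1])
        rights.append(page[start + len(seed):])

    prefix = os.path.commonprefix(lefts)[::-1]
    suffix = os.path.commonprefix(rights)

    # Assumption: keep only the tag immediately around the seed, so the
    # wrapper also matches the first and last items of the list.
    if "<" in prefix:
        prefix = prefix[prefix.rfind("<"):]
    if ">" in suffix:
        suffix = suffix[: suffix.find(">") + 1]

    if not prefix or not suffix:
        return None
    return prefix, suffix


def extract_list(page: str, wrapper: Tuple[str, str]) -> List[str]:
    """Return every string on the page enclosed by the learned wrapper."""
    prefix, suffix = wrapper
    pattern = re.escape(prefix) + r"(.+?)" + re.escape(suffix)
    return re.findall(pattern, page)
```

Given a page containing an HTML list of city names and two known cities as seeds, `learn_wrapper` would return the tag pair around each item, and `extract_list` would then return the remaining items as candidate instances. A real system would still need to assess those candidates before accepting them, which this sketch does not attempt.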

DOI: 10.1016/j.artint.2005.03.001

Semantic Scholar estimates that this publication has 1,085 citations based on the available data.

Cite this paper

@article{Etzioni2005UnsupervisedNE,
  title={Unsupervised named-entity extraction from the Web: An experimental study},
  author={Oren Etzioni and Michael J. Cafarella and Doug Downey and Ana-Maria Popescu and Tal Shaked and Stephen Soderland and Daniel S. Weld and Alexander Yates},
  journal={Artif. Intell.},
  year={2005},
  volume={165},
  pages={91-134}
}