Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings

Abstract

Short listings such as classified ads or product listings abound on the web. If a computer could reliably extract information from them, a variety of applications would benefit greatly. Short listings are, however, challenging to process because of their informal style. In this paper, we present an unsupervised information extraction system for short listings. Given a corpus of listings, the system builds a semantic model that represents typical objects and their attributes in the domain of the corpus, and then uses the model to extract information. Two key features of the system are a semantic parser that extracts objects and their attributes and a listing-focused clustering module that helps group together extracted tokens of the same type. Our evaluation shows that the semantic model learned by these two modules is effective across multiple domains.

Cite this paper

@inproceedings{Kim2012BuildingAL,
  title={Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings},
  author={Doo Soon Kim and Kunal Verma and Peter Z. Yeh},
  booktitle={EMNLP-CoNLL},
  year={2012}
}