Gleaning information from the Web: Using Syntax to Filter out Irrelevant Information


In this paper we describe a system called Glean, which is predicated on the idea that any coherent text contains significant latent information, such as syntactic structure and patterns of language use, which can be used to enhance the performance of Information Retrieval systems. We propose an approach to information retrieval that makes use of syntactic information obtained using a tool called a supertagger. A supertagger is used on a corpus of training material to semi-automatically induce patterns that we call augmented patterns. We show how these augmented patterns may be used along with a standard Web search engine or an IR system to retrieve information, to identify relevant information, and to filter out irrelevant items. We describe an experiment in the domain of official appointments, where such patterns are shown to reduce the number of potentially irrelevant documents by upwards of [...].

Footnote: On leave from the National Centre for Software Technology, Gulmohar Cross Road No. [...], Juhu, Bombay, India.

Introduction

IR and WWW

Vast amounts of textual information are now available in machine-readable form, and a significant proportion of this is available over the World Wide Web (WWW). However, any particular user would typically be interested only in a fraction of the information available. The goal addressed by Information Retrieval (IR) systems and services in general, and by search engines on the Web in particular, is to retrieve all and only the information that is relevant to the query posed by a user.

Early information retrieval systems treated stored text as arbitrary streams of characters. Retrieval was usually based on exact word matching, and it did not matter if the stored text was in English, Hindi, Spanish, etc. Later IR systems treated text as a collection of words, and hence several new features were made possible, including the use of term expansion, morphological analysis, and phrase indexing. However, all these methods have their limitations, and there have been several attempts to go beyond them. See (Salton & McGill; Frakes & Baeza-Yates) for further details on work in information retrieval.

With the recent growth in activity on the Web, much more information has become accessible online. Several search engines have been developed to handle this explosion of information. These search engines typically explore hyperlinks on the Web and index the information that they encounter. All the information that they index thus becomes available to users' searches. As with most IR systems, these search engines use inverted indexes to ensure speed of retrieval, and the user is thus able to get pointers to potentially relevant information very fast. However, these systems usually offer only keyword-based searches. Some offer boolean searches and features such as proximity and adjacency operators. Since the retrieval engines are geared to maximizing recall, there is little or no attempt to intelligently filter the information spewed out at the user. The user has to scan a large number of potentially relevant items to get to the information that she is actually looking for. Thus, even among experienced users of IR systems, there is a high degree of frustration experienced in searching for information on the Web.

Many of the non-image documents available on the Web are natural language (NL) texts. Since they are available in machine-readable form, there is a lot of scope for trying out different NL techniques on these texts. However, there has not been much work in applying these techniques to tasks such as information retrieval. In this paper we describe an application which uses NL techniques to enhance retrieval. The system we describe is predicated on the fact that any coherent text contains significant latent information, such as syntactic structure and patterns of language use, which can be used to reduce an IR or Web user's information load.
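To make the keyword-based retrieval discussed in this section concrete, the following is a minimal sketch of an inverted index with a boolean AND query and a simple proximity operator. It is an illustration only and is not part of Glean: the function names (`build_index`, `search_and`, `search_near`), the whitespace tokenizer, the `window` parameter, and the three sample documents are all assumptions made for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased token to {doc_id: [token positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization; real systems do far more.
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def search_and(index, *terms):
    """Boolean AND: ids of documents that contain every term."""
    postings = [set(index.get(t.lower(), {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_near(index, a, b, window=3):
    """Proximity: documents where a and b occur within `window` tokens."""
    hits = set()
    for doc_id in search_and(index, a, b):
        pos_a = index[a.lower()][doc_id]
        pos_b = index[b.lower()][doc_id]
        if any(abs(i - j) <= window for i in pos_a for j in pos_b):
            hits.add(doc_id)
    return hits


# Hypothetical sample documents, invented for illustration.
docs = {
    1: "the board appointed a new chairman yesterday",
    2: "the chairman spoke at length before any official was appointed",
    3: "a new search engine was launched",
}
index = build_index(docs)

print(search_and(index, "appointed", "chairman"))
print(search_near(index, "appointed", "chairman", window=3))
```

In this toy collection the boolean query matches documents 1 and 2, although only document 1 reports an appointment of a chairman; the proximity operator narrows the result to document 1. This is the kind of surface-level matching that the text above argues leaves the user with many items to scan, and that motivates filtering with richer linguistic information.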

Cite this paper

@inproceedings{Chandrasekar1996GleaningIF, title={Gleaning information from the Web: Using Syntax to Filter out Irrelevant Information}, author={R. Chandrasekar and B. Srinivas}, year={1996} }