We review a query log of hundreds of millions of queries that constitute the total query traffic for an entire week of a general-purpose commercial web search service. Previously, query logs have been studied from a single, cumulative view. In contrast, our analysis shows changes in popularity and uniqueness of topically categorized queries across the hours...
We present a method of searching text collections that takes advantage of hierarchical information within documents and integrates searches of structured and unstructured data. We show that multidimensional databases (MDBs), designed for accessing data along hierarchical dimensions, are effective for information retrieval. We demonstrate a method of using...
Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both...
Prior efforts have shown that under certain situations, retrieval effectiveness may be improved via the use of data fusion techniques. Although these improvements have been observed from the fusion of result sets from several distinct information retrieval systems, it has often been thought that fusing different document retrieval strategies in a single...
Accurate topical categorization of user queries allows for increased effectiveness, efficiency, and revenue potential in general-purpose web search systems. Such categorization becomes critical if the system is to return results not just from a general web collection but from topic-specific databases as well. Maintaining sufficient categorization recall is...
Since the use of relevance feedback in information retrieval to improve precision and recall was first proposed in the late 1960s, many different techniques have been used to improve the results obtained from relevance feedback. Since most information retrieval systems performing relevance feedback use combinations of several techniques, the individual...
We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach using multiple collections. These collections include a 30 MB collection of 18,577 web documents developed by Excite@Home and three NIST collections. The first NIST collection consists of 100 MB 18,232 LA-Times...
We integrate structured data and text using the unchanged, standard relational model. We started with the premise that a relational system could be used to implement an Information Retrieval (IR) system. After implementing a prototype to verify that premise, we then began to investigate the performance of a parallel relational database system for this...