Learn More
We review a query log of hundreds of millions of queries that constitute the total query traffic for an entire week of a general-purpose commercial web search service. Previously, query logs have been studied from a single, cumulative view. In contrast, our analysis shows changes in popularity and uniqueness of topically categorized queries across the hours(More)
We integrate structured data and text using the unchanged, standard relational model. We started with the premise that a relational system could be used to implement an Information Retrieval (IR) system. After implementing a prototype to verify that premise, we then began to investigate the performance of a parallel relational database system for this(More)
<i>We present a method of searching text collections that takes advantage of hierarchrical information within documents and integrates searches of structured and unstructured data. We show that Multidimensional databases (MDB), designed for accessing data along hierarchical dimensions, are effective for information retrieval. We demonstrate a method of(More)
Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both(More)
The need for effective identity matching systems has led to extensive research in the area of name search. For the most part, such work has been limited to English and other Latin-based languages. Consequently, algorithms such as Soundex and n-gram matching are of limited utility for languages such as Arabic, which has a vastly different morphology that(More)
We present a new algorithm for duplicate document detection thatuses collection statistics. We compare our approach with thestate-of-the-art approach using multiple collections. Thesecollections include a 30 MB 18,577 web document collectiondeveloped by Excite@Home and three NIST collections. The first NISTcollection consists of 100 MB 18,232 LA-Times(More)
Prior efforts have shown that under certain situations, retrieval effectiveness may be improved via the use of data fusion techniques. Although these improvements have been observed from the fusion of result sets from several distinct information retrieval systems, it has often been thought that fusing different document retrieval strategies in a single(More)