Web search engines discover indexable documents by recursively 'crawling' from a seed URL. Their rankings take link popularity into account. While this works well, it introduces a bias towards older documents. Older documents are more likely to be the target of links, while new documents with few or no incoming links are unlikely to rank highly in search ...
When a searcher submits a query Q and clicks on document R in the corresponding result set, we may plausibly interpret the click as a vote that Q is a description of R. We call the Q and R pairing a 'click description'. Click descriptions thus derived from search engine logs can be accumulated into surrogate documents and used to boost retrieval ...
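The accumulation step described above can be sketched as follows. This is a minimal illustration only: the log format (a list of query and clicked-document pairs), the function name `build_surrogates`, and the sample data are all assumptions, not taken from the paper. How the surrogate documents are then used to boost retrieval is not shown, because the abstract is truncated at that point.

```python
from collections import defaultdict


def build_surrogates(click_log):
    """Group click descriptions (Q, R) by clicked document R.

    Returns a dict mapping each document id to a surrogate document:
    the text of every query that led to a click on that document,
    joined in log order.
    """
    queries_by_doc = defaultdict(list)
    for query, clicked_doc in click_log:
        queries_by_doc[clicked_doc].append(query)
    return {doc: " ".join(queries) for doc, queries in queries_by_doc.items()}


# Hypothetical click log: (query Q, clicked document R) pairs.
click_log = [
    ("cheap flights", "doc_a"),
    ("flight deals", "doc_a"),
    ("weather radar", "doc_b"),
]

surrogates = build_surrogates(click_log)
```

Repeated queries are kept rather than deduplicated, so a query clicked many times contributes more text to the surrogate; that is a design choice of this sketch, not a claim about the original method.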
In real-world use of test collection methods, it is essential that the query test set be representative of the workload expected in the actual application. Using a random sample of queries from a media company's query log as a 'gold standard' test set, we demonstrate that biases in sitemap-derived and top-n query sets can lead to significant ...
Web pages contain both unique text, which we should include in indexes, and template text, such as navigation strips and copyright notices, which we may want to discard. While algorithms exist for removing template text, most rely on first completing a crawl and then parsing each page. We present a cheap and efficient algorithm which does not parse HTML and ...
Tuning a search facility such as a Web search engine, or an enterprise search tool deployed in a particular organisation, is an economically important activity. Intuitively, an important end goal of tuning should be to maximise satisfaction across the searchers who will use the facility. Tuning should therefore use an unbiased sample of actual search ...
Tags and emergent folksonomies are a potentially rich new source of document annotations, offering query-independent and query-dependent evidence for exploitation by information retrieval systems. Previous research has shown that tags may facilitate improved web search in an environment where each tagging action generates a (user, tag, resource) triple. For ...