Engineering a multi-purpose test collection for Web retrieval experiments


Past research into text retrieval methods for the Web has been restricted by the lack of a test collection capable of supporting experiments that are both realistic and reproducible. The 1.69-million-document WT10g collection is proposed as a multi-purpose testbed for experiments with these attributes, in distributed IR, hyperlink algorithms and conventional ad hoc retrieval. WT10g was constructed by selecting from a superset of documents in such a way that desirable corpus properties were preserved or optimised. These properties include: a high degree of inter-server connectivity, integrity of server holdings, inclusion of documents related to a very wide spread of likely queries, and a realistic distribution of server holding sizes. We confirm that WT10g contains exploitable link information using a site (homepage) finding experiment. Our results show that, on this task, Okapi BM25 works better on propagated link anchor text than on full text. WT10g was used in TREC-9 and TREC-2001, and both topic-relevance and homepage-finding queries and judgments are available. © 2003 Elsevier Ltd. All rights reserved.
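To make the phrase "propagated link anchor text" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each target page is represented by a surrogate document built from the anchor text of the links pointing at it, and that these surrogates are ranked with a common BM25 variant. The URLs, the whitespace tokenisation, the function names `build_anchor_docs` and `bm25_rank`, and the parameter defaults `k1=1.2` and `b=0.75` are all illustrative assumptions; the Okapi configuration used in the paper is not given in the abstract.

```python
import math
from collections import Counter, defaultdict


def build_anchor_docs(links):
    """Propagate anchor text: each target URL gets a surrogate document
    made of the anchor text of every link that points at it."""
    anchor_docs = defaultdict(list)
    for _source_url, target_url, anchor_text in links:
        anchor_docs[target_url].extend(anchor_text.lower().split())
    return dict(anchor_docs)


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Rank docs (mapping id -> token list) with a common BM25 variant.
    The idf uses log(1 + ...) so it stays positive on tiny collections."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy link graph (hypothetical URLs): (source page, target page, anchor text).
links = [
    ("http://a.example/links.html", "http://acme.example/", "Acme Widgets"),
    ("http://b.example/", "http://acme.example/", "acme widgets homepage"),
    ("http://acme.example/", "http://acme.example/products.html", "products"),
    ("http://b.example/", "http://other.example/", "other company"),
]
anchor_docs = build_anchor_docs(links)
ranking = bm25_rank("acme widgets", anchor_docs)
```

In this toy graph the homepage receives descriptive anchor text from two other sites, so a site-name query matches its surrogate document even though the query never looks at the page's own text. This is only an illustration of why anchor text can help homepage finding, not a reproduction of the paper's experiment.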

DOI: 10.1016/S0306-4573(02)00084-5


Cite this paper

@article{Bailey2003EngineeringAM,
  title={Engineering a multi-purpose test collection for Web retrieval experiments},
  author={Peter Bailey and Nick Craswell and David Hawking},
  journal={Inf. Process. Manage.},
  year={2003},
  volume={39},
  pages={853-871}
}