Learn More
Information retrieval (IR) researchers commonly use three tests of statistical significance: the Student's paired t-test, the Wilcoxon signed rank test, and the sign test. Other researchers have previously proposed using both the bootstrap and Fisher's randomization (permutation) test as non-parametric significance tests for IR but these tests have seen(More)
Aggregated search is the task of incorporating results from different specialized search services, or verticals, into Web search results. While most prior work focuses on deciding which verticals to present, the task of deciding where in the Web results to embed the vertical results has received less attention. We propose a methodology for evaluating an(More)
Accurate estimation of information retrieval evaluation metrics such as average precision require large sets of relevance judgments. Building sets large enough for evaluation of real-world implementations is at best inefficient, at worst infeasible. In this work we link evaluation with test collection construction to gain an understanding of the minimal(More)
Traditional models of information retrieval assume documents are independently relevant. But when the goal is retrieving diverse or novel information about a topic, retrieval models need to capture dependencies between documents. Such tasks require alternative evaluation and optimization methods that operate on different types of relevance judgments. We(More)
The standard system-based evaluation paradigm has focused on assessing the performance of retrieval systems in serving the best results for a single query. Real users, however, often begin an interaction with a search engine with a sufficiently under-specified query that they will need to reformulate before they find what they are looking for. In this work(More)
We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can(More)