Information retrieval (IR) researchers commonly use three tests of statistical significance: Student's paired t-test, the Wilcoxon signed-rank test, and the sign test. Other researchers have previously proposed using both the bootstrap and Fisher's randomization (permutation) test as non-parametric significance tests for IR, but these tests have seen...
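As a rough illustration of the randomization (permutation) test mentioned above, the following sketch applies it to paired per-topic scores from two systems. It is a minimal Monte Carlo version under stated assumptions, not the procedure from the paper: the function name `paired_randomization_test`, the sign-flipping formulation, the mean difference as test statistic, and the default of 10,000 trials are all choices made here for illustration.

```python
import random
from statistics import mean


def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Approximate two-sided p-value for the mean difference of paired scores.

    Hypothetical helper: scores_a[i] and scores_b[i] are the scores of two
    systems on the same topic i (for example, average precision per topic).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired (same number of topics)")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the two system labels are exchangeable
        # on each topic, which is equivalent to flipping the sign of each
        # per-topic difference with probability 0.5.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(permuted)) >= observed:
            extreme += 1
    return extreme / trials


# Usage: twelve topics on which system A is consistently better.
p_value = paired_randomization_test([0.5] * 12, [0.2] * 12)
print(p_value)
```

Sampling random sign assignments rather than enumerating all 2^n of them keeps the cost fixed as the number of topics grows, at the price of an approximate p-value.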
Aggregated search is the task of incorporating results from different specialized search services, or verticals, into Web search results. While most prior work focuses on deciding which verticals to present, the task of deciding where in the Web results to embed the vertical results has received less attention. We propose a methodology for evaluating an...
Accurate estimation of information retrieval evaluation metrics such as average precision requires large sets of relevance judgments. Building sets large enough for evaluation of real-world implementations is at best inefficient, at worst infeasible. In this work we link evaluation with test collection construction to gain an understanding of the minimal...
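For readers unfamiliar with the metric named above, here is a minimal sketch of average precision in its standard form. It assumes complete binary relevance judgments and a ranking without duplicate documents, so it does not show the estimation from minimal judgment sets that the abstract is about. The function name and document identifiers are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average precision of a ranked list, given the set of relevant documents.

    Assumes binary, complete judgments: every document not in `relevant`
    is treated as non-relevant.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Usage: relevant documents at ranks 1 and 3 give (1/1 + 2/3) / 2.
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

Because the denominator is the total number of relevant documents, every unjudged relevant document changes the score, which is one reason the metric needs large judgment sets.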
Traditional models of information retrieval assume documents are independently relevant. But when the goal is retrieving diverse or novel information about a topic, retrieval models need to capture dependencies between documents. Such tasks require alternative evaluation and optimization methods that operate on different types of relevance judgments. We...
Information retrieval systems have traditionally been evaluated over absolute judgments of relevance: each document is judged for relevance on its own, independent of other documents that may be on topic. We hypothesize that preference judgments of the form "document A is more relevant than document B" are easier for assessors to make than absolute...
We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can...
We describe a pilot study using Amazon's Mechanical Turk to collect preference judgments between pairs of full-page layouts including both search results and image results. Specifically, we analyze the behavior of assessors who participated in our study to identify some patterns that may be broadly indicative of unreliable assessments. We believe this...
The standard system-based evaluation paradigm has focused on assessing the performance of retrieval systems in serving the best results for a single query. Real users, however, often begin an interaction with a search engine with a query so under-specified that they will need to reformulate it before they find what they are looking for. In this work...
There is great interest in producing effectiveness measures that model user behavior in order to better capture the utility of a system to its users. These measures are often formulated as a sum over the product of a discount function of ranks and a gain function mapping relevance assessments to numeric utility values. We develop a conceptual framework for...
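The "sum over the product of a discount function of ranks and a gain function" described above can be sketched as follows. The function names are hypothetical, and the gain and discount shown are those of discounted cumulative gain (DCG) in its common exponential-gain form, used here only as one well-known instance of the family. This is not the conceptual framework the abstract goes on to propose.

```python
import math
from typing import Callable, Sequence


def discounted_gain_measure(
    relevance_grades: Sequence[int],
    gain: Callable[[int], float],
    discount: Callable[[int], float],
) -> float:
    """Sum over ranks k of discount(k) * gain(grade of the document at rank k)."""
    return sum(
        discount(rank) * gain(grade)
        for rank, grade in enumerate(relevance_grades, start=1)
    )


def dcg_gain(grade: int) -> float:
    # Exponential gain: higher relevance grades are rewarded disproportionately.
    return 2.0 ** grade - 1.0


def dcg_discount(rank: int) -> float:
    # Logarithmic discount: lower-ranked documents contribute less.
    return 1.0 / math.log2(rank + 1)


# Usage: grades of the documents at ranks 1, 2 and 3.
print(discounted_gain_measure([3, 0, 1], dcg_gain, dcg_discount))
```

Passing the gain and discount as separate functions mirrors the formulation in the text: swapping either one yields a different measure without touching the summation.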