Crowdsourcing for relevance evaluation

@article{Alonso2008CrowdsourcingFR,
  title={Crowdsourcing for relevance evaluation},
  author={Omar Alonso and Daniel E. Rose and Benjamin Stewart},
  journal={SIGIR Forum},
  year={2008},
  volume={42},
  pages={9-15}
}
Relevance evaluation is an essential part of the development and maintenance of information retrieval systems. Yet traditional evaluation approaches have several limitations; in particular, conducting new editorial evaluations of a search system can be very expensive. We describe a new approach to evaluation called TERC, based on the crowdsourcing paradigm, in which many online users, drawn from a large community, each performs a small evaluation task. 
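
As a rough illustration of this evaluation paradigm (a generic sketch under assumed data, not the paper's actual TERC workflow), the snippet below aggregates redundant crowd judgments for each query-document pair by majority vote and measures agreement with hypothetical expert labels.

```python
# Minimal sketch (not the paper's TERC implementation): aggregate redundant
# crowd relevance judgments by majority vote and compare them to expert labels.
from collections import Counter

# Hypothetical data: each (query, doc) pair has several worker judgments
# (1 = relevant, 0 = not relevant) plus one expert "gold" judgment.
crowd_judgments = {
    ("q1", "d1"): [1, 1, 0, 1, 1],
    ("q1", "d2"): [0, 0, 1, 0, 0],
    ("q2", "d3"): [1, 0, 1, 1, 0],
}
expert_judgments = {("q1", "d1"): 1, ("q1", "d2"): 0, ("q2", "d3"): 0}

def majority_vote(labels):
    """Return the most common label; ties are broken in favour of the smaller label."""
    counts = Counter(labels)
    return max(sorted(counts), key=counts.get)

agreed = sum(
    int(majority_vote(labels) == expert_judgments[pair])
    for pair, labels in crowd_judgments.items()
)
print(f"crowd/expert agreement: {agreed / len(crowd_judgments):.2f}")
```

In practice one would typically also weight workers by estimated reliability or collect additional judgments where the vote is split, but a majority-vote baseline is enough to show the shape of the data.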

Citations

Obtaining High-Quality Relevance Judgments Using Crowdsourcing
TLDR
The authors evaluate their approach by comparing the consistency of crowdsourced ground truth to that obtained from expert annotators and conclude that crowdsourcing can match the quality obtained from the latter.
Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking
TLDR
It is found that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation Experimentation through Crowdsourcing: A Review
TLDR
Different factors that influence the accuracy of relevance judgments made by workers, and ways to improve the reliability of judgments in crowdsourcing experiments, are explored.
Managing the Quality of Large-Scale Crowdsourcing
TLDR
It is concluded that crowdsourcing can be used as a feasible alternative to expert annotations, based on the estimated proportions of correctly judged query-document pairs in the crowdsourced relevance judgments and previous TREC qrels.
On the Evaluation of the Quality of Relevance Assessments Collected through Crowdsourcing
TLDR
The components that could be used to measure the quality of the collected relevance assessments are discussed and recommendations are based on experiments with collecting relevance assessments for digitized books, conducted as part of the INEX Book Track in 2008.
Crowdsourcing Document Relevance Assessment with Mechanical Turk
TLDR
While results are largely inconclusive, the authors identify important obstacles encountered, lessons learned, related work, and interesting ideas for future investigation.
Real-time quality control for crowdsourcing relevance evaluation
TLDR
A real-time strategy for recruiting workers and monitoring the quality of their relevance and rank judgments is demonstrated and verified by empirical results (a generic quality-monitoring sketch appears after this list).
Exploring relevance assessment using crowdsourcing for faceted and ambiguous queries
TLDR
It is shown that the type of query used influences the agreement between assessors, the system performance measure, and the consistency of the system ranking when the two sets of relevance judgments are used to score the systems.
An Analysis of Crowdsourcing Relevance Assessments in Spanish
TLDR
The results of a series of experiments using the Spanish part of CLEF are presented, demonstrating that crowdsourcing platforms do work for languages other than English.
Exploiting Document Content for Efficient Aggregation of Crowdsourcing Votes
TLDR
Inspired by the clustering hypothesis of information retrieval, crowd-generated relevance judgments are propagated to similar documents, effectively smoothing the distribution of relevance labels across the similarity space (a simplified sketch of this smoothing idea follows below).
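
The entry above describes smoothing crowd-generated relevance labels across a document similarity space. The sketch below illustrates that general idea only; the vote counts, pairwise similarities, and the blending weight alpha are all assumptions, not the paper's actual aggregation model.

```python
# Minimal sketch of propagating crowd votes to similar documents, in the spirit
# of the clustering hypothesis. All inputs below are hypothetical.

# Crowd vote counts per document: (relevant, not relevant).
votes = {"d1": (4, 1), "d2": (1, 0), "d3": (0, 3)}

# Pairwise document similarities in [0, 1] (assumed to be precomputed).
similarity = {("d1", "d2"): 0.8, ("d1", "d3"): 0.1, ("d2", "d3"): 0.2}

def sim(a, b):
    if a == b:
        return 1.0
    return similarity.get((a, b), similarity.get((b, a), 0.0))

def vote_ratio(doc):
    relevant, non_relevant = votes[doc]
    return relevant / (relevant + non_relevant)

def smoothed_relevance(doc, alpha=0.5):
    """Blend a document's own vote ratio with a similarity-weighted average of
    the other documents' ratios; alpha is the weight given to the document's own votes."""
    neighbours = [d for d in votes if d != doc]
    weights = [sim(doc, d) for d in neighbours]
    neighbour_avg = sum(w * vote_ratio(d) for w, d in zip(weights, neighbours)) / (sum(weights) or 1.0)
    return alpha * vote_ratio(doc) + (1 - alpha) * neighbour_avg

for d in votes:
    print(d, round(smoothed_relevance(d), 2))
```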
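
The "Real-time quality control for crowdsourcing relevance evaluation" entry mentions monitoring worker quality. One common mechanism, assumed here purely for illustration rather than taken from that paper, is to seed tasks with gold questions whose answers are known in advance and to filter out workers whose accuracy on them falls below a threshold.

```python
# Minimal sketch of gold-question quality control (an assumed mechanism, not
# necessarily the one used in the paper referenced above).

GOLD_ANSWERS = {"g1": 1, "g2": 0, "g3": 1}   # hypothetical known judgments
ACCURACY_THRESHOLD = 0.7

def trustworthy(worker_answers, gold=GOLD_ANSWERS, threshold=ACCURACY_THRESHOLD):
    """Return True if the worker's accuracy on gold questions meets the threshold."""
    answered = {q: a for q, a in worker_answers.items() if q in gold}
    if not answered:
        return False  # no gold evidence yet; treat the worker as untrusted
    correct = sum(int(a == gold[q]) for q, a in answered.items())
    return correct / len(answered) >= threshold

# worker_a answers all three gold questions correctly, worker_b only one.
worker_a = {"g1": 1, "g2": 0, "g3": 1, "q7-d2": 1}
worker_b = {"g1": 0, "g2": 1, "g3": 1, "q7-d2": 0}
print(trustworthy(worker_a), trustworthy(worker_b))  # True False
```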

References

Showing 1-10 of 44 references
Crowdsourcing Assessments for XML Ranked Retrieval
TLDR
This paper shows through a series of experiments on INEX data that crowdsourcing can be a good alternative for relevance assessment in the context of XML retrieval.
On the Evaluation of the Quality of Relevance Assessments Collected through Crowdsourcing
TLDR
The components that could be used to measure the quality of the collected relevance assessments are discussed and recommendations are based on experiments with collecting relevance assessments for digitized books, conducted as part of the INEX Book Track in 2008.
Crowdsourcing, attention and productivity
We show through an analysis of a massive data set from YouTube that the productivity exhibited in crowdsourcing exhibits a strong positive dependence on attention, measured by the number of …
Why Is Web Search So Hard... to Evaluate?
Web search has several important characteristics that distinguish it from traditional information retrieval: the often adversarial relationship between content creators and search engine designers, …
Towards methods for the collective gathering and quality control of relevance assessments
TLDR
This work proposes a method for the collective gathering of relevance assessments using a social game model to instigate participants' engagement and shows that the proposed game design achieves two designated goals: the incentive structure motivates endurance in assessors and the review process encourages truthful assessment.
Crowdsourcing user studies with Mechanical Turk
TLDR
Although micro-task markets have great potential for rapidly collecting user measurements at low costs, it is found that special care is needed in formulating tasks in order to harness the capabilities of the approach.
Search Engines that Learn from Implicit Feedback
TLDR
A search engine can use training data extracted from the logs to automatically tailor ranking functions to a particular user group or collection, and machine-learning techniques can be harnessed to improve search quality (a minimal click-preference sketch appears after this list).
Improving Search Results Quality by Customizing Summary Lengths
TLDR
Empirical evidence is presented that judges can predict appropriate search result summary lengths, and that perceptions of search result quality can be affected by varying these result lengths.
Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria
TLDR
An empirical study is conducted to examine the effect of noisy annotations on the performance of sentiment classification models and to evaluate the utility of annotation selection for classification accuracy and efficiency.
A Language Modeling Approach for Temporal Information Needs
TLDR
This work addresses information needs that have a temporal dimension conveyed by a temporal expression in the user’s query by integrating temporal expressions into a language modeling approach, thus making them first-class citizens of the retrieval model and considering their inherent uncertainty.
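
The "Search Engines that Learn from Implicit Feedback" entry refers to training ranking functions from log data. Below is a minimal sketch of one widely used way to turn clicks into training pairs: treating a clicked result as preferred over unclicked results ranked above it. This heuristic is used here as an assumed illustration, not as that entry's exact method.

```python
# Minimal sketch: derive pairwise ranking preferences from a click log using the
# "clicked result is preferred over results skipped above it" heuristic.

def preference_pairs(ranking, clicked):
    """Return (preferred_doc, less_preferred_doc) pairs: each clicked document is
    preferred over every unclicked document ranked above it."""
    clicked = set(clicked)
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            pairs.extend((doc, above) for above in ranking[:pos] if above not in clicked)
    return pairs

# Example: the user skipped d1 and d3 but clicked d2 and d4.
print(preference_pairs(["d1", "d2", "d3", "d4"], clicked=["d2", "d4"]))
# [('d2', 'd1'), ('d4', 'd1'), ('d4', 'd3')]
```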