On the Relation Between Assessor's Agreement and Accuracy in Gamified Relevance Assessment

Olga Megorskaya, Vladimir Kukushkin, Pavel Serdyukov
Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
Expert judgments (labels) are widely used in Information Retrieval for the purposes of search quality evaluation and machine learning. Setting up the process of collecting such judgments is a challenge of its own, and the maintenance of judgment quality is an extremely important part of the process. One possible way of controlling quality is to monitor the inter-assessor agreement level. But does the agreement level really reflect the quality of an assessor's judgments? Indeed, if a group…
The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students
A case study on the inter-assessor disagreements in the English NTCIR-13 We Want Web (WWW) collection suggests that a high-agreement topic set is more useful for concrete research conclusions than a low-agreement one.
A Gamified Approach to Relevance Judgement
Experiments on the TREC-8 ad-hoc task with the objective of reproducing the existing relevance assessments demonstrate that gamified assessments, when used to evaluate the official submissions to TREC-8, show fair correlation with the official assessments (depth-100 pooling).
Unanimity-Aware Gain for Highly Subjective Assessments
The results show that incorporating unanimity can impact statistical significance test results even when its impact on the gain value is kept to a minimum, and if researchers accept that unanimous votes should be valued more highly than controversial ones, then the proposed approach may be worth incorporating.
A Short Survey on Online and Offline Methods for Search Quality Evaluation
Evaluation plays a key role in the field of information retrieval as researchers and practitioners develop models to explain the relation between an information need expressed by a person and information contained in available resources, and test these models by comparing their outcomes to collections of observations.
Computational Social Indicators: A Case Study of Chinese University Ranking
A novel graph-based multi-channel ranking scheme for social indicator computation by exploring the rich multi-channel Web data and using a unified model to learn the cluster-wise common spaces, perform ranking separately upon each space, and fuse these rankings to produce the final one.
Information Retrieval
The course is focused on one of the most popular topics in network science: detection of communities in networks, with special attention to the optimization of global quality functions, like Newman-Girvan modularity, and to their limits.
A Systematic Literature Review: Information Accuracy Practices in Tourism
The SLR findings revealed that the existing research on information accuracy in the tourism context was not sufficiently practiced by the tourism information providers and that there was still room for improvement.
Increasing Engagement with the Library via Gamification
A preliminary analysis of a university library system that aims to trigger users' extrinsic motivation to increase their interaction with the system suggests that different user groups react in different ways to such 'gamified' systems.


User intent and assessor disagreement in web search evaluation
The relationship between assessor disagreement and various click-based measures, such as click preference strength and user intent similarity, is examined for judgments collected from editorial judges and crowd workers using single absolute, pairwise absolute and pairwise preference-based judging methods.
Relevance assessment: are judges exchangeable and does it matter
It appears that test collections are not completely robust to changes of judge when these judges vary widely in task and topic expertise, and both system scores and system rankings are subject to consistent but small differences across the three assessment sets.
An analysis of systematic judging errors in information retrieval
This paper examines such effects in a web search setting, comparing the judgments of four groups of judges: NIST Web Track judges, untrained crowd workers and two groups of trained judges of a commercial search engine.
Towards methods for the collective gathering and quality control of relevance assessments
This work proposes a method for the collective gathering of relevance assessments using a social game model to instigate participants' engagement and shows that the proposed game design achieves two designated goals: the incentive structure motivates endurance in assessors and the review process encourages truthful assessment.
Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking
It is found that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
Design and Implementation of Relevance Assessments Using Crowdsourcing
This work explores the design and execution of relevance judgments using Amazon Mechanical Turk as a crowdsourcing platform, introducing a methodology for crowdsourcing relevance assessments and the results of a series of experiments using TREC 8 with a fixed budget.
Crowd IQ: measuring the intelligence of crowdsourcing platforms
It is shown that crowds composed of workers of high reputation achieve higher performance than low reputation crowds, and the effect of the amount of payment is non-monotone: both paying too much and paying too little affect performance.
Inter-Rater Reliability: Dependency on Trait Prevalence and Marginal Homogeneity
Researchers have criticized chance-corrected agreement statistics, particularly the Kappa statistic, as being very sensitive to raters' classification probabilities (marginal probabilities) and to trait prevalence in
Sample Size Calculations: Practical Methods for Engineers and Scientists
Sample Size Calculations: Practical Methods for Engineers and Scientists by Paul Mathews and Christina M. Mastrangelo.
Inter-rater reliability: dependency on trait prevalence and marginal homogeneity. Statistical Methods for Inter-Rater Reliability Assessment Series, 2002