TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval

Sophia Althammer, Sebastian Hofstätter, Suzan Verberne, Allan Hanbury
Proceedings of the 31st ACM International Conference on Information & Knowledge Management

Robust test collections are crucial for Information Retrieval research. Recently, there has been growing interest in evaluating retrieval systems for domain-specific retrieval tasks; however, these tasks often lack a reliable test collection with human-annotated relevance assessments following the Cranfield paradigm. In the medical domain, the TripClick collection was recently proposed, which contains click log data from the Trip search engine and includes two click-based test sets. However the…

TripClick: The Log Files of a Large Health Web Search Engine

Presents a large-scale domain-specific dataset of click logs, obtained from user interactions with the Trip Database health web search engine, and shows that the best-performing neural IR model outperforms classical IR models by a large margin.

Can Old TREC Collections Reliably Evaluate Modern Neural Retrieval Models?

To test the reusability claim, TREC assessors were asked to judge new pools created from new search results for the TREC-8 ad hoc collection, which appears to have stood the test of time and remains a reliable evaluation instrument as retrieval techniques have advanced.

Variations in relevance judgments and the measurement of retrieval effectiveness

Very high correlations were found among the rankings of systems produced using different relevance judgment sets, indicating that the comparative evaluation of retrieval performance is stable despite substantial differences in relevance judgments, and thus reaffirm the use of the TREC collections as laboratory tools.

Comparative analysis of clicks and judgments for IR evaluation

This paper compares a traditional test collection with manual judgments to transaction-log-based test collections, which use queries as topics and subsequent clicks as pseudo-relevance judgments for the clicked results.

Fine-Grained Relevance Annotations for Multi-Task Document Ranking and Question Answering

This work extends the ranked retrieval annotations of the Deep Learning track of TREC 2019 with passage and word level graded relevance annotations for all relevant documents, and presents FiRA: a novel dataset of Fine-Grained Relevance Annotations.

Bias and the limits of pooling for large collections

It is shown that the judgment sets produced by traditional pooling when the pools are too small relative to the total document set size can be biased in that they favor relevant documents that contain topic title words.

Cross-domain Retrieval in the Legal and Patent Domains: a Reproducibility Study

It is found that the transfer of the BERT-PLI model at the paragraph level leads to comparable results between both domains, as well as first promising results for the cross-domain transfer at the document level.

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

This work extensively analyzes different retrieval models and provides several suggestions that it believes may be useful for future work, finding that performing well consistently across all datasets is challenging.

Establishing Strong Baselines for TripClick Health Retrieval

It is shown that dense retrieval outperforms BM25 by considerable margins, even with simple training procedures, and the impact of different domain-specific pre-trained models on TripClick is studied.

On Building Fair and Reusable Test Collections using Bandit Techniques

Analysis demonstrates that the greedy approach common to most bandit methods can be unfair even to the runs participating in the collection-building process when the judgment budget is small relative to the (unknown) number of relevant documents.