Streamlining Evaluation with ir-measures

Sean MacAvaney, Craig MacDonald, Iadh Ounis
We present ir-measures, a new tool that makes it convenient to calculate a diverse set of evaluation measures used in information retrieval. Rather than implementing its own measure calculations, ir-measures provides a common interface to a handful of evaluation tools. The necessary tools are automatically invoked (potentially multiple times) to calculate all the desired metrics, simplifying the evaluation process for the user. The tool also makes it easier for researchers to use recently… 
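The dispatch idea in the abstract (one front end, several backend tools, each invoked only as often as needed) can be illustrated with a minimal sketch. Everything here is hypothetical: `Provider`, `calc_all`, the measure names, and the two toy measure functions are assumptions made for illustration and are not ir-measures' actual API or internals.

```python
from typing import Callable, Dict, List

# Hypothetical data shapes for illustration; not taken from ir-measures.
Judged = Dict[str, int]          # doc_id -> relevance grade for one query
Qrels = Dict[str, Judged]        # query_id -> judgments
Run = Dict[str, List[str]]       # query_id -> doc_ids in ranked order
MeasureFn = Callable[[Judged, List[str]], float]


def precision_at_5(judged: Judged, ranking: List[str]) -> float:
    """Fraction of the top 5 results with a positive relevance grade."""
    return sum(1 for d in ranking[:5] if judged.get(d, 0) > 0) / 5


def reciprocal_rank(judged: Judged, ranking: List[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, d in enumerate(ranking, start=1):
        if judged.get(d, 0) > 0:
            return 1.0 / rank
    return 0.0


class Provider:
    """Stand-in for one backend evaluation tool with a fixed set of measures."""

    def __init__(self, name: str, measures: Dict[str, MeasureFn]):
        self.name = name
        self.measures = measures
        self.calls = 0  # counts invocations, to show each tool runs once

    def evaluate(self, names: List[str], qrels: Qrels, run: Run) -> Dict[str, float]:
        self.calls += 1
        results = {}
        for name in names:
            fn = self.measures[name]
            scores = [fn(qrels[q], run.get(q, [])) for q in qrels]
            results[name] = sum(scores) / len(scores) if scores else 0.0
        return results


def calc_all(requested: List[str], providers: List[Provider],
             qrels: Qrels, run: Run) -> Dict[str, float]:
    """Assign each measure to the first provider that supports it,
    then invoke every needed provider once with its batch of measures."""
    plan = {}
    for name in requested:
        provider = next((p for p in providers if name in p.measures), None)
        if provider is None:
            raise ValueError(f"no provider supports {name}")
        plan.setdefault(provider.name, (provider, []))[1].append(name)
    results = {}
    for provider, names in plan.values():
        results.update(provider.evaluate(names, qrels, run))
    return results


qrels = {"q1": {"d1": 1, "d2": 0, "d3": 1}}
run = {"q1": ["d2", "d1", "d3"]}
tool_a = Provider("tool_a", {"P@5": precision_at_5})
tool_b = Provider("tool_b", {"RR": reciprocal_rank})
scores = calc_all(["P@5", "RR"], [tool_a, tool_b], qrels, run)
```

The caller names only the measures it wants; which tool computes each one is decided by the planner, which is the convenience the abstract describes.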
CODEC: Complex Document and Entity Collection
Overall, CODEC provides challenging research topics to support the development and evaluation of entity-centric search methods and shows significant gains in document ranking, demonstrating the resource’s value for evaluating and improving entity-oriented search.
TARexp: A Python Framework for Technology-Assisted Review Experiments
Key characteristics of this framework are declarative representations of workflows and experiment plans, the ability for components to play variable numbers of workflow roles, and state maintenance and restart capabilities.


INST: An Adaptive Metric for Information Retrieval Evaluation
The result is a specification for a program, inst_eval, for use in TREC-style IR experimentation, together with a number of pragmatic issues that need to be taken into account when writing an implementation.
cwl_eval: An Evaluation Tool for Information Retrieval
The cwl_eval architecture is described, which unifies many metrics typically used to evaluate information retrieval systems using test collections and promotes a standardised approach to evaluating search effectiveness.
Models and metrics: IR evaluation as a user process
This work explores the linkage between models and metrics, considering a range of effectiveness metrics, and the user search behavior that each of them implies, and examines more complex user models, as a guide to the development of new effectiveness metrics.
Novelty and diversity in information retrieval evaluation
This paper develops a framework for evaluation that systematically rewards novelty and diversity, turns that framework into a specific evaluation measure based on cumulative gain, and demonstrates the feasibility of this approach using a test collection based on the TREC question answering track.
Retrieval evaluation with incomplete information
It is shown that current evaluation measures are not robust to substantially incomplete relevance judgments, and a new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets.
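The measure introduced in that paper is commonly known as bpref. A minimal sketch follows, assuming the simple form in which each retrieved relevant document is penalised by the number of judged non-relevant documents ranked above it (capped at R, the number of judged relevant documents) and unjudged documents are ignored. The function name and data layout are hypothetical, and existing tools may normalise slightly differently.

```python
from typing import Dict, List


def bpref(judged: Dict[str, int], ranking: List[str]) -> float:
    """bpref for one query.

    judged  : doc_id -> relevance grade (0 = judged non-relevant)
    ranking : doc_ids in ranked order; unjudged documents are skipped
    """
    num_rel = sum(1 for grade in judged.values() if grade > 0)
    if num_rel == 0:
        return 0.0
    nonrel_seen = 0  # judged non-relevant documents encountered so far
    total = 0.0
    for doc in ranking:
        if doc not in judged:
            continue  # unjudged documents neither help nor hurt
        if judged[doc] > 0:
            total += 1.0 - min(nonrel_seen, num_rel) / num_rel
        else:
            nonrel_seen += 1
    return total / num_rel


judged = {"a": 1, "b": 1, "x": 0, "y": 0}
with_unjudged = bpref(judged, ["x", "a", "u", "b", "y"])
without_unjudged = bpref(judged, ["x", "a", "b", "y"])
```

Because the unjudged document `u` is skipped, both rankings receive the same score, which is the robustness property the summary refers to.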
Cumulated gain-based evaluation of IR techniques
This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.
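As an illustration of gain accumulated down a ranking, here is a minimal sketch of discounted cumulated gain and its normalised form. It assumes the widely used log2(rank + 1) discount rather than the article's own formulation, which uses a configurable logarithm base and applies no discount at ranks below that base. The function names and data layout are hypothetical.

```python
import math
from typing import Dict, List


def dcg(gains: List[float], k: int) -> float:
    """Sum of gains over the top k ranks, each divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg(judged: Dict[str, int], ranking: List[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ordering.

    judged  : doc_id -> graded relevance (unjudged documents count as 0)
    ranking : doc_ids in ranked order
    """
    gains = [judged.get(doc, 0) for doc in ranking]
    ideal_gains = sorted(judged.values(), reverse=True)
    ideal = dcg(ideal_gains, k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


judged = {"d1": 3, "d2": 2, "d3": 0}
best = ndcg(judged, ["d1", "d2", "d3"], k=3)
worst = ndcg(judged, ["d3", "d2", "d1"], k=3)
```

With graded judgments, placing the highly relevant `d1` first scores higher than placing it last, which reflects the credit for retrieving highly relevant documents early that the summary mentions.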
Evaluation Issues in Information Retrieval
D. Harman. Inf. Process. Manag., 1992.
TrecTools: an Open-source Python Library for Information Retrieval Practitioners Involved in TREC-like Campaigns
Widespread adoption of a centralised solution for developing, evaluating, and analysing TREC-like campaigns will ease the burden on organisers and provide participants and users with a standard environment for common IR experimental activities.
An Effectiveness Measure for Ambiguous and Underspecified Queries
A new measure of novelty and diversity for information retrieval evaluation is proposed in an attempt to achieve a balance between the complexity of genuine user needs and the simplicity required for feasible evaluation.
DiffIR: Exploring Differences in Ranking Models' Behavior
DiffIR is a new open-source web tool to assist with qualitative ranking analysis by visually 'diffing' system rankings at the individual result level for queries where behavior significantly diverges.