Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements

Leandro von Werra, Lewis Tunstall, Abhishek Thakur, Alexandra Sasha Luccioni, Tristan Thrush, Aleksandra Piktus, Felix Marty, Nazneen Rajani, Victor Mustar, Helen Ngo, Omar Sanseviero, Mario Šaško, Albert Villanova, Quentin Lhoest, Julien Chaumond, Margaret Mitchell, Alexander M. Rush, Thomas Wolf, Douwe Kiela
Evaluation is a key part of machine learning (ML), yet there is a lack of support and tooling to enable its informed and systematic practice. We introduce Evaluate and Evaluation on the Hub, a set of tools to facilitate the evaluation of models and datasets in ML. Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models. Its goal is to support reproducibility of evaluation, centralize and document the evaluation process, and broaden…
1 Citation


Evaluation for Change

Evaluation is the central means for assessing, understanding, and communicating about NLP models. In this position paper, we argue that evaluation should be more than that: it is a force for driving…



Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking

We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models

Evaluation Gaps in Machine Learning Practice

The evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations are examined, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

GEM, a living benchmark for natural language generation (NLG), its evaluation, and metrics, is introduced, along with a description of the data for the 2021 shared task at the associated GEM Workshop.

A critical analysis of metrics used for measuring progress in artificial intelligence

The results suggest that the large majority of metrics currently used to evaluate classification AI benchmark tasks have properties that may result in an inadequate reflection of a classifier's performance, especially when used with imbalanced datasets.

A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication

This paper studies the practice of dataset deprecation in ML, identifies several cases of datasets that continued to circulate despite having been deprecated, and proposes a Dataset Deprecation Framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocols, and publication checks.

Datasets: A Community Library for Natural Language Processing

After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks.

Robustness Gym: Unifying the NLP Evaluation Landscape

Robustness Gym (RG) is proposed: a simple and extensible evaluation toolkit that unifies four standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks.

What Will it Take to Fix Benchmarking in Natural Language Understanding?

It is argued that most current benchmarks fail to meet these criteria, and that adversarially constructed, out-of-distribution test sets do not meaningfully address the causes of these failures.

TrecTools: an Open-source Python Library for Information Retrieval Practitioners Involved in TREC-like Campaigns

Widespread adoption of a centralised solution for developing, evaluating, and analysing TREC-like campaigns will ease the burden on organisers and provide participants and users with a standard environment for common IR experimental activities.

Deep Dominance - How to Properly Compare Deep Neural Models

The criteria for a high-quality comparison method between DNNs are defined, and it is shown that the proposed test meets all criteria while previously proposed methods fail to do so.