• Corpus ID: 235399978

# Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking

@inproceedings{Ma2021DynaboardAE,
title={Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking},
author={Zhiyi Ma and Kawin Ethayarajh and Tristan Thrush and Somya Jain and Ledell Yu Wu and Robin Jia and Christopher Potts and Adina Williams and Douwe Kiela},
booktitle={NeurIPS},
year={2021}
}
• Published in NeurIPS, 21 May 2021
• Computer Science
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. Under this paradigm, models are submitted to be evaluated in the cloud, circumventing the issues of reproducibility, accessibility, and backwards compatibility that often hinder benchmarking in NLP. This allows…
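The evaluation-as-a-service paradigm the abstract describes, in which the platform rather than the submitter runs the model on its hosted datasets, can be sketched in a few lines. This is a minimal illustration only: `EvalJob`, `run_evaluation`, and the toy datasets are hypothetical names, not the actual Dynaboard API.

```python
# Hypothetical sketch of evaluation-as-a-service: the platform holds the
# datasets and computes all metrics itself, so scores are reproducible and
# comparable across submissions. Not the real Dynaboard interface.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]  # a model maps an input text to a label

@dataclass
class EvalJob:
    """A submitted model; its author never scores it themselves."""
    model: Model
    results: Dict[str, float] = field(default_factory=dict)

# Held-out datasets live on the platform side.
DATASETS: Dict[str, List[Tuple[str, str]]] = {
    "sentiment-v1": [("great movie", "pos"), ("terrible plot", "neg")],
    "sentiment-v2": [("not bad at all", "pos"), ("not good", "neg")],
}

def accuracy(model: Model, data: List[Tuple[str, str]]) -> float:
    return sum(model(x) == y for x, y in data) / len(data)

def run_evaluation(job: EvalJob) -> EvalJob:
    """The platform runs every submission on every hosted dataset
    uniformly, sidestepping self-reported or single-dataset numbers."""
    for name, data in DATASETS.items():
        job.results[name] = accuracy(job.model, data)
    return job

# Example submission: a trivial keyword model.
toy_model: Model = lambda text: (
    "neg" if "terrible" in text or "not good" in text else "pos"
)
job = run_evaluation(EvalJob(model=toy_model))
```

The design point this sketch tries to capture is that evaluation logic and data never leave the platform, which is what makes results reproducible and backwards compatible as new datasets are added.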
## 21 Citations

### Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks

• Computer Science
ACL
• 2022
We introduce Dynatask: an open source system for setting up custom NLP tasks that aims to greatly lower the technical knowledge and effort required for hosting and evaluating state-of-the-art NLP…

### A global analysis of metrics used for measuring performance in natural language processing

• Computer Science
NLPPOWER
• 2022
The results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a model's performance, and ambiguities and inconsistencies in the reporting of metrics may lead to difficulties in interpreting and comparing model performances, impairing transparency and reproducibility in NLP research.

### Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information

• Computer Science
• 2021
The framework allows for the interpretability of different input attributes via transformations of the input, which is used to discover annotation artefacts in widely-used NLP benchmarks.

### Evaluating Inclusivity, Equity, and Accessibility of NLP Technology: A Case Study for Indian Languages

• Computer Science
ArXiv
• 2022
This paper proposes an evaluation paradigm that assesses NLP technologies across all three dimensions, hence quantifying the diversity of users they can serve, and calls upon the community to incorporate this evaluation paradigm when building linguistically diverse technologies.

### RealTime QA: What's the Answer Right Now?

• Computer Science
ArXiv
• 2022
We introduce REALTIME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). REALTIME QA inquires about the…

### How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory

• Computer Science
• 2022
This work identifies the implicit assumptions it makes about annotators and suggests improvements to the standard protocol to make it more theoretically sound, but even in its improved form, it cannot be used to evaluate open-ended tasks like story generation.

### Perturbation Augmentation for Fairer NLP

• Computer Science
ArXiv
• 2022
It is shown that language models pre-trained on demographically perturbed corpora are more fair, at least according to the current best metrics for measuring model fairness, and that improved fairness does not come at the expense of accuracy.

### Fantastic Data and How to Query Them

• Computer Science
• 2022
This paper presents a vision of a unified framework in which different datasets can be integrated and queried easily, e.g., using standard query languages, and demonstrates it in ongoing work to create such a framework for datasets in Computer Vision.

### Predicting Fine-Tuning Performance with Probing

• Computer Science
• 2022
This paper finds that the accuracies of only three probing results can be used to predict fine-tuning performance with errors 40%–80% smaller than baselines, showing the possibility of incorporating specialized probing datasets into the development of deep NLP models.

### Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold

• Computer Science
Findings of ACL
• 2022
Historical and recent examples of how the square one bias has led researchers to draw false conclusions or make unwise choices are provided; promising yet unexplored directions on the research manifold are pointed out, and practical recommendations are made to enable more multi-dimensional research.

## References

Showing 1–10 of 94 references

### Dynabench: Rethinking Benchmarking in NLP

• Computer Science
NAACL
• 2021
It is argued that Dynabench addresses a critical need in the community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios.

### On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

• Computer Science
FAccT
• 2021
Recommendations are provided, including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, and carrying out pre-development exercises that evaluate how the planned approach fits into research and development goals and supports stakeholder values.

### Robustness Gym: Unifying the NLP Evaluation Landscape

• Computer Science
NAACL
• 2021
Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks, is proposed.

### DAWNBench: An End-to-End Deep Learning Benchmark and Competition

• Computer Science
• 2017
DAWNBench is introduced, a benchmark and competition focused on end-to-end training time to achieve a state-of-the-art accuracy level, as well as inference with that accuracy, and will provide a useful, reproducible means of evaluating the many tradeoffs in deep learning systems.

### The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

• Computer Science
GEM
• 2021
GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics, is introduced, along with a description of the data for the 2021 shared task at the associated GEM Workshop.

### TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing

• Computer Science
ACL
• 2021
TextFlint is a multilingual robustness evaluation toolkit for NLP tasks that incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their…

### Utility is in the Eye of the User: A Critique of NLP Leaderboards

• Economics
EMNLP
• 2020
This opinion paper formalizes how leaderboards – in their current form – can be poor proxies for the NLP community at large and advocates for more transparency on leaderboards, such as the reporting of statistics that are of practical concern.

### DynaSent: A Dynamic Benchmark for Sentiment Analysis

• Computer Science
ACL
• 2021
DynaSent (‘Dynamic Sentiment’), a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis, is introduced, and the dataset creation effort is reported on, focusing on the steps taken to increase quality and reduce artifacts.

### ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

• Computer Science
ICLR
• 2020
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.