Corpus ID: 235399978

Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking

Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Yu Wu, Robin Jia, Christopher Potts, Adina Williams, Douwe Kiela
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. Under this paradigm, models are submitted to be evaluated in the cloud, circumventing the issues of reproducibility, accessibility, and backwards compatibility that often hinder benchmarking in NLP. This allows…


Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks

We introduce Dynatask: an open-source system for setting up custom NLP tasks that aims to greatly lower the technical knowledge and effort required for hosting and evaluating state-of-the-art NLP models.

A global analysis of metrics used for measuring performance in natural language processing

The results suggest that the large majority of natural language processing metrics currently in use have properties that may result in an inadequate reflection of a model's performance; moreover, ambiguities and inconsistencies in the reporting of metrics may lead to difficulties in interpreting and comparing model performances, impairing transparency and reproducibility in NLP research.

Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information

The framework allows for the interpretability of different input attributes via transformations of the input, which is used to discover annotation artefacts in widely-used NLP benchmarks.

Evaluating Inclusivity, Equity, and Accessibility of NLP Technology: A Case Study for Indian Languages

This paper proposes an evaluation paradigm that assesses NLP technologies across all three dimensions, hence quantifying the diversity of users they can serve and calls upon the community to incorporate this evaluation paradigm when building linguistically diverse technologies.

RealTime QA: What's the Answer Right Now?

We introduce RealTime QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). RealTime QA inquires about the…

How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory

This work identifies the implicit assumptions the standard human evaluation protocol makes about annotators and suggests improvements to make it more theoretically sound; even in its improved form, however, it cannot be used to evaluate open-ended tasks like story generation.

Perturbation Augmentation for Fairer NLP

It is shown that language models pre-trained on demographically perturbed corpora are fairer, at least according to the current best metrics for measuring model fairness, and that the improved fairness does not come at the expense of accuracy.

Fantastic Data and How to Query Them

This paper presents a vision of a unified framework in which different datasets can be integrated and queried easily, e.g., using standard query languages, and demonstrates it through ongoing work to create such a framework for datasets in Computer Vision.

Predicting Fine-Tuning Performance with Probing

This paper finds that the accuracies of only three probing tasks suffice to predict fine-tuning performance with errors 40%–80% smaller than baselines, showing the possibility of incorporating specialized probing datasets into the development of deep NLP models.

Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold

This paper provides historical and recent examples of how the square one bias has led researchers to draw false conclusions or make unwise choices, points to promising yet unexplored directions on the research manifold, and makes practical recommendations to enable more multi-dimensional research.

Dynabench: Rethinking Benchmarking in NLP

It is argued that Dynabench addresses a critical need in the community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios.

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, and carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values are provided.

Robustness Gym: Unifying the NLP Evaluation Landscape

Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks, is proposed.

DAWNBench: An End-to-End Deep Learning Benchmark and Competition

DAWNBench is introduced, a benchmark and competition focused on end-to-end training time to achieve a state-of-the-art accuracy level, as well as inference with that accuracy, and will provide a useful, reproducible means of evaluating the many tradeoffs in deep learning systems.

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics, is introduced, and the data for the 2021 shared task at the associated GEM Workshop is described.

TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing

TextFlint is a multilingual robustness evaluation toolkit for NLP tasks that incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations.

Utility is in the Eye of the User: A Critique of NLP Leaderboards

This opinion paper formalizes how leaderboards – in their current form – can be poor proxies for the NLP community at large and advocates for more transparency on leaderboards, such as the reporting of statistics that are of practical concern.

DynaSent: A Dynamic Benchmark for Sentiment Analysis

DynaSent (‘Dynamic Sentiment’), a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis, is introduced, along with a report on the dataset creation effort that focuses on the steps taken to increase quality and reduce artifacts.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.

The Ladder: A Reliable Leaderboard for Machine Learning Competitions

This work introduces a notion of leaderboard accuracy tailored to the format of a competition called the Ladder and demonstrates that it simultaneously supports strong theoretical guarantees in a fully adaptive model of estimation, withstands practical adversarial attacks, and achieves high utility on real submission files from an actual competition hosted by Kaggle.