Robustness Gym: Unifying the NLP Evaluation Landscape

@inproceedings{Goel2021RobustnessGU,
  title={Robustness Gym: Unifying the NLP Evaluation Landscape},
  author={Karan Goel and Nazneen Rajani and Jesse Vig and Samson Tan and Jason Wu and Stephan Zheng and Caiming Xiong and Mohit Bansal and Christopher Ré},
  booktitle={NAACL},
  year={2021}
}
Despite impressive performance on standard benchmarks, natural language processing (NLP) models are often brittle when deployed in real-world systems. In this work, we identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. By providing a common platform for evaluation, RG enables… 
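To make the four paradigms concrete, here is a minimal sketch in plain Python of how each one carves a "slice" out of (or derives one from) a toy sentiment dataset. It illustrates the concepts only and is not the Robustness Gym API; the dataset and the toy_attack helper are hypothetical.

# Illustrative sketch of the four evaluation paradigms (not the RG API).
# A "slice" here is just a list of (text, label) pairs carved out of, or
# derived from, an existing evaluation set.

dataset = [
    ("the film was great", "positive"),
    ("a long, dull and boring movie", "negative"),
    ("not bad at all", "positive"),
]

# 1. Subpopulation: filter existing examples by a property (here: short inputs).
subpopulation = [(x, y) for x, y in dataset if len(x.split()) <= 4]

# 2. Transformation: rewrite each example while preserving the label
#    (here: a trivial uppercasing perturbation).
transformation = [(x.upper(), y) for x, y in dataset]

# 3. Evaluation set: a separately authored set of challenge examples.
evaluation_set = [("the plot was anything but great", "negative")]

# 4. Adversarial attack: perturbations searched against a specific model;
#    here a stand-in that appends a distracting token.
def toy_attack(text):
    return text + " great"

adversarial = [(toy_attack(x), y) for x, y in dataset]

for name, slice_ in [("subpopulation", subpopulation),
                     ("transformation", transformation),
                     ("evaluation set", evaluation_set),
                     ("adversarial", adversarial)]:
    print(name, len(slice_))
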
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
TLDR
This work systematically applies 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
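A minimal sketch of the measurement an AdvGLUE-style benchmark enables: accuracy on original examples versus their adversarially perturbed counterparts. The paired examples and the keyword classifier below are hypothetical stand-ins, not AdvGLUE data or a real model.

# Hypothetical paired clean/adversarial sentiment examples (not AdvGLUE data).
pairs = [
    {"clean": "a warm, funny movie", "adv": "a wram, funy movie", "label": 1},
    {"clean": "tedious and overlong", "adv": "tedi0us and overlong", "label": 0},
]

def model(text):
    # Stand-in classifier: positive iff it spots a "positive" keyword.
    return int(any(w in text for w in ("warm", "funny", "great")))

clean_acc = sum(model(p["clean"]) == p["label"] for p in pairs) / len(pairs)
robust_acc = sum(model(p["adv"]) == p["label"] for p in pairs) / len(pairs)
print(f"clean accuracy  : {clean_acc:.2f}")
print(f"robust accuracy : {robust_acc:.2f}")
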
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
TLDR
A framework based on this idea, which is able to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings, is developed and applied to the GEM generation benchmark.
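The two operations such a framework automates can be sketched on a toy data-to-text example: a controlled perturbation of the structured input (dropping a field) and a subset defined by an input property (number of fields). The instances and the drop_field helper are invented for illustration and are not the framework's API.

# Toy data-to-text instances: structured input plus a reference text.
instances = [
    {"data": {"name": "Blue Spice", "food": "French", "area": "riverside"},
     "reference": "Blue Spice serves French food in the riverside area."},
    {"data": {"name": "Cotto", "food": "Italian"},
     "reference": "Cotto serves Italian food."},
]

def drop_field(instance, field):
    # Controlled perturbation: remove one field from the structured input.
    data = {k: v for k, v in instance["data"].items() if k != field}
    return {**instance, "data": data, "perturbation": f"drop:{field}"}

perturbed = [drop_field(inst, "area") for inst in instances if "area" in inst["data"]]

# Subset: instances whose inputs have at least three fields (longer inputs).
long_inputs = [inst for inst in instances if len(inst["data"]) >= 3]

print(len(perturbed), "perturbed instances,", len(long_inputs), "in the subset")
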
Robust Natural Language Processing: Recent Advances, Challenges, and Future Directions
TLDR
It is argued that robustness should be multi-dimensional; the survey provides insights into current research, identifies gaps in the literature, suggests directions worth pursuing to address those gaps, and takes a deep dive into the various dimensions of robustness across techniques, metrics, embeddings, and benchmarks.
Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition
TLDR
This work introduces the “Break, Perturb, Build” (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs, and demonstrates the effectiveness of BPB by creating evaluation sets for three reading comprehension benchmarks, generating thousands of high-quality examples without human intervention.
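A toy illustration of the BPB idea: represent a question as an executable reasoning step, perturb that step (flip a comparison), regenerate the question text, and recompute the gold answer programmatically so no human annotation is needed. The context, question, and helpers below are hypothetical.

# Toy context the question reasons over (hypothetical data).
heights = {"Mount Foo": 4100, "Mount Bar": 2950, "Mount Baz": 3675}

# A question paired with an explicit, executable reasoning step.
original = {"question": "Which mountain is the highest?", "op": max}

def execute(step):
    # Recompute the gold answer by executing the (possibly perturbed) step.
    return step["op"](heights, key=heights.get)

def perturb(step):
    # Reasoning-oriented perturbation: swap the comparison direction.
    return {"question": step["question"].replace("highest", "lowest"), "op": min}

perturbed = perturb(original)
print(original["question"], "->", execute(original))
print(perturbed["question"], "->", execute(perturbed))
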
Personalized Benchmarking with the Ludwig Benchmarking Toolkit
TLDR
The open-source Ludwig Benchmarking Toolkit (LBT) is introduced, a personalized benchmarking toolkit for running end-to-end benchmark studies across an easily extensible set of tasks, deep learning models, datasets and evaluation metrics, showing how LBT can be used to satisfy various benchmarking objectives.
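The kind of study LBT standardizes can be outlined generically: run every (model, dataset) pair through the same evaluation routine and collect the metrics into one table. The sketch below is plain Python with toy models and data, not the actual LBT API.

from itertools import product

def evaluate(model_fn, dataset):
    # Accuracy of a text -> label function on (text, label) pairs.
    return sum(model_fn(x) == y for x, y in dataset) / len(dataset)

models = {
    "always_positive": lambda text: "positive",
    "keyword": lambda text: "positive" if "good" in text else "negative",
}
datasets = {
    "reviews": [("good plot", "positive"), ("bad acting", "negative")],
    "tweets": [("so good", "positive"), ("awful", "negative")],
}

results = {(m, d): evaluate(models[m], datasets[d])
           for m, d in product(models, datasets)}
for (m, d), acc in sorted(results.items()):
    print(f"{m:>15} on {d:<8} accuracy={acc:.2f}")
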
What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression
TLDR
A study of two popular model compression techniques, knowledge distillation and pruning, shows that compressed models are significantly less robust than their PLM counterparts on adversarial test sets, although they obtain similar performance on in-distribution development sets for a task.
Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly, instead of relying on self-reported metrics or predictions on a single dataset.
Measure and Improve Robustness in NLP Models: A Survey
TLDR
This paper unifies various lines of work on identifying robustness failures and evaluating models’ robustness, and presents mitigation strategies that are data-driven, model-driven, and inductive-prior-based, with a more systematic view of how to effectively improve robustness in NLP models.
What do we Really Know about State of the Art NER?
TLDR
A broad evaluation of NER is performed using a popular dataset, taking into consideration the various text genres and sources that constitute it, and some useful reporting practices are recommended that could help NER researchers provide a better understanding of a SOTA model’s performance in the future.
Testing Cross-Database Semantic Parsers Using Canonical Utterances
TLDR
This work characterizes a set of essential capabilities for cross-database semantic parsing models, details a method for synthesizing the corresponding test data, and evaluates a variety of high-performing models using the proposed approach.

References

Showing 1-10 of 109 references
Adversarial NLI: A New Benchmark for Natural Language Understanding
TLDR
This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding the models' weaknesses.
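One round of such a human-and-model-in-the-loop collection can be sketched as: keep only the annotator-written examples that the current model misclassifies, then use them to train the next-round model. The stand-in model and candidate examples below are hypothetical.

# Sketch of one adversarial collection round (not the ANLI pipeline itself):
# keep only annotator-written examples the current model gets wrong, then
# fold them into the data used to train the next-round model.

def current_model(premise, hypothesis):
    # Stand-in NLI model: predicts "contradiction" only on explicit negation.
    return "contradiction" if "not" in hypothesis else "entailment"

candidate_examples = [  # hypothetical annotator submissions
    {"premise": "The cat sleeps.", "hypothesis": "The cat is awake.",
     "label": "contradiction"},
    {"premise": "The cat sleeps.", "hypothesis": "The cat is not awake.",
     "label": "contradiction"},
]

fooling = [ex for ex in candidate_examples
           if current_model(ex["premise"], ex["hypothesis"]) != ex["label"]]

next_round_training_data = fooling  # verified fooling examples feed the next round
print(f"{len(fooling)} of {len(candidate_examples)} examples fooled the model")
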
Analyzing Compositionality-Sensitivity of NLI Models
TLDR
This work proposes a compositionality-sensitivity testing setup that analyzes models on natural examples from existing datasets that cannot be solved via lexical features alone, hence revealing the models' actual compositionality awareness.
Universal Adversarial Triggers for NLP
Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset.
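How a trigger is used at evaluation time can be sketched simply (the gradient-guided search that finds triggers is not shown): prepend the same token sequence to every input and measure how often the prediction changes. The trigger string and keyword classifier below are illustrative stand-ins.

# Sketch of *applying* a universal trigger; the trigger and model are hypothetical.
TRIGGER = "zoning tapping fiennes"   # a fixed, input-agnostic token sequence

def model(text):
    # Stand-in sentiment classifier that reacts to a spurious token.
    return "negative" if "zoning" in text else "positive"

inputs = ["a delightful film", "warm and well acted", "simply wonderful"]
flipped = sum(model(f"{TRIGGER} {x}") != model(x) for x in inputs)
print(f"attack success rate: {flipped / len(inputs):.0%}")
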
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
TLDR
A new benchmark styled after GLUE is presented, comprising a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
Evaluating NLP Models via Contrast Sets
TLDR
A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.
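Contrast-set evaluation adds a stricter metric than plain accuracy: contrast consistency, the fraction of originals for which the model is also correct on every perturbed variant in the bundle. The sketch below computes it for a hypothetical keyword classifier on two toy bundles.

# Sketch of contrast-set evaluation; data and classifier are hypothetical.
def model(text):
    return "positive" if "great" in text else "negative"

bundles = [
    {"original": ("a great movie", "positive"),
     "contrast": [("a far from great movie", "negative")]},
    {"original": ("nothing great here", "negative"),
     "contrast": [("something great here", "positive")]},
]

def correct(example):
    text, label = example
    return model(text) == label

# Contrast consistency: correct on the original AND on all its contrast examples.
consistency = sum(
    correct(b["original"]) and all(correct(c) for c in b["contrast"])
    for b in bundles
) / len(bundles)
print(f"contrast consistency: {consistency:.2f}")
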
AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models
TLDR
This work introduces AllenNLP Interpret, a flexible framework for interpreting NLP models, which provides interpretation primitives for any AllenNLP model and task, a suite of built-in interpretation methods, and a library of front-end visualization components.
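One of the interpretation primitives such frameworks expose, gradient-based saliency, can be sketched on a toy bag-of-embeddings classifier. This is a generic PyTorch illustration, not the AllenNLP Interpret API; the vocabulary and model are invented.

# Generic sketch of gradient-based saliency (not the AllenNLP Interpret API).
import torch

torch.manual_seed(0)
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
embed = torch.nn.Embedding(len(vocab), 8)
classifier = torch.nn.Linear(8, 2)

tokens = ["the", "movie", "was", "great"]
ids = torch.tensor([vocab[t] for t in tokens])

vectors = embed(ids)                      # (seq_len, dim)
vectors.retain_grad()
logits = classifier(vectors.mean(dim=0))  # bag-of-embeddings prediction
logits[logits.argmax()].backward()        # gradient of the predicted class score

# Saliency: L2 norm of the gradient w.r.t. each token's embedding.
saliency = vectors.grad.norm(dim=1)
for tok, score in zip(tokens, saliency.tolist()):
    print(f"{tok:>6}: {score:.3f}")
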
TextAttack: A Framework for Adversarial Attacks in Natural Language Processing
TextAttack is a library for running adversarial attacks against natural language processing (NLP) models. TextAttack builds attacks from four components: a search method, goal function, transformation, and a set of constraints.
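The four-component decomposition can be sketched conceptually in plain Python; this is not the TextAttack API, and the victim model and each component are toy stand-ins.

def model(text):                       # victim classifier (stand-in)
    return "positive" if "good" in text else "negative"

def goal_function(original, candidate):        # succeed if the label flips
    return model(candidate) != model(original)

def constraint(original, candidate):           # keep the edit small
    return abs(len(candidate) - len(original)) <= 3

def transformation(text):                      # generate candidate perturbations
    return [text.replace(word, word[:-1]) for word in text.split()]

def greedy_search(text):                       # search over the candidates
    for candidate in transformation(text):
        if constraint(text, candidate) and goal_function(text, candidate):
            return candidate
    return None

print(greedy_search("a good film"))
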
Adversarial Examples for Evaluating Reading Comprehension Systems
TLDR
This work proposes an adversarial evaluation scheme for the Stanford Question Answering Dataset that tests whether systems can answer questions about paragraphs containing adversarially inserted sentences that do not change the correct answer or mislead humans.
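The evaluation idea is easy to sketch: append a distracting sentence that leaves the correct answer unchanged, then check whether the system's answer moves. The paragraph, question, and stand-in reader below are hypothetical.

# Sketch of adversarial-sentence evaluation for reading comprehension.
paragraph = "Tesla moved to New York City in 1884."
question = "In what year did Tesla move to New York City?"
distractor = "Edison moved to Boston in 1901."   # wrong entity, wrong city

def system(paragraph, question):
    # Stand-in reader (ignores the question): returns the last year mentioned.
    years = [tok.strip(".") for tok in paragraph.split() if tok.strip(".").isdigit()]
    return years[-1] if years else None

clean_answer = system(paragraph, question)
adv_answer = system(paragraph + " " + distractor, question)
print("clean:", clean_answer, "| adversarial:", adv_answer,
      "| robust:", clean_answer == adv_answer)
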
REL: An Entity Linker Standing on the Shoulders of Giants
TLDR
The REL system is presented, building on state-of-the-art neural components from natural language processing research and provided as a Python package as well as a web API; an experimental comparison is reported against both well-established systems and the current state of the art on standard entity linking benchmarks.
The Effect of Natural Distribution Shift on Question Answering Models
TLDR
Four new test sets for the Stanford Question Answering Dataset are built, and the ability of question-answering systems to generalize to new data is evaluated; the results confirm the surprising resilience of the holdout method and emphasize the need to move towards evaluation metrics that incorporate robustness to natural distribution shifts.