Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

Marco Tulio Ribeiro, Tongshuang Sherry Wu, Carlos Guestrin, Sameer Singh
Annual Meeting of the Association for Computational Linguistics
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate… 
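As a rough illustration of the template-based testing the abstract describes, a CheckList-style Minimum Functionality Test fills a template with lexicon entries and checks a fixed expected label. The sketch below is hypothetical (`fill_template`, `run_mft`, and the toy model are not the CheckList library's actual API):

```python
from itertools import product

def fill_template(template, slots):
    """Expand a template like 'I {verb} the {noun}.' over all slot values."""
    keys = list(slots)
    return [template.format(**dict(zip(keys, vals)))
            for vals in product(*(slots[k] for k in keys))]

def run_mft(predict, cases, expected):
    """Return the failure rate of `predict` on `cases` w.r.t. `expected`."""
    failures = [c for c in cases if predict(c) != expected]
    return len(failures) / len(cases)

cases = fill_template("I {verb} the {noun}.",
                      {"verb": ["love", "adore"], "noun": ["film", "cast"]})
# Trivial keyword "model" standing in for a real sentiment classifier.
toy_model = lambda s: "positive" if ("love" in s or "adore" in s) else "negative"
print(run_mft(toy_model, cases, "positive"))  # 0.0: every case passes
```

Invariance and directional-expectation tests follow the same pattern, differing only in how the expectation over outputs is stated.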


Red Teaming Language Models with Language Models

This work automatically finds cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM, and evaluates the target LM’s replies to generated test questions using a classifier trained to detect offensive content.

A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist

A case study is presented on using Checklist in a practical scenario: evaluating an offensive content detection system, and improving the model with a data augmentation technique guided by insights from Checklist.

Principles and Interactive Tools for Evaluating and Improving the Behavior of Natural Language Processing models

This thesis focuses on helping practitioners organize and explore the inputs and outputs of their models, such that they can gain more systematic insights into their models’ behaviors, and identifies two building blocks that are essential for informative analysis.

ABNIRML: Analyzing the Behavior of Neural IR Models

A new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML) is presented, which includes new types of diagnostic probes that allow us to test several characteristics—such as writing styles, factuality, sensitivity to paraphrasing and word order—that are not addressed by previous techniques.

ER-TEST: Evaluating Explanation Regularization Methods for NLP Models

Through ER-TEST, it is shown that ER has little impact on in-distribution (ID) performance but can yield large gains on out-of-distribution (OOD) performance w.r.t. (1)-(3), that the best ER criterion is task-dependent, and that ER can improve OOD performance even with limited human rationales.

Predicting Fine-Tuning Performance with Probing

This paper explores the utility of probing deep NLP models to extract a proxy signal widely used in model development: fine-tuning performance. It suggests that the accuracies of only three probing tests can predict this performance with errors 40%-80% smaller than baselines.

TestAug: A Framework for Augmenting Capability-based NLP Tests

This paper investigates a different approach that requires the developer to only annotate a few test templates, while leveraging the GPT-3 engine to generate the majority of test cases, and guarantees the correctness of the generated suites with a validity checker.

Automatic Construction of Evaluation Suites for Natural Language Generation Datasets

A framework based on this idea, which is able to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings, is developed and applied to the GEM generation benchmark.

Regression Bugs Are In Your Model! Measuring, Reducing and Analyzing Regressions In NLP Model Updates

This work focuses on quantifying, reducing and analyzing regression errors in the NLP model updates, using negative flip rate as regression measure, and shows that regression has a prevalent presence across tasks in the GLUE benchmark.
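The negative flip rate mentioned above has a simple definition: the fraction of examples the old model classified correctly that the updated model gets wrong. A minimal sketch (variable names are my own):

```python
def negative_flip_rate(y_true, old_pred, new_pred):
    """Fraction of all examples where the old model was right
    but the updated model is wrong (a 'negative flip')."""
    flips = sum(1 for t, o, n in zip(y_true, old_pred, new_pred)
                if o == t and n != t)
    return flips / len(y_true)

y_true   = [1, 0, 1, 1]
old_pred = [1, 0, 0, 1]
new_pred = [1, 1, 0, 0]
print(negative_flip_rate(y_true, old_pred, new_pred))  # 0.5
```

Even an update that raises aggregate accuracy can have a nonzero negative flip rate, which is why the measure is reported separately from accuracy.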

Polyjuice: Automated, General-purpose Counterfactual Generation

Polyjuice supports multiple use cases: by generating diverse counterfactuals for humans to label, Polyjuice helps produce high-quality datasets for model training and evaluation while requiring 40% less human effort.



Beyond Accuracy: Behavioral Testing of NLP Models with Checklist (Extended Abstract)


GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Probing What Different NLP Tasks Teach Machines about Function Word Comprehension

The results show that pretraining on CCG—the authors' most syntactic objective—performs the best on average across their probing tasks, suggesting that syntactic knowledge helps function word comprehension.

Semantically Equivalent Adversarial Rules for Debugging NLP models

This work presents semantically equivalent adversaries (SEAs) – semantics-preserving perturbations that induce changes in the model’s predictions – and generalizes them into semantically equivalent adversarial rules (SEARs) that induce adversaries on many instances that are extremely similar semantically.

Perturbation Sensitivity Analysis to Detect Unintended Model Biases

A generic evaluation framework, Perturbation Sensitivity Analysis, is proposed, which detects unintended model biases related to named entities, and requires no new annotations or corpora to be employed.
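The framework's core idea can be sketched concretely: substitute different named entities into the same sentence and measure how much the model's score varies; a large spread suggests entity-linked bias. The helper below is a hypothetical simplification, not the paper's actual implementation:

```python
import statistics

def perturbation_sensitivity(score, template, names):
    """Score a sentence template filled with different names and return
    the standard deviation of the model scores; a large spread is a
    proxy for unintended bias tied to named entities (rough sketch)."""
    scores = [score(template.format(name=n)) for n in names]
    return statistics.pstdev(scores)

# Toy scoring function standing in for a real toxicity/sentiment model.
toy_score = lambda s: 0.9 if "Alice" in s else 0.1
print(perturbation_sensitivity(toy_score, "{name} is a doctor.",
                               ["Alice", "Bob", "Carol"]))
```

An unbiased model would score all fillings identically, yielding a sensitivity of zero; no new annotations are needed, matching the claim above.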

Errudite: Scalable, Reproducible, and Testable Error Analysis

This paper codifies model- and task-agnostic principles for informative error analysis and presents Errudite, an interactive tool for better supporting this process, which enables users to perform high-quality, reproducible error analyses with less effort.

Models in the Wild: On Corruption Robustness of Neural NLP Systems

This paper introduces WildNLP - a framework for testing model stability in a natural setting where text corruptions such as keyboard errors or misspelling occur, and compares robustness of deep learning models from 4 popular NLP tasks by testing their performance on aspects introduced in the framework.

Universal Adversarial Triggers for Attacking and Analyzing NLP

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset.

Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets

It is shown that model performance improves when training with annotator identifiers as features, that models are able to recognize the most productive annotators, and that models often do not generalize well to examples from annotators who did not contribute to the training set.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
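The F1 score cited here is SQuAD's token-overlap F1 between predicted and gold answer spans; a simplified sketch (omitting SQuAD's official normalization of articles and punctuation):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string,
    as in SQuAD-style evaluation (text normalization omitted)."""
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the Eiffel Tower", "Eiffel Tower"), 3))  # 0.8
```

The official evaluation also takes the maximum F1 over the multiple gold answers provided for each question.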