Evaluating Models’ Local Decision Boundaries via Contrast Sets

@inproceedings{Gardner2020EvaluatingML,
  title={Evaluating Models’ Local Decision Boundaries via Contrast Sets},
  author={Matt Gardner and Yoav Artzi and Jonathan Berant and Ben Bogin and Sihao Chen and Dheeru Dua and Yanai Elazar and Ananth Gottumukkala and Nitish Gupta and Hannaneh Hajishirzi and Gabriel Ilharco and Daniel Khashabi and Kevin Lin and Jiangming Liu and Nelson F. Liu and Phoebe Mulcaire and Qiang Ning and Sameer Singh and Noah A. Smith and Sanjay Subramanian and Eric Wallace and Ally Zhang and Ben Zhou},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},
  year={2020}
}
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we… 
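The paper evaluates models on these manually perturbed test instances with both standard per-example accuracy and a set-level consistency measure (all examples in a contrast set predicted correctly). Below is a minimal sketch of how such an evaluation might be computed; the list-of-dicts data layout and the predict_fn stand-in for a trained model are illustrative assumptions, not the authors' released format.

# Minimal sketch: evaluating a classifier on contrast sets.
# A contrast set groups an original test example with its small,
# label-changing perturbations. Besides per-example accuracy, we report
# set-level consistency: the fraction of sets answered entirely correctly.
# Data layout and predict_fn are illustrative assumptions only.

from typing import Callable, Dict, List

ContrastSet = List[Dict[str, str]]  # each dict: {"text": ..., "label": ...}

def evaluate_contrast_sets(
    contrast_sets: List[ContrastSet],
    predict_fn: Callable[[str], str],
) -> Dict[str, float]:
    total, correct, consistent_sets = 0, 0, 0
    for contrast_set in contrast_sets:
        set_correct = True
        for example in contrast_set:
            prediction = predict_fn(example["text"])
            total += 1
            if prediction == example["label"]:
                correct += 1
            else:
                set_correct = False
        consistent_sets += int(set_correct)
    return {
        "accuracy": correct / total,
        "consistency": consistent_sets / len(contrast_sets),
    }

if __name__ == "__main__":
    # Toy NLI-style contrast set: one original premise/hypothesis pair
    # plus two perturbed hypotheses that change the gold label.
    sets = [[
        {"text": "A man plays a guitar. / A person plays music.", "label": "entailment"},
        {"text": "A man plays a guitar. / A person plays the drums.", "label": "contradiction"},
        {"text": "A man plays a guitar. / A man plays music for tips.", "label": "neutral"},
    ]]
    dummy_model = lambda text: "entailment"  # stand-in for a real model
    print(evaluate_contrast_sets(sets, dummy_model))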

Citations

Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA
TLDR
This work presents a novel method that leverages a rich semantic input representation to automatically generate contrast sets for the visual question answering task and to compute the answers to the perturbed questions, vastly reducing annotation cost and enabling thorough evaluation of models’ performance on various semantic aspects.
What Will it Take to Fix Benchmarking in Natural Language Understanding?
TLDR
It is argued that most current benchmarks fail these criteria, and that adversarially constructed, out-of-distribution test sets do not meaningfully address the causes of these failures.
More Bang for Your Buck: Natural Perturbation for Robust Question Answering
TLDR
It is found that when natural perturbations are moderately cheaper to create, it is more effective to train models using them: such models exhibit higher robustness and better generalization, while retaining performance on the original BoolQ dataset.
Does Putting a Linguist in the Loop Improve NLU Data Collection?
TLDR
It is found that linguist involvement does not lead to increased accuracy on out-of-domain test sets compared to the baseline, and that adding a chatroom has no effect on the data, demonstrating the benefits of integrating expert analysis during data collection.
Deriving Behavioral Tests from Common Sense Knowledge Graphs
TLDR
This work introduces a semi-automated approach that leverages commonsense knowledge graphs (CSKGs) to construct out-of-domain evaluation sets for NLP tasks, an approach more scalable than purely manual construction.
Learning with Instance Bundles for Reading Comprehension
TLDR
Drawing on ideas from contrastive estimation, several new supervision losses are introduced that compare question-answer scores across multiple related instances, and normalize these scores across various neighborhoods of closely contrasting questions and/or answers.
Defuse: Harnessing Unrestricted Adversarial Examples for Debugging Models Beyond Test Accuracy
TLDR
An algorithm inspired by adversarial machine learning techniques uses a generative model to find naturally occurring instances misclassified by a model; building on it, Defuse is proposed, a method that generates novel model misclassifications, categorizes these errors into high-level “model bugs”, and efficiently labels and fine-tunes on the errors to correct them.
Understanding Few-Shot Commonsense Knowledge Models
TLDR
This work investigates training commonsense knowledge models in a few-shot setting with limited tuples per commonsense relation in the graph, and finds that human quality ratings for knowledge produced by a few-shot trained system can come within 6% of those for knowledge produced by fully supervised systems.
SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning
TLDR
Experiments show that further pretraining LMs on this automatically generated data significantly improves their capability for spatial understanding, which in turn helps to better solve two external datasets, bAbI and BoolQ.
Polyjuice: Automated, General-purpose Counterfactual Generation
TLDR
Polyjuice supports multiple use cases: by generating diverse counterfactuals for humans to label, Polyjuice helps produce high-quality datasets for model training and evaluation while requiring 40% less human effort.

References

Showing 1-10 of 80 references
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
TLDR
It is shown that model performance improves when training with annotator identifiers as features, that models are able to recognize the most productive annotators, and that models often do not generalize well to examples from annotators who did not contribute to the training set.
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
Measuring and Mitigating Unintended Bias in Text Classification
TLDR
A new approach to measuring and mitigating unintended bias in machine learning models is introduced, using a set of common demographic identity terms as the subset of input features on which to measure bias.
More Bang for Your Buck: Natural Perturbation for Robust Question Answering
TLDR
It is found that when natural perturbations are moderately cheaper to create, it is more effective to train models using them: such models exhibit higher robustness and better generalization, while retaining performance on the original BoolQ dataset.
Semantically Equivalent Adversarial Rules for Debugging NLP models
TLDR
This work presents semantically equivalent adversaries (SEAs) – semantic-preserving perturbations that are extremely similar semantically to the original inputs yet induce changes in the model’s predictions – and generalizes them into simple replacement rules that induce adversaries on many instances.
Pathologies of Neural Models Make Interpretations Difficult
TLDR
This work uses input reduction, which iteratively removes the least important word from the input, to expose pathological behaviors of neural models: the remaining words appear nonsensical to humans and are not the ones determined as important by interpretation methods.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
TLDR
It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that this transfer, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
TLDR
This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.
A theory of learning from different domains
TLDR
A classifier-induced divergence measure is introduced that can be estimated from finite, unlabeled samples from the domains, and it is shown how to choose the optimal combination of source and target error as a function of the divergence, the sample sizes of both domains, and the complexity of the hypothesis class.
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
TLDR
A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.