Evaluating NLP Models via Contrast Sets
```bibtex
@article{Gardner2020EvaluatingNM,
  title   = {Evaluating NLP Models via Contrast Sets},
  author  = {Matt Gardner and Yoav Artzi and Victoria Basmova and Jonathan Berant and Ben Bogin and Sihao Chen and Pradeep Dasigi and Dheeru Dua and Yanai Elazar and Ananth Gottumukkala and Nitish Gupta and Hanna Hajishirzi and Gabriel Ilharco and Daniel Khashabi and Kevin Lin and Jiangming Liu and Nelson F. Liu and Phoebe Mulcaire and Qiang Ning and Sameer Singh and Noah A. Smith and Sanjay Subramanian and Reut Tsarfaty and Eric Wallace and A. Zhang and Ben Zhou},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2004.02709}
}
```
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities.
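The paper's companion metric, contrast consistency, credits a model for a contrast set only if it labels every example in the set (the original plus all perturbations) correctly. Here is a minimal sketch of that metric; the dictionary data format and the `model` callable are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of contrast-set evaluation. The data format and model
# interface are hypothetical; the paper defines the metric, not this API.
from typing import Callable, Dict, List

def contrast_consistency(
    model: Callable[[str], str],
    contrast_sets: List[List[Dict[str, str]]],
) -> float:
    """Fraction of contrast sets on which the model labels *every*
    example correctly. Each contrast set holds an original test
    instance plus its minimally perturbed variants."""
    correct_sets = sum(
        all(model(ex["text"]) == ex["label"] for ex in cset)
        for cset in contrast_sets
    )
    return correct_sets / len(contrast_sets)

# Example: an IMDb-style sentiment contrast set, where a small edit
# to the original review flips the gold label.
sets = [[
    {"text": "A gripping, well-acted thriller.", "label": "positive"},
    {"text": "A plodding, poorly-acted thriller.", "label": "negative"},
]]
toy_model = lambda t: "positive" if "gripping" in t else "negative"
print(contrast_consistency(toy_model, sets))  # 1.0
```

Because a single wrong prediction fails the whole set, contrast consistency is a stricter probe of the local decision boundary than per-example accuracy on the same data.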
50 Citations
- Geometry matters: Exploring language examples at the decision boundary. ArXiv, 2020.
- Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data. EMNLP, 2020.
- Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures. ArXiv, 2020.
- The Effect of Natural Distribution Shift on Question Answering Models. ICML, 2020.
- On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law. NeurIPS, 2020.