Corpus ID: 214802200

Evaluating NLP Models via Contrast Sets

@article{Gardner2020EvaluatingNM,
  title={Evaluating NLP Models via Contrast Sets},
  author={Matt Gardner and Yoav Artzi and Victoria Basmova and Jonathan Berant and Ben Bogin and Sihao Chen and Pradeep Dasigi and Dheeru Dua and Yanai Elazar and Ananth Gottumukkala and Nitish Gupta and Hanna Hajishirzi and Gabriel Ilharco and Daniel Khashabi and Kevin Lin and Jiangming Liu and Nelson F. Liu and Phoebe Mulcaire and Qiang Ning and Sameer Singh and Noah A. Smith and Sanjay Subramanian and Reut Tsarfaty and Eric Wallace and A. Zhang and Ben Zhou},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.02709}
}
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the…
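
To make the evaluation paradigm concrete, here is a minimal sketch of how a classifier might be scored on contrast sets, both per example and per whole set. The data layout, function names, and the toy sentiment model below are illustrative assumptions for this sketch, not the authors' released code or exact metrics.

```python
# Sketch of contrast-set evaluation (hypothetical API and data layout).
# A "contrast set" is modeled as an original example plus its hand-written
# perturbations, each paired with a gold label.

from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (input text, gold label)
ContrastSet = List[Example]      # original example plus its perturbations


def contrast_set_accuracy(predict: Callable[[str], str],
                          contrast_sets: List[ContrastSet]) -> float:
    """Fraction of individual examples (originals and perturbations) predicted correctly."""
    examples = [ex for group in contrast_sets for ex in group]
    correct = sum(predict(text) == gold for text, gold in examples)
    return correct / len(examples)


def set_consistency(predict: Callable[[str], str],
                    contrast_sets: List[ContrastSet]) -> float:
    """Fraction of contrast sets on which every example in the set is predicted correctly."""
    consistent = sum(
        all(predict(text) == gold for text, gold in group)
        for group in contrast_sets
    )
    return consistent / len(contrast_sets)


if __name__ == "__main__":
    # Toy sentiment "model" and one contrast set: an original review plus two
    # small, meaning-changing perturbations that flip the gold label.
    toy_model = lambda text: "negative" if ("not" in text or "boring" in text) else "positive"
    sets = [[
        ("The film was gripping.", "positive"),
        ("The film was not gripping.", "negative"),
        ("The film was gripping but the ending was boring.", "negative"),
    ]]
    print(f"per-example accuracy: {contrast_set_accuracy(toy_model, sets):.2f}")
    print(f"whole-set consistency: {set_consistency(toy_model, sets):.2f}")
```

A per-set score of this kind is stricter than ordinary accuracy: a model only gets credit for a contrast set when it handles the original example and all of its perturbations, so shortcut decision rules that happen to fit the original test distribution are penalized.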
50 Citations

  • Geometry matters: Exploring language examples at the decision boundary
  • DQI: Measuring Data Quality in NLP
  • UnifiedQA: Crossing Format Boundaries With a Single QA System
