Corpus ID: 214802200

Evaluating NLP Models via Contrast Sets

@article{Gardner2020EvaluatingNM,
  title={Evaluating NLP Models via Contrast Sets},
  author={Matt Gardner and Yoav Artzi and Victoria Basmova and Jonathan Berant and Ben Bogin and Sihao Chen and Pradeep Dasigi and Dheeru Dua and Yanai Elazar and Ananth Gottumukkala and Nitish Gupta and Hannaneh Hajishirzi and Gabriel Ilharco and Daniel Khashabi and Kevin Lin and Jiangming Liu and Nelson F. Liu and Phoebe Mulcaire and Qiang Ning and Sameer Singh and Noah A. Smith and Sanjay Subramanian and Reut Tsarfaty and Eric Wallace and Ally Zhang and Ben Zhou},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.02709}
}
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the… 
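The paradigm sketched above, small, label-changing perturbations of existing test instances ("contrast sets"), suggests a set-level evaluation score alongside ordinary accuracy. Below is a minimal, hypothetical Python sketch of such a score, the fraction of contrast sets on which a model answers every member correctly; the data layout and the toy always-positive model are illustrative assumptions, not the paper's released format.

```python
# Hedged sketch: a contrast set groups an original test example with small,
# manually written perturbations that change the gold label. Besides ordinary
# accuracy, one can report a set-level score: the fraction of contrast sets on
# which the model answers *every* member correctly. The data layout and the toy
# sentiment model below are illustrative assumptions, not the paper's format.

from typing import Callable, Dict, List

Example = Dict[str, str]  # {"input": ..., "label": ...}


def contrast_consistency(contrast_sets: List[List[Example]],
                         predict: Callable[[str], str]) -> float:
    """Fraction of contrast sets on which the model is right on all examples."""
    consistent = sum(
        all(predict(ex["input"]) == ex["label"] for ex in contrast_set)
        for contrast_set in contrast_sets
    )
    return consistent / len(contrast_sets)


if __name__ == "__main__":
    sets = [[
        {"input": "A charming, heartfelt film.", "label": "positive"},
        {"input": "A charmless, heartless film.", "label": "negative"},
    ]]
    always_positive = lambda text: "positive"
    print(contrast_consistency(sets, always_positive))  # 0.0: the perturbation exposes the shortcut
```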

Citations

Natural Perturbation for Robust Question Answering
TLDR
It is found that when natural perturbations are moderately cheaper to create, it is more effective to train models using them: such models exhibit higher robustness and better generalization, while retaining performance on the original BoolQ dataset.
Can NLI Models Verify QA Systems' Predictions?
TLDR
Careful manual analysis over the predictions of the NLI model shows that it can further identify cases where the QA model produces the right answer for the wrong reason, i.e., when the answer sentence does not address all aspects of the question.
Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data
TLDR
This work identifies failure modes of SOTA relation extraction (RE) models trained on TACRED, attributes them to limitations in the data annotation process, and provides concrete suggestions on how to improve RE data collection to alleviate this behavior.
Question Generation for Evaluating Cross-Dataset Shifts in Multi-modal Grounding
TLDR
This work, under development at UCLA, proposes a VQG module that automatically generates OOD shifts, enabling systematic evaluation of the cross-dataset adaptation capabilities of VQA models.
Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures
TLDR
It is shown that without targeting a specific bias, the sentence augmentation improves the robustness of transformer models against multiple biases, and that models can still be vulnerable to the lexical overlap bias, even when the training data does not contain this bias.
The Effect of Natural Distribution Shift on Question Answering Models
TLDR
Four new test sets for the Stanford Question Answering Dataset are built, and the ability of question-answering systems to generalize to new data is evaluated; the results confirm the surprising resilience of the holdout method and emphasize the need to move towards evaluation metrics that incorporate robustness to natural distribution shifts.
Can Fine-tuning Pre-trained Models Lead to Perfect NLP? A Study of the Generalizability of Relation Extraction
TLDR
Through empirical experimentation, this study finds that BERT's robustness is a bottleneck under randomization, adversarial, and counterfactual tests, as well as selection and semantic biases, and it highlights opportunities for future improvements.
DQI: Measuring Data Quality in NLP
TLDR
This work introduces a generic formula for a Data Quality Index (DQI) to help dataset creators build datasets free of unwanted biases, and uses DQI along with automated methods to renovate biased examples in SNLI.
Learning from Task Descriptions
TLDR
This work introduces a framework for developing NLP systems that solve new tasks after reading their descriptions, synthesizing prior work in this area, and instantiates it with a new English language dataset, ZEST, structured for task-oriented evaluation on unseen tasks.
Understanding tables with intermediate pre-training
TLDR
This work adapts TAPAS (Herzig et al., 2020), a table-based BERT model, to recognize entailment, and creates a balanced dataset of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning.

References

Showing 1-10 of 78 references
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
TLDR
It is shown that model performance improves when training with annotator identifiers as features, that models are able to recognize the most productive annotators, and that models often do not generalize well to examples from annotators who did not contribute to the training set.
Adversarial NLI: A New Benchmark for Natural Language Understanding
TLDR
This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding the models' weaknesses.
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
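As an illustration of the hypothesis-only probe described in that summary, here is a minimal sketch; the TF-IDF plus logistic-regression pipeline and the data variables are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of a hypothesis-only baseline: train a classifier that never sees
# the premise. If it performs far above chance, the labels leak through artifacts
# in the hypotheses alone. The feature and model choices here are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def hypothesis_only_accuracy(train_hypotheses, train_labels,
                             test_hypotheses, test_labels):
    """Accuracy of a premise-blind classifier; high values suggest annotation artifacts."""
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(train_hypotheses, train_labels)
    return clf.score(test_hypotheses, test_labels)
```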
Semantically Equivalent Adversarial Rules for Debugging NLP models
TLDR
This work presents semantically equivalent adversaries (SEAs), semantics-preserving perturbations that induce changes in a model's predictions, and generalizes them into semantically equivalent adversarial rules (SEARs), simple replacement rules that induce such adversaries on many instances.
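A toy version of that idea follows; the single hand-written regex rule and the `predict` callable are hypothetical stand-ins, whereas the actual method generates candidate perturbations with paraphrasing models and mines general rules from them.

```python
# Hedged sketch: apply a (hopefully) meaning-preserving rewrite rule and flag
# inputs where the model's prediction flips. The regex rule and the `predict`
# callable are toy stand-ins for the paraphrase-based procedure in the paper.

import re
from typing import Callable, Optional, Tuple


def apply_rule(text: str, pattern: str, replacement: str) -> str:
    return re.sub(pattern, replacement, text)


def find_flip(text: str, predict: Callable[[str], str],
              pattern: str = r"\bWhat is\b",
              replacement: str = "What's") -> Optional[Tuple[str, str, str]]:
    """Return (perturbed_text, old_pred, new_pred) if the rewrite changes the prediction."""
    perturbed = apply_rule(text, pattern, replacement)
    if perturbed == text:
        return None
    old_pred, new_pred = predict(text), predict(perturbed)
    return (perturbed, old_pred, new_pred) if old_pred != new_pred else None
```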
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
TLDR
It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.
Pathologies of Neural Models Make Interpretations Difficult
TLDR
This work uses input reduction, which iteratively removes the least important word from the input, to expose pathological behaviors of neural models: the remaining words appear nonsensical to humans and are not the ones determined as important by interpretation methods.
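The greedy loop that summary describes can be sketched as follows; the `predict_proba` callable, mapping a token list to class probabilities, is an assumed interface for any classifier, not the paper's code.

```python
# Hedged sketch of greedy input reduction: repeatedly drop the token whose removal
# leaves the most confidence in the model's original prediction, stopping once even
# the safest removal would change the predicted label. `predict_proba` is an
# assumed interface: it maps a token list to a {label: probability} dict.

from typing import Callable, Dict, List


def input_reduction(tokens: List[str],
                    predict_proba: Callable[[List[str]], Dict[str, float]]) -> List[str]:
    probs = predict_proba(tokens)
    original_label = max(probs, key=probs.get)
    current = list(tokens)
    while len(current) > 1:
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        # Keep the reduction that preserves the most confidence in the original label.
        best = max(candidates,
                   key=lambda c: predict_proba(c).get(original_label, 0.0))
        best_probs = predict_proba(best)
        if max(best_probs, key=best_probs.get) != original_label:
            break  # even the safest removal flips the prediction; stop here
        current = best
    return current
```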
Measuring and Mitigating Unintended Bias in Text Classification
TLDR
A new approach to measuring and mitigating unintended bias in machine learning models is introduced, using a set of common demographic identity terms as the subset of input features on which to measure bias.
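One simple probe in that spirit, not necessarily the paper's own metric, is to fill a neutral template with different identity terms and compare a classifier's scores; the template, term list, and `score_toxicity` interface below are illustrative assumptions.

```python
# Hedged sketch: fill a neutral template with different identity terms and compare
# a toxicity model's scores. Large gaps on otherwise identical sentences suggest
# unintended identity-term bias. The template, term list, and `score_toxicity`
# interface are illustrative assumptions, not the paper's benchmark.

from typing import Callable, Dict, Tuple

IDENTITY_TERMS = ["gay", "straight", "muslim", "christian", "american", "mexican"]
TEMPLATE = "I am a {} person."


def identity_term_gap(score_toxicity: Callable[[str], float]) -> Tuple[float, Dict[str, float]]:
    """Return (max - min) toxicity across identity-term fillings, plus per-term scores."""
    scores = {term: score_toxicity(TEMPLATE.format(term)) for term in IDENTITY_TERMS}
    return max(scores.values()) - min(scores.values()), scores
```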
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
TLDR
This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.
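A heavily simplified sketch of that filtering loop follows; the shared candidate pool, TF-IDF features, and logistic-regression classifier are simplifications of the actual per-example procedure over generated endings.

```python
# Hedged, heavily simplified sketch of adversarial filtering: iteratively train a
# shallow "stylistic" classifier to tell real endings from machine-written ones,
# then keep the machine-written candidates that look most real to it, so surface
# cues stop predicting the label. All modeling choices here are simplifications.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def adversarial_filter(real_endings, candidate_pool, rounds=5, keep_k=3):
    """Return the keep_k candidates a simple style classifier finds hardest to reject."""
    kept = list(candidate_pool[:keep_k])
    for _ in range(rounds):
        texts = list(real_endings) + kept
        labels = [1] * len(real_endings) + [0] * len(kept)
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(texts, labels)
        # Score every candidate by how "real" it looks and keep the hardest
        # negatives for the next round.
        real_column = list(clf.classes_).index(1)
        scores = clf.predict_proba(candidate_pool)[:, real_column]
        ranked = sorted(zip(scores, range(len(candidate_pool))), reverse=True)
        kept = [candidate_pool[i] for _, i in ranked[:keep_k]]
    return kept
```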
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
TLDR
A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.
Stress Test Evaluation for Natural Language Inference
TLDR
This work proposes an evaluation methodology consisting of automatically constructed “stress tests” that allow us to examine whether systems have the ability to make real inferential decisions, and reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena.