The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

@inproceedings{Zhou2020TheCO,
  title={The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions},
  author={Xiang Zhou and Yixin Nie and Hao Tan and M. Bansal},
  booktitle={EMNLP},
  year={2020}
}
We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable. This raises three questions: (1) How will the instability affect the reliability of the conclusions drawn based on these analysis sets? (2) Where does this instability come from? (3) How should we handle this instability and what are some potential solutions? For the first question, we conduct a thorough empirical study over…
11 Citations
ANLIzing the Adversarial Natural Language Inference Dataset
Dynabench: Rethinking Benchmarking in NLP
ClusterDataSplit: Exploring Challenging Clustering-Based Data Splits for Model Performance Evaluation
HateCheck: Functional Tests for Hate Speech Detection Models
Underspecification Presents Challenges for Credibility in Modern Machine Learning
ConjNLI: Natural Language Inference Over Conjunctive Sentences
CAT-Gen: Improving Robustness in NLP Models via Controlled Adversarial Text Generation
