The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

@inproceedings{Zhou2020TheCO,
  title={The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions},
  author={Xiang Zhou and Yixin Nie and Hao Tan and Mohit Bansal},
  booktitle={EMNLP},
  year={2020}
}
We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable. This raises three questions: (1) How does this instability affect the reliability of conclusions drawn from these analysis sets? (2) Where does the instability come from? (3) How should we handle it, and what are some potential solutions? For the first question, we conduct a thorough empirical study over…
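The instability the abstract refers to is run-to-run variance: retraining the same model with different random seeds can shift analysis-set accuracy substantially. As a minimal sketch of how such instability can be quantified (the simulated data and helper functions below are illustrative assumptions, not the paper's code), one can score several re-trained runs on the same analysis set and report the mean and spread of accuracy:

import random
import statistics

LABELS = ["entailment", "neutral", "contradiction"]

def accuracy(preds, golds):
    # Fraction of examples where the prediction matches the gold label.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def simulated_run(golds, flip_prob=0.15):
    # Stand-in for one re-trained model: each prediction is corrupted
    # with probability flip_prob, mimicking seed-to-seed variation.
    return [g if random.random() > flip_prob else random.choice(LABELS)
            for g in golds]

# Hypothetical analysis set of 200 NLI examples.
golds = [random.choice(LABELS) for _ in range(200)]

# Evaluate five "re-trained" runs on the same analysis set.
scores = [accuracy(simulated_run(golds), golds) for _ in range(5)]

# The standard deviation across seeds is one simple measure of the
# run-to-run instability the paper studies.
print(f"mean={statistics.mean(scores):.3f}  "
      f"std={statistics.stdev(scores):.3f}  "
      f"range=[{min(scores):.3f}, {max(scores):.3f}]")

In practice one would replace simulated_run with actual predictions from models trained under different seeds; reporting the standard deviation (or the full range) alongside the mean makes the reliability of an analysis-set comparison explicit.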
