Evaluating Models’ Local Decision Boundaries via Contrast Sets
- Matt Gardner, Yoav Artzi, Ben Zhou
- Computer Science · Findings
- 6 April 2020
A more rigorous annotation paradigm for NLP is proposed that helps close systematic gaps in the test data, recommending that dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets.
Evaluating NLP Models via Contrast Sets
- Matt Gardner, Yoav Artzi, Ben Zhou
- Computer Science · ArXiv
- 6 April 2020
A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.
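As a hypothetical illustration of the perturbation style this paradigm recommends (the instances and helper below are invented for illustration and are not drawn from the paper or any of its datasets):

```python
# Hypothetical sketch of a contrast set for a sentiment-classification test instance.
# The original instance and its minimally perturbed variants are invented examples.
original = {"text": "The acting was superb and the plot never dragged.", "label": "positive"}

# Small but meaningful edits, most of which flip the gold label.
contrast_set = [
    {"text": "The acting was superb but the plot constantly dragged.", "label": "negative"},
    {"text": "The acting was wooden and the plot never dragged.", "label": "negative"},
    {"text": "The acting was superb and the plot never once dragged.", "label": "positive"},
]

def contrast_consistency(predict, original, contrast_set):
    """Return 1.0 if the model labels the original and every perturbed
    instance correctly, else 0.0 (a single-set sketch of the consistency idea)."""
    instances = [original] + contrast_set
    return float(all(predict(x["text"]) == x["label"] for x in instances))
```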
ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension
- Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Sameer Singh, Matt Gardner
- Computer Science · ArXiv
- 29 December 2019
An evaluation server, ORB, is presented that reports performance on seven diverse reading comprehension datasets, encouraging and facilitating the testing of a single model's capability in understanding a wide variety of reading phenomena.
Dynamic Sampling Strategies for Multi-Task Reading Comprehension
- Ananth Gottumukkala, Dheeru Dua, Sameer Singh, Matt Gardner
- Computer Science · Annual Meeting of the Association for Computational Linguistics
- 1 July 2020
This work shows that a simple dynamic sampling strategy, selecting instances for training proportional to the multi-task model's current performance on a dataset relative to its single-task performance, gives substantive gains over prior multi-task sampling strategies, mitigating the catastrophic forgetting that is common in multi-task learning.
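A minimal sketch of how such a performance-proportional sampling rule might look, where datasets on which the multi-task model lags furthest behind its single-task reference are sampled most often (function names, the gap-based weighting, and the scores are assumptions for illustration, not the authors' implementation):

```python
import random

def dynamic_sampling_weights(multi_task_scores, single_task_scores):
    """Weight each dataset by how far the multi-task model currently trails its
    single-task reference score (a sketch of the idea, not the paper's exact formula)."""
    gaps = {
        name: max(single_task_scores[name] - multi_task_scores.get(name, 0.0), 1e-6)
        for name in single_task_scores
    }
    total = sum(gaps.values())
    return {name: gap / total for name, gap in gaps.items()}

def sample_dataset(weights):
    """Draw the next training dataset according to the current weights."""
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

# Example with made-up scores: the model lags most on "drop", so "drop" is sampled most often.
weights = dynamic_sampling_weights(
    multi_task_scores={"squad": 88.0, "drop": 55.0, "quoref": 70.0},
    single_task_scores={"squad": 90.0, "drop": 72.0, "quoref": 75.0},
)
print(sample_dataset(weights))
```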
Comprehensive Multi-Dataset Evaluation of Reading Comprehension
- Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Matt Gardner, Sameer Singh
- Computer Science · Conference on Empirical Methods in Natural Language Processing
- 1 November 2019
An evaluation server, ORB, is presented that reports performance on seven diverse reading comprehension datasets, encouraging and facilitating the testing of a single model’s capability in understanding a wide variety of reading phenomena.