Comparing Test Sets with Item Response Theory

Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, Sam Bowman
Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind of datasets should we expect to be able to detect future improvements? To measure this uniformly…

Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards?
A Bayesian leaderboard model is created in which latent subject skill and latent item difficulty predict correct responses; the model can guide what to annotate, identify annotation errors, detect overfitting, and identify informative examples.


Building an Evaluation Scale using Item Response Theory
Item Response Theory (IRT), from psychometrics, is shown to describe characteristics of individual items, namely their difficulty and discriminating power, and to account for these characteristics in its estimation of human intelligence or ability for an NLP task.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
When Do You Need Billions of Words of Pretraining Data?
It is likely that other forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models, and the ability to encode linguistic features is almost certainly necessary for language understanding.
Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling
The first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. The primary results support the use of language modeling, especially when combined with pretraining on additional labeled-data tasks.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.
Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks
QuAIL is presented, the first RC dataset to combine text-based, world knowledge and unanswerable questions, and to provide question type annotation that would enable diagnostics of the reasoning strategies by a given QA system.
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
A new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI.
Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds
This work demonstrates the effectiveness of learning IRT models using DNN-generated data through quantitative and qualitative analyses for two NLP tasks. It also demonstrates a use case for latent difficulty item parameters, namely training set filtering, and shows that using difficulty to sample training data outperforms baseline methods.
Know What You Don’t Know: Unanswerable Questions for SQuAD
SQuADRUn is a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
NewsQA: A Machine Comprehension Dataset
NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs, is presented, and analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment.
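
Several of the summaries above describe Item Response Theory in terms of a latent ability, an item difficulty, and an item's discriminating power. As a rough illustration only, the sketch below assumes the textbook two-parameter logistic (2PL) formulation; none of the summaries above state which IRT variant the listed papers actually fit, and the function names `p_correct` and `item_information` are hypothetical.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """Probability of a correct response under an assumed 2PL model.

    ability:        latent skill of the test taker (or model)
    difficulty:     latent difficulty of the item
    discrimination: how sharply the item separates weak from strong takers
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def item_information(ability, difficulty, discrimination=1.0):
    """Fisher information of a 2PL item at a given ability level.

    Information peaks where ability equals difficulty, so an item that is
    far too easy for strong takers tells us little about them.
    """
    p = p_correct(ability, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)


if __name__ == "__main__":
    # An easy item (difficulty -2) versus a hard one (difficulty +2),
    # evaluated for a strong test taker (ability +2).
    for difficulty in (-2.0, 2.0):
        print(difficulty, p_correct(2.0, difficulty), item_information(2.0, difficulty))
```

Under these assumptions, an item whose difficulty sits far below the ability of the takers being compared contributes little information, which is one way to read the notion of a "saturated" dataset in the abstract.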