Corpus ID: 239015968

Information-Theoretic Measures of Dataset Difficulty

Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. Not only is this framework informal, but it also provides little understanding of how difficult each instance is, or what attributes make it difficult for a given model. To address these problems, we propose an information-theoretic perspective, framing dataset difficulty as the absence of usable information. Measuring… 
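The abstract's framing — difficulty as the absence of usable information — can be made concrete with pointwise V-information (PVI): the gold label's log-probability under a model finetuned on the real input, minus its log-probability under a model finetuned on a null input. The probabilities below are hypothetical placeholders standing in for real model outputs; this is a minimal sketch of the quantity, not the paper's full training pipeline.

```python
import numpy as np

def pvi(p_null_y, p_x_y):
    """Pointwise V-information for one instance, in bits.

    p_null_y: probability the null-input model assigns to the gold label
    p_x_y:    probability the full-input model assigns to the gold label
    """
    return -np.log2(p_null_y) + np.log2(p_x_y)

# Hypothetical per-instance probabilities assigned to the gold label:
p_null = np.array([0.5, 0.5, 0.5])   # null-input model (close to the label prior)
p_full = np.array([0.9, 0.5, 0.25])  # full-input model

pvis = pvi(p_null, p_full)
# Positive PVI: the input makes the gold label easier to predict than the prior;
# negative PVI: the instance is harder than prior-based guessing.
v_info = pvis.mean()  # averaging PVI estimates the dataset's V-usable information
```

Instances with low or negative PVI are the "difficult" ones under this view, and a dataset's overall difficulty falls as its average PVI (the V-usable information) rises.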
Quantifying the Task-Specific Information in Text-Based Classifications
Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues…
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
This work introduces a novel paradigm for dataset creation based on human and machine collaboration, which brings together the generative strength of language models and the evaluative strength of humans in a bid to curate NLP datasets of enhanced quality and diversity.


Learning Whom to Trust with MACE
MACE (Multi-Annotator Competence Estimation) learns in an unsupervised fashion to identify which annotators are trustworthy and predict the correct underlying labels, and shows considerable improvements over standard baselines, both for predicted label accuracy and trustworthiness estimates.
Show Your Work: Improved Reporting of Experimental Results
It is demonstrated that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best, and a novel technique is presented: expected validation performance of the best-found model as a function of computation budget.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
Combining Feature and Instance Attribution to Detect Artifacts
This paper proposes methods to facilitate identification of training data artifacts, using new hybrid approaches that combine saliency maps (which highlight ‘important’ input features) with instance attribution methods (which retrieve training samples ‘influential’ to a given prediction).
Do NLP Models Know Numbers? Probing Numeracy in Embeddings
This work investigates the numerical reasoning capabilities of a state-of-the-art question answering model on the DROP dataset and finds this model excels on questions that require numerical reasoning, i.e., it already captures numeracy.
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
Systematic Error Analysis of the Stanford Question Answering Dataset
The outputs of multiple question answering (QA) models on the Stanford Question Answering Dataset (SQuAD) were analyzed to identify the core challenges for QA systems on this dataset, and challenging aspects were hypothesized through qualitative analysis of common error cases.
Conditional probing: measuring usable information beyond a baseline
This work extends a theory of usable information called V-information and proposes conditional probing, which explicitly conditions on the information in the baseline; after conditioning on non-contextual word embeddings, properties like part-of-speech are found to be accessible at deeper layers of a network than previously thought.
What Makes Reading Comprehension Questions Easier?
This study proposes simple heuristics to split each dataset into easy and hard subsets, examines the performance of two baseline models on each subset, and observes that baseline performance on the hard subsets degrades markedly compared to performance on the entire datasets.
Annotation Artifacts in Natural Language Inference Data
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
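The hypothesis-only probe described above can be sketched as a tiny naive Bayes classifier over hypothesis words, with the premise ignored entirely. This is not the paper's actual model (they use a fastText-style classifier on SNLI/MultiNLI); the training examples here are invented purely to illustrate how a surface cue like negation can drive the prediction.

```python
from collections import Counter, defaultdict
import math

# Invented toy training data: (hypothesis, label) pairs, no premises.
train = [
    ("the man is not outside", "contradiction"),
    ("nobody is sleeping", "contradiction"),
    ("some people are eating", "entailment"),
    ("an animal is moving", "entailment"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in train:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    """Return the label with the highest Laplace-smoothed log-probability."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

Even this toy model picks up "not" as a contradiction cue from the training hypotheses alone, which is the kind of annotation artifact the paper quantifies at scale.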