Corpus ID: 239015968

Information-Theoretic Measures of Dataset Difficulty

Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. Not only is this framework informal, but it also provides little understanding of how difficult each instance is, or what attributes make it difficult for a given model. To address these problems, we propose an information-theoretic perspective, framing dataset difficulty as the absence of usable information. Measuring… 

Quantifying the Task-Specific Information in Text-Based Classifications

Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues

An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs

The effect of different synthetic datasets on language models with various architectures and sizes is studied to show that encoder-decoder models benefit from more data to learn from, whereas sampling strategies that balance across different aspects yield best performance.

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation

This work introduces a novel approach for dataset creation based on worker and AI collaboration, which brings together the generative strength of language models and the evaluative power of humans.

Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in Natural Language Understanding

Curriculum is introduced as a new format of NLI benchmark for evaluation of broad-coverage linguistic phenomena and it is shown that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.



Adversarial Filters of Dataset Biases

This work presents extensive supporting evidence that AFLite is broadly applicable for reduction of measurable dataset biases, and that models trained on the filtered datasets yield better generalization to out-of-distribution tasks.

Learning Whom to Trust with MACE

MACE (Multi-Annotator Competence Estimation) learns in an unsupervised fashion to identify which annotators are trustworthy and predict the correct underlying labels, and shows considerable improvements over standard baselines, both for predicted label accuracy and trustworthiness estimates.

Show Your Work: Improved Reporting of Experimental Results

It is demonstrated that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best, and a novel technique is presented: expected validation performance of the best-found model as a function of computation budget.

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.

Competency Problems: On Finding and Removing Artifacts in Language Data

This work argues that, for complex language understanding tasks, all simple feature correlations are spurious. It formalizes this notion into a class of problems called competency problems and gives a simple statistical test for dataset artifacts that is used to show more subtle biases.

What Do Models Learn from Question Answering Datasets?

No single dataset is found to be robust to all of the authors' experiments, and shortcomings in both datasets and evaluation methods are identified. The authors make recommendations for building future QA datasets that better evaluate the task of question answering.

Do NLP Models Know Numbers? Probing Numeracy in Embeddings

This work investigates the numerical reasoning capabilities of a state-of-the-art question answering model on the DROP dataset and finds this model excels on questions that require numerical reasoning, i.e., it already captures numeracy.

Conditional probing: measuring usable information beyond a baseline

This work extends a theory of usable information called V-information and proposes conditional probing, which explicitly conditions on the information in the baseline. It finds that, after conditioning on non-contextual word embeddings, properties like part-of-speech are accessible at deeper layers of a network than previously thought.

Combining Feature and Instance Attribution to Detect Artifacts

This paper proposes new hybrid approaches that combine saliency maps (which highlight important input features) with instance attribution methods (which retrieve training samples influential to a given prediction) and shows that this proposed training-feature attribution can be used to efficiently uncover artifacts in training data when a challenging validation set is available.

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.