Corpus ID: 239015968

Information-Theoretic Measures of Dataset Difficulty

Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. Not only is this framework informal, but it also provides little understanding of how difficult each instance is, or what attributes make it difficult for a given model. To address these problems, we propose an information-theoretic perspective, framing dataset difficulty as the absence of usable information. Measuring… 
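The paper's central quantity, pointwise V-usable information (PVI), can be sketched directly from its definition. In the sketch below the function names are mine; the two probabilities are assumed to come from a model fine-tuned with the input (g'[x]) and a model fine-tuned on a null input (g'[∅]):

```python
import math

def pvi(p_with_input, p_null):
    """Pointwise V-usable information for one instance:
    PVI(x -> y) = -log2 g'[null](y) + log2 g'[x](y).
    Lower (or negative) PVI marks a harder instance."""
    return -math.log2(p_null) + math.log2(p_with_input)

def v_information(pvis):
    """Dataset-level difficulty estimate: the mean PVI approximates
    I_V(X -> Y) = H_V(Y | null) - H_V(Y | X)."""
    return sum(pvis) / len(pvis)
```

Averaging PVI over a dataset recovers the dataset-level V-information, so per-instance and aggregate difficulty fall out of the same computation.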

Quantifying the Task-Specific Information in Text-Based Classifications

Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues

An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs

The effect of different synthetic datasets on language models of various architectures and sizes is studied, showing that encoder-decoder models benefit from more data to learn from, whereas sampling strategies that balance across different aspects yield the best performance.

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation

This work introduces a novel approach to dataset creation based on worker-and-AI collaboration, bringing together the generative strength of language models and the evaluation strength of humans, and demonstrates the promise of leveraging natural language generation techniques and re-imagining the role of humans in the dataset creation process.

Noise Audits Improve Moral Foundation Classification

This work proposes two metrics to audit the noise of annotations and shows that removing noisy annotations based on the proposed metrics improves classification performance.

Balanced Audiovisual Dataset for Imbalance Analysis

This work first splits existing datasets into subsets by estimating sample-wise modality discrepancy, and surprisingly finds that multimodal models equipped with existing imbalance algorithms consistently perform worse than unimodal models on specific subsets, in accordance with the modality bias.

Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in Natural Language Understanding

Curriculum is introduced as a new format of NLI benchmark for evaluating broad-coverage linguistic phenomena, and it is shown that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.

Systematic Evaluation of Automotive Intrusion Detection Datasets

This work investigates different characteristics of datasets for security applications and proposes a number of qualitative and quantitative metrics which can be evaluated with limited domain knowledge and demonstrates how the proposed metrics can be used to learn the strengths and weaknesses in these datasets.

Adversarial Filters of Dataset Biases

This work presents extensive supporting evidence that AFLite is broadly applicable for reduction of measurable dataset biases, and that models trained on the filtered datasets yield better generalization to out-of-distribution tasks.
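As a rough illustration of the adversarial-filtering idea behind AFLite, the sketch below assigns each instance a predictability score by repeatedly training a weak model on random splits and checking how often the instance is classified correctly when held out. The paper uses linear probes over pre-computed embeddings; this toy version substitutes a nearest-centroid classifier, and all names are hypothetical:

```python
import random

def predictability(X, y, n_iters=100, train_frac=0.7, seed=0):
    """AFLite-style predictability scores: instances that a weak model
    gets right across many random splits are likely solvable from
    spurious surface features, and are candidates for filtering."""
    rng = random.Random(seed)
    n = len(X)
    correct, seen = [0] * n, [0] * n
    idx = list(range(n))
    for _ in range(n_iters):
        rng.shuffle(idx)
        cut = int(train_frac * n)
        train, held = idx[:cut], idx[cut:]
        # class centroids from the training split
        sums, counts = {}, {}
        for i in train:
            c = y[i]
            counts[c] = counts.get(c, 0) + 1
            s = sums.setdefault(c, [0.0] * len(X[i]))
            for j, v in enumerate(X[i]):
                s[j] += v
        cents = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        # nearest-centroid prediction on held-out instances
        for i in held:
            pred = min(cents, key=lambda c: sum((a - b) ** 2
                       for a, b in zip(X[i], cents[c])))
            correct[i] += int(pred == y[i])
            seen[i] += 1
    return [c / s if s else 0.0 for c, s in zip(correct, seen)]
```

Filtering then amounts to iteratively removing the highest-scoring instances and re-estimating, which is the adversarial loop the paper evaluates.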

Show Your Work: Improved Reporting of Experimental Results

It is demonstrated that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best, and a novel technique is presented: expected validation performance of the best-found model as a function of computation budget.
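The expected-validation-performance curve can be computed in closed form from the empirical distribution of observed validation scores, treating each budget n as drawing n i.i.d. hyperparameter assignments and taking the best. A minimal sketch (function name mine) of the expectation-of-the-maximum formula:

```python
def expected_max_performance(scores, n):
    """Expected validation performance of the best of n random
    hyperparameter draws: with F the empirical CDF of the observed
    scores v_1 <= ... <= v_m,
    E[max of n draws] = sum_i v_i * (F(v_i)^n - F(v_{i-1})^n)."""
    vs = sorted(scores)
    m = len(vs)
    total, prev_cdf = 0.0, 0.0
    for i, v in enumerate(vs, start=1):
        cdf = i / m
        total += v * (cdf ** n - prev_cdf ** n)
        prev_cdf = cdf
    return total
```

Plotting this quantity against n gives the budget-aware comparison curve the paper recommends reporting instead of a single test-set number.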

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
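Two of the three loss terms can be sketched in miniature. This is a schematic illustration only (names are mine, and real implementations operate on batched tensors of logits and hidden states): the distillation term is a cross-entropy against the teacher's temperature-softened distribution, and the cosine term aligns student and teacher hidden states.

```python
import math

def softmax(logits, T=1.0):
    # temperature-softened distribution over classes (numerically stable)
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    # cross-entropy of the student's softened distribution
    # against the teacher's soft targets
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))

def cosine_loss(h_student, h_teacher):
    # 1 - cosine similarity between hidden-state vectors,
    # pulling the student's representations toward the teacher's
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    ns = math.sqrt(sum(a * a for a in h_student))
    nt = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (ns * nt)
```

In DistilBERT the total objective additionally includes the masked-language-modeling cross-entropy, with the three terms combined as a weighted sum.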

Competency Problems: On Finding and Removing Artifacts in Language Data

This work argues that for complex language understanding tasks, all simple feature correlations are spurious; it formalizes this notion into a class of problems called competency problems and gives a simple statistical test for dataset artifacts that is used to reveal more subtle biases.
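In spirit, the statistical test asks whether a single feature (e.g. a token) co-occurs with a label more often than chance would allow under the null that the feature carries no information. A toy one-proportion z-statistic along those lines; the names and exact form are my illustration, not the paper's precise procedure:

```python
import math

def artifact_z(n_with_label, n_total, n_labels=2):
    """z-statistic for whether a feature co-occurs with a given label
    more often than the chance rate 1/n_labels. Large |z| over many
    features flags candidate dataset artifacts."""
    p0 = 1.0 / n_labels
    p_hat = n_with_label / n_total
    se = math.sqrt(p0 * (1 - p0) / n_total)
    return (p_hat - p0) / se
```

Running such a test per vocabulary item (with multiple-comparison correction) is one way spurious feature-label correlations surface at dataset scale.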

What Do Models Learn from Question Answering Datasets?

It is found that no single dataset is robust to all of the experiments performed; shortcomings in both datasets and evaluation methods are identified, and recommendations are made for building future QA datasets that better evaluate the task of question answering.

“Why Should I Trust You?”: Explaining the Predictions of Any Classifier

LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
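A minimal LIME-style sketch for a binary-interpretable feature vector: sample local perturbations that drop random features, weight them by proximity to the original input, and fit a weighted linear surrogate whose coefficients serve as local feature importances. Names, the kernel, and the gradient-descent solver are simplifications of the library's actual implementation:

```python
import math
import random

def lime_explain(predict, x, n_samples=500, kernel_width=0.75, seed=0):
    """Local linear surrogate around x for a black-box predict()."""
    rng = random.Random(seed)
    d = len(x)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in range(d)]     # which features survive
        z = [xi * mi for xi, mi in zip(x, mask)]
        dist = 1.0 - sum(mask) / d                       # fraction removed
        w = math.exp(-(dist ** 2) / kernel_width ** 2)   # proximity kernel
        rows.append(mask)
        targets.append(predict(z))
        weights.append(w)
    # weighted least squares via gradient descent (dependency-free)
    coef, bias, lr = [0.0] * d, 0.0, 0.3
    total_w = sum(weights)
    for _ in range(1000):
        gb, gc = 0.0, [0.0] * d
        for m, t, w in zip(rows, targets, weights):
            err = bias + sum(c * mi for c, mi in zip(coef, m)) - t
            gb += w * err
            for j in range(d):
                gc[j] += w * err * m[j]
        bias -= lr * gb / total_w
        for j in range(d):
            coef[j] -= lr * gc[j] / total_w
    return coef
```

For a model that depends only on one feature, the surrogate's coefficient for that feature dominates, which is exactly the "faithful local explanation" the method promises.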

Do NLP Models Know Numbers? Probing Numeracy in Embeddings

This work investigates the numerical reasoning capabilities of a state-of-the-art question answering model on the DROP dataset and finds this model excels on questions that require numerical reasoning, i.e., it already captures numeracy.

Conditional probing: measuring usable information beyond a baseline

This work extends a theory of usable information called V-information and proposes conditional probing, which explicitly conditions on the information in the baseline; it finds that, after conditioning on non-contextual word embeddings, properties like part-of-speech are accessible at deeper layers of a network than previously thought.

Combining Feature and Instance Attribution to Detect Artifacts

This paper proposes new hybrid approaches that combine saliency maps (which highlight important input features) with instance attribution methods (which retrieve training samples influential to a given prediction) and shows that this proposed training-feature attribution can be used to efficiently uncover artifacts in training data when a challenging validation set is available.

Systematic Error Analysis of the Stanford Question Answering Dataset

The outputs of multiple question answering (QA) models applied to the Stanford Question Answering Dataset (SQuAD) were analyzed to identify the core challenges for QA systems on this dataset, and challenging aspects were hypothesized through qualitative analysis of the common error cases.