Information-Theoretic Measures of Dataset Difficulty
@article{Ethayarajh2021InformationTheoreticMO,
  title   = {Information-Theoretic Measures of Dataset Difficulty},
  author  = {Kawin Ethayarajh and Yejin Choi and Swabha Swayamdipta},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2110.08420}
}
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. Not only is this framework informal, but it also provides little understanding of how difficult each instance is, or what attributes make it difficult for a given model. To address these problems, we propose an information-theoretic perspective, framing dataset difficulty as the absence of usable information. Measuring…
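The paper's core measure is pointwise V-information (PVI): PVI(x → y) = −log₂ g′[∅](y) + log₂ g[x](y), where g is a model fine-tuned on real (input, label) pairs and g′ is the same architecture fine-tuned on (null input, label) pairs. Below is a minimal sketch of how one might compute it from those two models' predicted probabilities; the function name and the example numbers are illustrative, not from the paper:

```python
import math

def pointwise_v_information(p_with_input: float, p_null_input: float) -> float:
    """Pointwise V-information (PVI) for a single instance:
    PVI(x -> y) = -log2 g'[null](y) + log2 g[x](y),
    where g is fine-tuned on (x, y) pairs and g' on (null, y) pairs.

    p_with_input : probability g assigns to the gold label y given x
    p_null_input : probability g' assigns to y given the null input
    """
    return math.log2(p_with_input) - math.log2(p_null_input)

# Hypothetical probabilities for illustration only: an easy instance
# (the model is far more confident with the input than without it)
# has high PVI; a hard or mislabeled one has PVI near or below zero.
easy = pointwise_v_information(p_with_input=0.95, p_null_input=0.40)
hard = pointwise_v_information(p_with_input=0.30, p_null_input=0.40)
print(f"easy PVI: {easy:.2f} bits")   # ~1.25 bits
print(f"hard PVI: {hard:.2f} bits")   # ~-0.42 bits

# The dataset-level V-usable information is then estimated as the mean
# PVI over held-out instances; lower means a harder dataset for V.
```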
8 Citations
Quantifying the Task-Specific Information in Text-Based Classifications
- Computer Science · ArXiv
- 2021
Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues…
An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs
- Computer Science · ArXiv
- 2022
The effect of different synthetic datasets on language models with various architectures and sizes is studied, showing that encoder-decoder models benefit from more data to learn from, whereas sampling strategies that balance across different aspects yield the best performance.
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
- Computer Science · EMNLP
- 2022
This work introduces a novel approach to dataset creation based on worker and AI collaboration, bringing together the generative strength of language models and the evaluative strength of humans; it demonstrates the promise of leveraging natural language generation techniques and of re-imagining the role of humans in the dataset creation process.
Noise Audits Improve Moral Foundation Classification
- Computer Science · 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
- 2022
This work proposes two metrics to audit the noise of annotations and shows that removing noisy annotations based on the proposed metrics improves classification performance.
Balanced Audiovisual Dataset for Imbalance Analysis
- Computer Science · ArXiv
- 2023
This work first splits existing datasets into different subsets by estimating sample-wise modality discrepancy, and surprisingly finds that multimodal models with existing imbalance algorithms consistently perform worse than the unimodal one on specific subsets, in accordance with the modality bias.
Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in Natural Language Understanding
- Computer Science · NAACL
- 2022
Curriculum is introduced as a new NLI benchmark format for evaluating broad-coverage linguistic phenomena, and it is shown that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.
Systematic Evaluation of Automotive Intrusion Detection Datasets
- Computer Science · CSCS
- 2022
This work investigates different characteristics of datasets for security applications, proposes a number of qualitative and quantitative metrics that can be evaluated with limited domain knowledge, and demonstrates how the proposed metrics can be used to learn the strengths and weaknesses of these datasets.
References
Showing 1–10 of 45 references
Adversarial Filters of Dataset Biases
- Computer Science · ICML
- 2020
This work presents extensive supporting evidence that AFLite is broadly applicable for reduction of measurable dataset biases, and that models trained on the filtered datasets yield better generalization to out-of-distribution tasks.
Show Your Work: Improved Reporting of Experimental Results
- Computer Science · EMNLP
- 2019
It is demonstrated that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best, and a novel technique is presented: expected validation performance of the best-found model as a function of computation budget.
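The expected-validation-performance curve this summary mentions can be estimated from the empirical CDF of observed validation scores, since the maximum of n i.i.d. draws has CDF F(v)ⁿ. A minimal sketch under that assumption; the variable names and sample scores are illustrative, not from the paper:

```python
import numpy as np

def expected_max_validation(scores: np.ndarray, n: int) -> float:
    """Expected maximum validation score over n random hyperparameter
    draws, estimated from N observed scores via the empirical CDF:
    P(max of n draws <= v) = F(v)**n."""
    v = np.sort(scores)
    N = len(v)
    cdf = np.arange(1, N + 1) / N          # empirical CDF at each sorted score
    prob_max_leq = cdf ** n                # CDF of the max of n i.i.d. draws
    pmf = np.diff(np.concatenate(([0.0], prob_max_leq)))  # mass per score
    return float(np.sum(v * pmf))

# Illustrative scores from, say, 20 random hyperparameter assignments:
rng = np.random.default_rng(0)
scores = rng.uniform(0.70, 0.85, size=20)
for n in (1, 5, 20):
    # the curve rises with budget n, which is the point of the reporting
    print(n, round(expected_max_validation(scores, n), 4))
```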
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Computer Science · ArXiv
- 2019
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
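A rough sketch of that triple loss as one might write it in PyTorch; the equal loss weights, temperature, and tensor shapes (token-level hidden states flattened to 2-D) are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def distil_triple_loss(student_logits, teacher_logits,
                       student_hidden, teacher_hidden,
                       labels, temperature=2.0):
    """Sketch of a three-part distillation objective: masked-LM
    cross-entropy, soft-target distillation, and a cosine loss
    aligning hidden states. Hidden states assumed shape (N, dim)."""
    # (1) standard masked language modeling loss on the gold tokens
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    # (2) distillation: KL between temperature-softened distributions
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2
    # (3) cosine loss pulling student hidden states toward the teacher's
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)
    return ce + kd + cos
```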
Competency Problems: On Finding and Removing Artifacts in Language Data
- Computer Science · EMNLP
- 2021
This work argues that for complex language understanding tasks, all simple feature correlations are spurious, formalizes this notion into a class of problems called competency problems, and gives a simple statistical test for dataset artifacts that is used to reveal more subtle biases.
What Do Models Learn from Question Answering Datasets?
- Computer Science · EMNLP
- 2020
It is found that no single dataset is robust to all of the authors' experiments; shortcomings in both datasets and evaluation methods are identified, and recommendations are made for building future QA datasets that better evaluate the task of question answering.
“Why Should I Trust You?”: Explaining the Predictions of Any Classifier
- Computer Science · NAACL
- 2016
LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction.
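A toy sketch of that local-surrogate idea, not the LIME library's actual API: sample perturbations around the instance, weight them by proximity, and fit a weighted linear model whose coefficients act as feature attributions. All names and the kernel choice here are illustrative:

```python
import numpy as np

def lime_like_explanation(predict_proba, x, n_samples=500, kernel_width=0.75):
    """Explain one prediction of a black-box classifier for instance x.
    predict_proba maps a (n, d) array to (n,) probabilities of the
    class being explained; returns one attribution per feature."""
    rng = np.random.default_rng(0)
    # binary masks: which features of x are kept in each perturbation
    masks = rng.integers(0, 2, size=(n_samples, x.shape[0]))
    perturbed = masks * x                      # zero out dropped features
    preds = predict_proba(perturbed)           # black-box probabilities
    # proximity kernel on the fraction of features dropped
    dist = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # weighted least squares via the sqrt-weight trick
    sw = np.sqrt(weights)[:, None]
    X = np.hstack([masks, np.ones((n_samples, 1))])  # intercept column
    coef, *_ = np.linalg.lstsq(X * sw, preds[:, None] * sw, rcond=None)
    return coef[:-1].ravel()                   # per-feature attributions
```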
Do NLP Models Know Numbers? Probing Numeracy in Embeddings
- Computer Science · EMNLP
- 2019
This work investigates the numerical reasoning capabilities of a state-of-the-art question answering model on the DROP dataset and finds this model excels on questions that require numerical reasoning, i.e., it already captures numeracy.
Conditional probing: measuring usable information beyond a baseline
- Computer Science, Psychology · EMNLP
- 2021
This work extends a theory of usable information called V-information and proposes conditional probing, which explicitly conditions on the information in the baseline; it finds that after conditioning on non-contextual word embeddings, properties like part-of-speech are accessible at deeper layers of a network than previously thought.
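As a hedged sketch of the quantity involved, assuming the usual V-information notation (I_V(X → Y) = H_V(Y) − H_V(Y | X), with H_V a predictive V-entropy): conditional probing compares the V-entropies with and without the probed representation X, both conditioned on the baseline B:

```latex
% Conditional V-information with respect to a baseline B (assumed notation,
% not quoted from the paper):
I_{\mathcal{V}}(X \to Y \mid B) = H_{\mathcal{V}}(Y \mid B) - H_{\mathcal{V}}(Y \mid [B, X])
```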
Combining Feature and Instance Attribution to Detect Artifacts
- Computer Science · FINDINGS
- 2022
This paper proposes new hybrid approaches that combine saliency maps (which highlight important input features) with instance attribution methods (which retrieve training samples influential to a given prediction), and shows that the proposed training-feature attribution can be used to efficiently uncover artifacts in training data when a challenging validation set is available.
Systematic Error Analysis of the Stanford Question Answering Dataset
- Computer Science · QA@ACL
- 2018
The outputs of multiple question answering (QA) models applied to the Stanford Question Answering Dataset (SQuAD) were analyzed to identify the core challenges for QA systems on this dataset, and challenging aspects were hypothesized through qualitative analysis of the common error cases.