Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi. Conference on Empirical Methods in Natural Language Processing.
Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps---a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example---the model's confidence in the true… 
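The two Data Maps measures can be sketched directly from per-epoch probabilities of the true label: confidence is the mean and variability the standard deviation across epochs. The region thresholds below are illustrative assumptions, not values from the paper:

```python
from statistics import mean, pstdev

def data_map_coords(true_label_probs):
    """Given p(true label) recorded at each training epoch for one
    example, return (confidence, variability) as in Data Maps."""
    return mean(true_label_probs), pstdev(true_label_probs)

def region(confidence, variability, conf_hi=0.75, var_hi=0.2):
    """Map an example to a rough data-map region.

    Thresholds are hypothetical, for illustration only."""
    if variability >= var_hi:
        return "ambiguous"
    return "easy-to-learn" if confidence >= conf_hi else "hard-to-learn"
```

For instance, an example the model predicts correctly with high probability at every epoch lands in the easy-to-learn region, while one whose probability oscillates across epochs is ambiguous.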

Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

This work provides a unified and efficient framework for Metadata Archaeology, the task of uncovering and inferring metadata of examples in a dataset, and shows that the approach is on par with far more sophisticated mitigation methods across different tasks.

Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees

This paper proposes a method that uses metrics computed from the training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example, and shows results on detecting noisy labels and improving model metrics on synthetic, real, and production datasets.

Training Dynamic based data filtering may not work for NLP datasets

This paper studies the applicability of the Area Under the Margin (AUM) metric for identifying and removing or rectifying mislabeled examples in NLP datasets, and shows that models rely on distributional information rather than on syntactic and semantic representations.

MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts

This work presents MetaShift, a collection of 12,868 sets of natural images across 410 classes that contains orders of magnitude more natural data shifts than previously available, and proposes methods to construct binary or multiclass classification tasks for evaluating model robustness across diverse distribution shifts.

A Data Cartography based MixUp for Pre-trained Language Models

This work proposes TDMixUp, a novel MixUp strategy that leverages training dynamics to combine more informative samples when generating new data. The method not only achieves competitive performance using a smaller subset of the training data compared with strong baselines, but also yields lower expected calibration error on the pre-trained language model BERT.
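TDMixUp's sample-selection step is not reproduced here, but the underlying MixUp operation is standard: convexly interpolate two examples and their one-hot labels with a coefficient lam (typically drawn from a Beta(alpha, alpha) distribution). A minimal sketch:

```python
def mixup(x1, y1, x2, y2, lam):
    """Standard MixUp: interpolate two feature vectors and their
    one-hot label vectors with coefficient lam in [0, 1]."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

With lam = 0.5, two examples from different classes yield a midpoint feature vector with a soft 50/50 label.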

Assessing the quality of the datasets by identifying mislabeled samples

This paper proposes a novel statistic, the noise score, that measures the quality of each data point and identifies mislabeled samples based on variations in the latent-space representation derived by the inference network of AQUAVS, a data-quality-supervised variational autoencoder.

Fix your Models by Fixing your Datasets

This work introduces a systematic framework for finding noisy or mislabelled samples in the dataset and identifying the most informative samples, which when included in training would provide maximal model performance lift.

Evaluating and Crafting Datasets Effective for Deep Learning With Data Maps

This work proposes a method for curating smaller datasets that retain comparable out-of-distribution model accuracy: after an initial training session, samples are drawn according to an appropriate distribution over classes of examples, grouped by how important each is for the model to learn from.

Understanding Out-of-distribution: A Perspective of Data Dynamics

Despite machine learning models’ success in Natural Language Processing (NLP) tasks, predictions from these models frequently fail on out-of-distribution (OOD) samples. Prior works have focused on

Identifying Mislabeled Data using the Area Under the Margin Ranking

A new method to identify overly ambiguous or outrightly mislabeled samples and mitigate their impact when training neural networks is introduced, at the heart of which is the Area Under the Margin (AUM) statistic.
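The AUM statistic averages, over training epochs, the margin between the logit of the assigned label and the largest other logit; consistently negative margins suggest a mislabeled example. A minimal sketch of that computation:

```python
def area_under_margin(logits_per_epoch, assigned_label):
    """AUM: mean over epochs of (logit of the assigned label minus
    the largest logit among the other classes). Mislabeled examples
    tend to have low or negative AUM."""
    margins = []
    for logits in logits_per_epoch:
        assigned = logits[assigned_label]
        other = max(v for i, v in enumerate(logits) if i != assigned_label)
        margins.append(assigned - other)
    return sum(margins) / len(margins)
```

An example whose assigned-label logit stays below a competing class's logit throughout training gets a negative AUM, flagging it for inspection.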

Training Region-Based Object Detectors with Online Hard Example Mining

OHEM is a simple and intuitive algorithm that eliminates several heuristics and hyperparameters in common use and leads to consistent and significant boosts in detection performance on benchmarks like PASCAL VOC 2007 and 2012.
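Reduced to its core idea, online hard example mining ranks a batch's examples by loss and backpropagates only through the top-k hardest; this sketch abstracts away the region-proposal machinery of the detector:

```python
def hard_example_indices(losses, k):
    """Core of OHEM: return the (sorted) indices of the k
    highest-loss examples in a batch; only these contribute
    to the backward pass."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:k])
```

In a detector, `losses` would be per-RoI losses from the forward pass, and the selected RoIs alone are fed to the backward pass.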

REPAIR: Removing Representation Bias by Dataset Resampling

Yi Li, N. Vasconcelos. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Experiments with synthetic and action-recognition data show that REPAIR can significantly reduce representation bias and lead to improved generalization of models trained on REPAIRed datasets.

Understanding and Utilizing Deep Neural Networks Trained with Noisy Labels

This paper finds that test accuracy can be quantitatively characterized in terms of the noise ratio in datasets, and adopts the Co-teaching strategy, which takes full advantage of the identified samples to train DNNs robustly against noisy labels.

Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift

A large-scale benchmark of existing state-of-the-art methods on classification problems and the effect of dataset shift on accuracy and calibration is presented, finding that traditional post-hoc calibration does indeed fall short, as do several other previous methods.
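The calibration metric at issue in such benchmarks is typically expected calibration error (ECE): predictions are binned by confidence, and the gap between each bin's mean confidence and its accuracy is averaged, weighted by bin size. A minimal sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the size-weighted
    average of |mean confidence - accuracy| over non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that predicts with 95% confidence but is right only half the time gets an ECE near 0.45, the kind of gap that grows under dataset shift.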

Adversarial Filters of Dataset Biases

This work presents extensive supporting evidence that AFLite is broadly applicable for reduction of measurable dataset biases, and that models trained on the filtered datasets yield better generalization to out-of-distribution tasks.

Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

This work investigates how the performance of the best-found model varies as a function of the number of fine-tuning trials, and examines two factors influenced by the choice of random seed: weight initialization and training data order.

Simple and Effective Regularization Methods for Training on Noisily Labeled Data with Generalization Guarantee

This paper proposes and analyzes two simple and intuitive regularization methods and proves that gradient descent training with either of these two methods leads to a generalization guarantee on the clean data distribution despite being trained using noisy labels.

Get another label? improving data quality and data mining using multiple, noisy labelers

The results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.

Annotation Artifacts in Natural Language Inference Data

It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.