Assessing the quality of the datasets by identifying mislabeled samples

Vaibhav Pulastya, Gaurav Nuti, Yash Kumar Atri, Tanmoy Chakraborty
Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
Due to the overemphasis on the quantity of data, data quality has often been overlooked. However, not all training data points contribute equally to learning. In particular, a mislabeled data point might actively damage the model's performance and its ability to generalize out of distribution, as the model might end up learning spurious artifacts present in the dataset. This problem is compounded by the prevalence of heavily parameterized and complex deep neural networks, which can, with… 

Learning from Noisy Labels with Deep Neural Networks: A Survey
A comprehensive review of 62 state-of-the-art robust training methods, all of which are categorized into five groups according to their methodological differences, followed by a systematic comparison of six properties used to evaluate their superiority.
Understanding and Utilizing Deep Neural Networks Trained with Noisy Labels
This paper finds that the test accuracy can be quantitatively characterized in terms of the noise ratio in datasets, and adopts the Co-teaching strategy which takes full advantage of the identified samples to train DNNs robustly against noisy labels.
Training Deep Neural Networks on Noisy Labels with Bootstrapping
A generic way to handle noisy and incomplete labeling by augmenting the prediction objective with a notion of consistency is proposed, which considers a prediction consistent if the same prediction is made given similar percepts, where the notion of similarity is between deep network features computed from the input data.
Training deep neural-networks using a noise adaptation layer
This study presents a neural-network approach that optimizes the same likelihood function as optimized by the EM algorithm but extended to the case where the noisy labels are dependent on the features in addition to the correct labels.
Identifying Mislabeled Instances in Classification Datasets
This paper presents a non-parametric end-to-end pipeline to find mislabeled instances in numerical, image and natural language datasets. It evaluates the system quantitatively by adding a small amount of label noise to 29 datasets, and shows that the mislabeled instances can be found with an average precision of more than 0.84.
Training Convolutional Networks with Noisy Labels
An extra noise layer is introduced into the network, which adapts the network outputs to match the noisy label distribution. Its parameters can be estimated as part of the training process and involve simple modifications to current training infrastructures for deep networks.
Dimensionality-Driven Learning with Noisy Labels
This work proposes a new perspective for understanding DNN generalization for such datasets, by investigating the dimensionality of the deep representation subspace of training samples, and develops a new dimensionality-driven learning strategy that can effectively learn low-dimensional local subspaces that capture the data distribution.
A two-stage ensemble method for the detection of class-label noise
Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
A model-based tool to characterize and diagnose datasets is presented, and the results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
Making Deep Neural Networks Robust to Label Noise: A Loss Correction Approach
It is proved that, when ReLU is the only non-linearity, the loss curvature is immune to class-dependent label noise. It is also shown how one can estimate the probabilities of each class being corrupted into another, adapting a recent technique for noise estimation to the multi-class setting and providing an end-to-end framework.
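
As an illustration of the loss-correction idea in the last entry above, the following is a minimal sketch of one common variant, usually called forward correction. It assumes the class-to-class corruption probabilities are already known and stored in a matrix `T`, where `T[i][j]` is the probability that true class `i` is observed as label `j`. The function names and the example matrix are hypothetical and are not taken from any of the listed papers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_corrected_nll(logits, noisy_label, T):
    """Negative log-likelihood of the observed (possibly noisy) label.

    T[i][j] is the assumed probability that true class i is recorded
    as label j. The model's clean-class probabilities are pushed through
    T before the loss is taken, so the model is not penalized for
    disagreeing with a label that is likely to have been flipped.
    """
    clean = softmax(logits)
    # P(observed = j) = sum_i P(true = i) * T[i][j]
    noisy_prob = sum(clean[i] * T[i][noisy_label] for i in range(len(clean)))
    return -math.log(noisy_prob)


# Hypothetical two-class setup: class 0 is flipped to 1 with probability 0.2.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
T_EXAMPLE = [[0.8, 0.2], [0.0, 1.0]]

logits = [10.0, 0.0]  # model is confident the true class is 0
plain = forward_corrected_nll(logits, 1, IDENTITY)       # ordinary cross-entropy
corrected = forward_corrected_nll(logits, 1, T_EXAMPLE)  # noise-aware loss
```

With the identity matrix the function reduces to ordinary cross-entropy. With `T_EXAMPLE`, the loss for a confident class-0 prediction paired with an observed label of 1 is much smaller, because that label is plausibly a corrupted class 0.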