Assessing the quality of the datasets by identifying mislabeled samples

@inproceedings{Pulastya2021AssessingTQ,
  title={Assessing the quality of the datasets by identifying mislabeled samples},
  author={Vaibhav Pulastya and Gaurav Nuti and Yash Kumar Atri and Tanmoy Chakraborty},
  booktitle={Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining},
  year={2021}
}
Abstract

Due to the over-emphasis on the quantity of data, data quality has often been overlooked. However, not all training data points contribute equally to learning. In particular, a mislabeled data point can actively damage the model's performance and its ability to generalize out of distribution, as the model may end up learning spurious artifacts present in the dataset. This problem is compounded by the prevalence of heavily parameterized and complex deep neural networks, which can, with…

Citations

Do-AIQ: A Design-of-Experiment Approach to Quality Evaluation of AI Mislabel Detection Algorithm

TLDR
This work presents a principled framework, named Do-AIQ, that uses a design-of-experiments approach to systematically evaluate the quality of AI algorithms, and can serve as an exemplar for enhancing the AI assurance of robustness, reproducibility, and transparency.

References

Showing 1-10 of 38 references

Learning from Noisy Labels with Distillation

TLDR
This work proposes a unified distillation framework that uses “side” information, including a small clean dataset and label relations in a knowledge graph, to “hedge the risk” of learning from noisy labels, and introduces a suite of new benchmark datasets to evaluate this task in the Sports, Species and Artifacts domains.

Identifying Mislabeled Data using the Area Under the Margin Ranking

TLDR
A new method is introduced to identify overly ambiguous or outright mislabeled samples and mitigate their impact when training neural networks; at its heart is the Area Under the Margin (AUM) statistic.
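
The per-epoch margin that AUM averages is simple to track during training. Below is a minimal sketch, not the paper's reference implementation, assuming a PyTorch training loop in which each batch provides stable per-example ids alongside logits and assigned labels, and assuming pre-allocated aum_sums / aum_counts buffers (hypothetical names):

import torch

def update_aum(aum_sums, aum_counts, sample_ids, logits, labels):
    """Accumulate each sample's margin: assigned-class logit minus the
    largest logit among the remaining classes."""
    logits = logits.detach()
    assigned = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
    masked = logits.clone()
    masked.scatter_(1, labels.unsqueeze(1), float("-inf"))  # hide the assigned class
    largest_other = masked.max(dim=1).values
    aum_sums[sample_ids] += assigned - largest_other
    aum_counts[sample_ids] += 1

# After training, aum_sums / aum_counts is each sample's AUM; samples with the
# lowest (often negative) AUM are the candidates flagged as mislabeled.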

Learning to Learn From Noisy Labeled Data

TLDR
This work proposes a noise-tolerant training algorithm in which a meta-learning update is performed prior to the conventional gradient update; the model is trained such that, after one gradient update using each set of synthetic noisy labels, it does not overfit to the specific noise.

Learning from Noisy Labels with Deep Neural Networks: A Survey

TLDR
A comprehensive review of 62 state-of-the-art robust training methods, all of which are categorized into five groups according to their methodological differences, followed by a systematic comparison of six properties used to evaluate their superiority.

Understanding and Utilizing Deep Neural Networks Trained with Noisy Labels

TLDR
This paper finds that test accuracy can be quantitatively characterized in terms of the noise ratio in the dataset, and adopts the Co-teaching strategy, which takes full advantage of the identified samples to train DNNs robustly against noisy labels.
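
For context, the core of the Co-teaching strategy is a small-loss selection step in which two networks pick training samples for each other. A minimal sketch of that generic step (not necessarily this paper's exact variant) is given below, assuming two PyTorch classifiers with their optimizers and a caller-supplied keep_ratio (hypothetical parameter name):

import torch
import torch.nn.functional as F

def co_teaching_step(model_a, model_b, opt_a, opt_b, x, noisy_labels, keep_ratio):
    """One Co-teaching update: each network ranks the batch by its own
    per-sample loss, and its small-loss (clean-looking) picks train the peer."""
    num_keep = max(1, int(keep_ratio * x.size(0)))

    loss_a = F.cross_entropy(model_a(x), noisy_labels, reduction="none")
    loss_b = F.cross_entropy(model_b(x), noisy_labels, reduction="none")
    idx_from_a = torch.argsort(loss_a)[:num_keep]  # A's picks, used to train B
    idx_from_b = torch.argsort(loss_b)[:num_keep]  # B's picks, used to train A

    opt_a.zero_grad()
    opt_b.zero_grad()
    update_a = loss_a[idx_from_b].mean()  # A learns from B's selection
    update_b = loss_b[idx_from_a].mean()  # B learns from A's selection
    (update_a + update_b).backward()
    opt_a.step()
    opt_b.step()

# In practice, keep_ratio is typically annealed over epochs based on an
# estimate of the noise rate.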

Training Deep Neural Networks on Noisy Labels with Bootstrapping

TLDR
A generic way to handle noisy and incomplete labeling is proposed by augmenting the prediction objective with a notion of consistency: a prediction is considered consistent if the same prediction is made given similar percepts, where similarity is measured between deep network features computed from the input data.
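
The "soft" bootstrapping objective in this line of work blends the given (possibly noisy) one-hot label with the model's own prediction before computing cross-entropy. A minimal sketch under that reading, with a hand-picked mixing weight beta, might look like:

import torch.nn.functional as F

def soft_bootstrap_loss(logits, noisy_labels, beta=0.95):
    """Cross-entropy against a target mixing the noisy one-hot label
    (weight beta) with the model's current softmax prediction (weight 1 - beta)."""
    one_hot = F.one_hot(noisy_labels, num_classes=logits.size(1)).float()
    # Detaching keeps the target fixed for this step; some formulations let
    # gradients flow through the prediction term as well.
    probs = F.softmax(logits, dim=1).detach()
    target = beta * one_hot + (1.0 - beta) * probs
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()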

Training deep neural-networks using a noise adaptation layer

TLDR
This study presents a neural-network approach that optimizes the same likelihood function as the EM algorithm, extended to the case where the noisy labels depend on the features in addition to the correct labels.

Identifying Mislabeled Instances in Classification Datasets

TLDR
This paper presents a non-parametric, end-to-end pipeline for finding mislabeled instances in numerical, image and natural-language datasets; the system is evaluated quantitatively by adding a small amount of label noise to 29 datasets, where the injected noise is found with an average precision of more than 0.84.

Training Convolutional Networks with Noisy Labels

TLDR
An extra noise layer, which adapts the network outputs to match the noisy label distribution, is introduced into the network; its parameters can be estimated as part of the training process, requiring only simple modifications to current training infrastructures for deep networks.
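
One common realization of such a noise layer is a learnable class-transition matrix applied to the softmax output during training and dropped at test time. The sketch below follows that reading; the class name and arguments are illustrative, not taken from the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseAdaptationHead(nn.Module):
    """Wraps a base classifier with a learnable transition matrix that maps
    clean-class probabilities to noisy-label probabilities."""

    def __init__(self, base_model, num_classes):
        super().__init__()
        self.base_model = base_model
        # Start near the identity, i.e. assume labels are mostly clean at first.
        self.transition = nn.Parameter(4.0 * torch.eye(num_classes))

    def forward(self, x, with_noise_layer=True):
        clean_probs = F.softmax(self.base_model(x), dim=1)
        if not with_noise_layer:
            return clean_probs  # test time: predict the clean distribution
        T = F.softmax(self.transition, dim=1)  # row-normalized transition matrix
        return clean_probs @ T  # probabilities over the observed noisy labels

# Training would minimize F.nll_loss(torch.log(noisy_probs + 1e-8), noisy_labels)
# on the probabilities returned with the noise layer enabled.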

Image Classification with Deep Learning in the Presence of Noisy Labels: A Survey