
Data Cleansing for Models Trained with SGD

@article{Hara2019DataCF,
  title={Data Cleansing for Models Trained with SGD},
  author={Satoshi Hara and Atsushi Nitanda and Takanori Maehara},
  journal={ArXiv},
  year={2019},
  volume={abs/1906.08473}
}
Data cleansing is a typical approach used to improve the accuracy of machine learning models; however, it requires extensive domain knowledge to identify the influential instances that affect the models. This paper proposes an algorithm that estimates the influence of each training instance on models trained with SGD. With the proposed method, users only need to inspect the instances suggested by the algorithm, so they do not need extensive knowledge for this procedure, which enables even non-experts to conduct data cleansing and improve the model.
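
As a concrete illustration of the general idea (a minimal sketch, not the authors' exact SGD-influence estimator), the following NumPy snippet scores each training instance of an SGD-trained logistic regression by a first-order estimate of how its removal would change the validation loss; the toy data and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, x, y):
    # Gradient of the logistic loss for a single instance (x, y), y in {0, 1}.
    return (sigmoid(x @ w) - y) * x

# Toy data in which some flipped labels play the role of "dirty" instances.
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)
y[:10] = 1 - y[:10]                          # corrupt the first 10 labels
X_val = rng.normal(size=(50, 5))
y_val = (X_val[:, 0] > 0).astype(float)

# Plain SGD training.
w, lr = np.zeros(5), 0.1
for epoch in range(20):
    for i in rng.permutation(len(X)):
        w -= lr * grad(w, X[i], y[i])

# First-order influence proxy: removing instance i roughly undoes its descent
# steps, so the validation loss changes by about lr * g_val @ grad_i. The most
# negative scores flag instances whose removal would help, i.e. the natural
# candidates to inspect first during cleansing.
g_val = np.mean([grad(w, xv, yv) for xv, yv in zip(X_val, y_val)], axis=0)
scores = np.array([g_val @ grad(w, X[i], y[i]) for i in range(len(X))])
print("candidate harmful instances:", np.argsort(scores)[:10])
```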

Efficient Estimation of Influence of a Training Instance

The proposed method, inspired by dropout, can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset for improving generalization.
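
A loose toy rendering of the fixed-mask idea, assuming a linear model in place of a deep network: each instance trains only the coordinates selected by its own fixed dropout mask, so the sub-model that saw the instance can be compared against the complementary sub-model that did not. The data, masks, and influence readout are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Fixed dropout mask per instance: its updates only ever touch the
# coordinates where its mask is 1, so roughly half of the parameters are
# trained with the instance and the other half without it.
masks = (rng.random((n, d)) < 0.5).astype(float)

w, lr = np.zeros(d), 0.01
for epoch in range(50):
    for i in rng.permutation(n):
        m = masks[i]
        err = (X[i] * m) @ w - y[i]
        w -= lr * err * (X[i] * m)

# Influence of instance i on a test point: prediction of the sub-model that
# was trained on i minus prediction of the complementary sub-model.
x_test = rng.normal(size=d)
i = 0
influence = (x_test * masks[i]) @ w - (x_test * (1 - masks[i])) @ w
print("influence of instance 0 on x_test:", influence)
```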

Over-Fit: Noisy-Label Detection based on the Overfitted Model Property

This paper proposes a novel noisy-label detection algorithm by employing the property of overfitting on individual data points and presents two novel criteria that statistically measure how much each training sample abnormally affects the model and clean validation data.

Data Cleansing for Deep Neural Networks with Storage-efficient Approximation of Influence Functions

A method to reduce the cache files that store model parameters during the training phase, which are needed at inference time to calculate influence scores, is presented, and an accuracy improvement from data cleansing by removing negatively influential data is observed.

Influence Estimation for Generative Adversarial Networks

An influence estimation method is proposed that uses the Jacobian of the gradient of the generator's loss with respect to the discriminator's parameters, along with a novel evaluation scheme in which the harmfulness of each training instance is assessed on the basis of how a GAN evaluation metric is expected to change due to the removal of the instance.

RNNRepair: Automatic RNN Repair via Model-based Analysis

A lightweight model-based approach (RNNRepair) that helps understand and repair incorrect behaviors of an RNN by efficiently estimating the influence of existing or newly added training samples on a given prediction, at both the sample level and the segmentation level.

Understanding Instance-based Interpretability of Variational Auto-Encoders

This paper investigates influence functions, a popular instance-based interpretation method, for a class of deep generative models called variational auto-encoders (VAEs). It formally frames the counterfactual question answered by influence functions in this setting and, through theoretical analysis, examines what they reveal about the impact of training samples on classical unsupervised learning methods.

Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees

BoostIn is an efficient influence-estimation method for GBDTs that performs as well as or better than existing work while being four orders of magnitude faster.

Finding High-Value Training Data Subset through Differentiable Convex Programming

The key idea is to design a learnable framework for online subset selection that can be trained using mini-batches of training data, making the method scalable; this results in a parameterized convex subset-selection problem that is amenable to a differentiable convex programming paradigm.

No Regret Sample Selection with Noisy Labels

The proposed sample selection method adaptively selects a subset of noisily labeled training samples to minimize the regret of selecting noisy samples, and improves the performance of a black-box DNN trained with noisy labeled data.

HYDRA: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks

This paper proposes Hypergradient Data Relevance Analysis (HYDRA), which interprets the predictions made by DNNs as effects of their training data, and quantitatively demonstrates that HYDRA outperforms influence functions in accurately estimating data contribution and detecting noisy data labels.

References

Showing 1-10 of 24 references

Learning to Reweight Examples for Robust Deep Learning

This work proposes a novel meta-learning algorithm that learns to assign weights to training examples based on their gradient directions. The method can be easily implemented on any type of deep network, does not require any additional hyperparameter tuning, and achieves impressive performance on class-imbalance and corrupted-label problems where only a small amount of clean validation data is available.
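
A stripped-down PyTorch sketch of the virtual-step reweighting idea, assuming a functional logistic-regression model so that one SGD step can be differentiated through; this is not the paper's full algorithm, and the toy data and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Functional logistic regression: parameters are an explicit tensor so we can
# differentiate through one virtual SGD step.
torch.manual_seed(0)
W = torch.zeros(5, 1, requires_grad=True)

def per_example_loss(W, X, y):
    return F.binary_cross_entropy_with_logits(
        (X @ W).squeeze(1), y, reduction="none")

X_tr = torch.randn(64, 5); y_tr = (X_tr[:, 0] > 0).float()
y_tr[:8] = 1 - y_tr[:8]                          # corrupted labels
X_val = torch.randn(32, 5); y_val = (X_val[:, 0] > 0).float()

lr = 0.1
for step in range(100):
    # Virtual SGD step under zero-initialized per-example weights eps.
    eps = torch.zeros(len(X_tr), requires_grad=True)
    g_W, = torch.autograd.grad((eps * per_example_loss(W, X_tr, y_tr)).sum(),
                               W, create_graph=True)
    W_virtual = W - lr * g_W
    # Each example's weight comes from how upweighting it would move the
    # validation loss after that virtual step.
    val_loss = per_example_loss(W_virtual, X_val, y_val).mean()
    g_eps, = torch.autograd.grad(val_loss, eps)
    w = torch.clamp(-g_eps, min=0)
    w = w / (w.sum() + 1e-8)                     # normalized example weights
    # Real update with the learned weights.
    loss = (w * per_example_loss(W, X_tr, y_tr)).sum()
    g, = torch.autograd.grad(loss, W)
    W.data -= lr * g
```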

Understanding Black-box Predictions via Influence Functions

This paper uses influence functions — a classic technique from robust statistics — to trace a model's prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction.
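
For concreteness, a small NumPy sketch of the influence-function score I(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z) on an L2-regularized logistic regression, where the Hessian is small enough to invert explicitly; larger models need the approximations the paper develops. The toy setup is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-in for the model whose predictions we want to explain.
X = rng.normal(size=(300, 5))
y = (X @ np.array([1.5, -2.0, 0.5, 0.0, 0.0]) > 0).astype(float)
lam = 1e-2

w = np.zeros(5)
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= 0.5 * (X.T @ (p - y) / len(X) + lam * w)

# Hessian of the regularized empirical risk at the optimum.
p = sigmoid(X @ w)
H = (X * (p * (1 - p))[:, None]).T @ X / len(X) + lam * np.eye(5)
H_inv = np.linalg.inv(H)

def point_grad(x, y_):
    return (sigmoid(x @ w) - y_) * x

# Negative scores mean upweighting the point lowers the test loss,
# i.e. the point is helpful for this prediction.
x_test, y_test = X[0], y[0]
influences = np.array([-point_grad(x_test, y_test) @ H_inv
                       @ point_grad(X[i], y[i]) for i in range(len(X))])
print("most helpful training points for X[0]:", np.argsort(influences)[:5])
```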

Interpreting Black Box Predictions using Fisher Kernels

This work takes a novel look at black box interpretation of test predictions in terms of training examples, making use of Fisher kernels as the defining feature embedding of each data point, combined with Sequential Bayesian Quadrature (SBQ) for efficient selection of examples.

Anomaly Detection with Robust Deep Autoencoders

Novel extensions to deep autoencoders are demonstrated which not only maintain a deep autoencoder's ability to discover high-quality, non-linear features but can also eliminate outliers and noise without access to any clean training data.

Learning with Noisy Labels

The problem of binary classification in the presence of random classification noise is theoretically studied: the learner sees labels that have independently been flipped with some small probability, and methods used in practice, such as the biased SVM and weighted logistic regression, are shown to be provably noise-tolerant.
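
As a concrete anchor, the paper's method-of-unbiased-estimators construction can be written as follows, with labels y in {-1, +1} and rho_{+1}, rho_{-1} the flip probabilities of positive and negative labels:

```latex
\tilde{\ell}(t, y)
  = \frac{(1 - \rho_{-y})\,\ell(t, y) - \rho_{y}\,\ell(t, -y)}
         {1 - \rho_{+1} - \rho_{-1}},
\qquad
\mathbb{E}_{\tilde{y}}\!\left[\tilde{\ell}(t, \tilde{y})\right] = \ell(t, y).
```

Because the surrogate loss is unbiased under the noise distribution, minimizing it on noisy labels is, in expectation, the same as minimizing the original loss on clean labels.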

Identifying Mislabeled Training Data

This paper uses a set of learning algorithms to create classifiers that serve as noise filters for the training data and suggests that for situations in which there is a paucity of data, consensus filters are preferred, whereas majority vote filters are preferable for situations with an abundance of data.
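
A hedged scikit-learn sketch of the two filtering schemes the paper compares: cross-validated predictions from several base classifiers, with consensus filters flagging a point only when every classifier misclassifies it and majority filters flagging it when most do. The choice of base learners and the injected noise are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y[:25] = 1 - y[:25]                       # inject label noise

# Each point is judged on held-out folds by every base classifier.
models = [LogisticRegression(max_iter=1000), GaussianNB(),
          DecisionTreeClassifier(random_state=0)]
errors = np.stack([cross_val_predict(m, X, y, cv=5) != y for m in models])

consensus = errors.all(axis=0)            # flagged by every classifier
majority = errors.sum(axis=0) >= 2        # flagged by most classifiers
print("consensus flags:", consensus.sum(), "majority flags:", majority.sum())
```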

Learning Multiple Layers of Features from Tiny Images

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

Isolation Forest

The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement.
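
A short usage sketch with scikit-learn's IsolationForest; setting max_samples=256 mirrors the small sub-sampling size the paper advocates, which is what yields the linear time complexity and low memory footprint. The toy data is an illustrative assumption.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(300, 2)),          # normal cluster
               rng.uniform(-6, 6, size=(10, 2))])  # scattered outliers

forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(X)
scores = forest.score_samples(X)   # lower (more negative) = more anomalous
print("10 most anomalous rows:", np.argsort(scores)[:10])
```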

Training Set Debugging Using Trusted Items

The approach seeks the smallest set of changes to the training set labels such that the model learned from this corrected training set predicts labels of the trusted items correctly, and is a step toward trustworthy machine learning.

Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery

AnoGAN, a deep convolutional generative adversarial network, is proposed to learn a manifold of normal anatomical variability, accompanied by a novel anomaly scoring scheme based on the mapping from image space to a latent space.
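
A hedged PyTorch sketch of AnoGAN-style scoring: search the latent space for the code whose generation best matches the query image, then score it by a weighted sum of the pixel residual and a discriminator-feature mismatch. The generator and feature extractor below are untrained placeholders standing in for a GAN trained on normal data only.

```python
import torch
import torch.nn as nn

# Untrained placeholders: in AnoGAN these come from a GAN trained on
# images of normal anatomy only.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784), nn.Tanh())
D_feat = nn.Sequential(nn.Linear(784, 64), nn.ReLU())  # discriminator features

def anomaly_score(x, steps=200, lam=0.1):
    # Gradient search for the latent code whose generation best explains x.
    z = torch.randn(1, 16, requires_grad=True)
    opt = torch.optim.Adam([z], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        g = G(z)
        residual = (x - g).abs().sum()                 # pixel-space mismatch
        feat = (D_feat(x) - D_feat(g)).abs().sum()     # feature-space mismatch
        loss = (1 - lam) * residual + lam * feat
        loss.backward()
        opt.step()
    return loss.item()  # high score = poorly explained by the normal manifold

x_query = torch.rand(1, 784)
print("anomaly score:", anomaly_score(x_query))
```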