Prioritized training on points that are learnable, worth learning, and not yet learned

Sören Mindermann, Muhammed Razzak, Winnie Xu, Andreas Kirsch, Mrinank Sharma, Adrien Morisot, Aidan N. Gomez, Sebastian Farquhar, Janina Brauner, Yarin Gal
A new conference version of this workshop paper is available.

We introduce Goldilocks Selection, a technique for faster model training which selects a sequence of training points that are “just right”. We propose an information-theoretic acquisition function, the reducible validation loss, and compute it with a small proxy model, GoldiProx, to efficiently choose training points that maximize information about the labels of a validation set. We show that the…


Test Distribution-Aware Active Learning: A Principled Approach Against Distribution Shift and Outliers

It is argued that conventional model-based methods for active learning, such as BALD, have a fundamental shortfall: they fail to directly account for the test-time distribution of the input variables. An acquisition strategy based on maximizing the expected information gained about possible future predictions is revisited.

Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling

This work discusses online Bayesian inference, which would allow us to make predictions while taking into account additional data without retraining, and proposes new challenging evaluation settings using active learning and active sampling, which are more realistic than previously suggested ones.

Improve Deep Image Inpainting by Emphasizing the Complexity of Missing Regions

A knowledge-assisted index composed of missingness complexity and forward loss is presented to guide the batch selection in the training procedure and helps find samples that are more conducive to optimization in each iteration and ultimately boost the overall inpainting performance.

Unifying Approaches in Data Subset Selection via Fisher Information and Information-Theoretic Quantities

The Fisher information is revisited and used to show how several otherwise disparate methods are connected as approximations of information-theoretic quantities.

Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

This work provides a unified and efficient framework for Metadata Archaeology, that is, uncovering and inferring metadata of examples in a dataset, and the approach is on par with far more sophisticated mitigation methods across different tasks.

Robust Active Distillation

Distilling knowledge from a large teacher model to a lightweight one is a widely successful approach for generating compact, powerful models in the semi-supervised learning setting where a limited

Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging

It is shown that averaging the weights of the k latest checkpoints, each collected at the end of an epoch, can speed up training progress in terms of loss and accuracy by dozens of epochs, corresponding to time savings of up to ~68 and ~30 GPU hours when training a ResNet50 on ImageNet and a RoBERTa-Base model on WikiText-103, respectively.
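
The averaging step described in this summary can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each checkpoint is a plain dictionary mapping a parameter name to a list of floats, and the function name `average_latest_checkpoints` and the toy training loop are hypothetical.

```python
from collections import deque


def average_latest_checkpoints(checkpoints, k):
    """Return the element-wise mean of the k most recent checkpoints.

    Assumption: each checkpoint is a dict {parameter name: list of floats}
    and all checkpoints share the same names and shapes.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    latest = list(checkpoints)[-k:]
    if not latest:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in latest[0]:
        # Group the i-th entry of this parameter across the checkpoints.
        columns = zip(*(ckpt[name] for ckpt in latest))
        averaged[name] = [sum(col) / len(latest) for col in columns]
    return averaged


# Toy usage: store one checkpoint at the end of each "epoch".
history = deque(maxlen=3)  # keeps only the 3 latest checkpoints
for epoch in range(5):
    history.append({"w": [float(epoch), 2.0 * epoch]})

averaged_weights = average_latest_checkpoints(history, k=3)
```

Keeping the checkpoints in a bounded `deque` means memory use stays proportional to k rather than to the number of epochs.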

Prioritizing Samples in Reinforcement Learning with Reducible Loss

An algorithm is developed that prioritizes samples with high learnability while assigning lower priority to those that are hard to learn, typically because of noise or stochasticity.



Not All Samples Are Created Equal: Deep Learning with Importance Sampling

A principled importance sampling scheme is proposed that focuses computation on "informative" examples and reduces the variance of the stochastic gradients during training; a tractable upper bound on the per-sample gradient norm is also derived.
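
The general mechanism behind such a scheme can be illustrated with a generic importance-sampling sketch. It is not the paper's method: the per-example `scores` are assumed to be given (the paper's gradient-norm bound is not computed here), and the function name `importance_sample` is hypothetical. Examples are drawn with probability proportional to their score, and each drawn example receives the weight 1 / (N * p_i) so that the weighted batch average remains an unbiased estimate of the full-dataset average.

```python
import random


def importance_sample(scores, batch_size, rng):
    """Draw indices with probability proportional to `scores`.

    Returns (indices, weights), where weights[j] = 1 / (N * p_i) for the
    drawn index i. Assumption: every score is strictly positive.
    """
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be strictly positive")
    n = len(scores)
    total = sum(scores)
    probs = [s / total for s in scores]
    # Sampling with replacement keeps the weighted estimator unbiased.
    indices = rng.choices(range(n), weights=probs, k=batch_size)
    weights = [1.0 / (n * probs[i]) for i in indices]
    return indices, weights


# Toy usage: examples with larger scores are drawn more often
# but are down-weighted when they are averaged.
rng = random.Random(0)
indices, weights = importance_sample([1.0, 2.0, 3.0, 4.0], batch_size=8, rng=rng)
```

With uniform scores every weight equals 1, so the sketch reduces to ordinary uniform minibatch sampling.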

Selection Via Proxy: Efficient Data Selection For Deep Learning

This work shows that it can significantly improve the computational efficiency of data selection in deep learning by using a much smaller proxy model to perform data selection for tasks that will eventually require a large target model (e.g., selecting data points to label for active learning).

GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning

This work formulates GLISTER as a mixed discrete-continuous bi-level optimization problem that selects a subset of the training data maximizing the log-likelihood on a held-out validation set. It also proposes an iterative online algorithm, GLISTER-ONLINE, which performs data selection alongside the parameter updates and can be applied to any loss-based learning algorithm.

Learning to Reweight Examples for Robust Deep Learning

This work proposes a novel meta-learning algorithm that learns to assign weights to training examples based on their gradient directions. The algorithm can be easily implemented on any type of deep network, does not require any additional hyperparameter tuning, and achieves impressive performance on class imbalance and corrupted label problems where only a small amount of clean validation data is available.

Online Batch Selection for Faster Training of Neural Networks

This work investigates online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam, and proposes a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank.
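
The ranking strategy in this summary can be sketched as follows. This is a minimal illustration under stated assumptions: the summary does not give the exact parameterization, so the geometric `decay` factor, the choice to sample with replacement, and the function names are all hypothetical.

```python
import random


def rank_selection_probabilities(losses, decay):
    """Map latest-known losses to selection probabilities.

    Rank 0 is the datapoint with the highest loss, and the probability of
    rank r is proportional to decay**r (assumed form of the decay).
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    # Indices sorted from highest to lowest latest-known loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    rank_weights = [decay ** rank for rank in range(len(order))]
    total = sum(rank_weights)
    probs = [0.0] * len(losses)
    for rank, idx in enumerate(order):
        probs[idx] = rank_weights[rank] / total
    return probs


def sample_batch(losses, batch_size, decay, rng):
    """Sample a batch of indices (with replacement) using the rank probabilities."""
    probs = rank_selection_probabilities(losses, decay)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


# Toy usage: index 1 has the highest loss, so it is the most likely pick.
latest_losses = [0.1, 2.0, 0.7]
probs = rank_selection_probabilities(latest_losses, decay=0.5)
batch = sample_batch(latest_losses, batch_size=4, decay=0.5, rng=random.Random(0))
```

Setting `decay` to 1 recovers uniform sampling, while smaller values concentrate selection on the highest-loss datapoints.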

One Epoch Is All You Need

It is suggested to train on a larger dataset for only one epoch, unlike the current practice in which unsupervised models are trained for tens to hundreds of epochs, and the performance of a Transformer language model is dramatically improved in this way.

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.

Coresets for Data-efficient Training of Machine Learning Models

CRAIG, a method that selects a weighted subset of training data closely estimating the full gradient by maximizing a submodular function, is developed, and it is proved that applying IG to this subset is guaranteed to converge to the (near-)optimal solution with the same convergence rate as that of IG for convex optimization.

An Empirical Model of Large-Batch Training

It is demonstrated that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets, reinforcement learning domains, and even generative model training (autoencoders on SVHN).

Incorporating Diversity in Active Learning with Support Vector Machines

This work presents a new approach that is especially designed to construct batches and incorporates a diversity measure that has low computational requirements making it feasible for large scale problems with several thousands of examples.