Corpus ID: 3861760

Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches

Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Baker Grosse
Stochastic neural net weights are used in a variety of contexts, including regularization, Bayesian neural nets, exploration in reinforcement learning, and evolution strategies. Unfortunately, due to the large number of weights, all the examples in a mini-batch typically share the same weight perturbation, thereby limiting the variance reduction effect of large mini-batches. We introduce flipout, an efficient method for decorrelating the gradients within a mini-batch by implicitly sampling… 
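The core trick in the abstract can be illustrated for a single fully connected layer. The following is a minimal NumPy sketch (function name, shapes, and the sign-sampling details are my own illustration, not the paper's reference implementation): one shared Gaussian perturbation is sampled per mini-batch, and independent random sign vectors give each example a pseudo-independent perturbation at the cost of two extra matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)

def flipout_linear(x, w_mean, w_std):
    """Linear layer with a flipout-style perturbation: one shared
    Gaussian weight sample per mini-batch, decorrelated across
    examples by independent random sign vectors."""
    batch, d_in = x.shape
    d_out = w_mean.shape[0]
    # Shared mean-zero perturbation, sampled once per mini-batch.
    delta_w = w_std * rng.standard_normal(w_mean.shape)   # (d_out, d_in)
    # Independent random signs for each example.
    r = rng.choice([-1.0, 1.0], size=(batch, d_out))
    s = rng.choice([-1.0, 1.0], size=(batch, d_in))
    # y_n = W x_n + r_n * (delta_w @ (s_n * x_n)), which equals using
    # per-example weights W + delta_w * outer(r_n, s_n).
    return x @ w_mean.T + ((x * s) @ delta_w.T) * r
```

Because `E[r_n s_n^T] = 0` and sign flips preserve the Gaussian distribution, each example effectively sees its own weight sample while only one perturbation matrix is ever materialized.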


Hyperparameter Ensembles for Robustness and Uncertainty Quantification

This paper proposes hyper-deep ensembles, a simple procedure involving a random search over different hyperparameters, themselves stratified across multiple random initializations, and a parameter-efficient version, hyper-batch ensembles, which builds on the layer structure of batch ensembles and self-tuning networks.

An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise

Empirical studies with standard deep-learning architectures and datasets show that the proposed method of adding covariance noise to the gradients not only improves generalization in large-batch training, but does so while keeping optimization performance desirable and without lengthening training.

BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning

BatchEnsemble is proposed, an ensemble method whose computational and memory costs are significantly lower than typical ensembles and can easily scale up to lifelong learning on Split-ImageNet which involves 100 sequential learning tasks.
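BatchEnsemble's efficiency comes from sharing one weight matrix across members and giving each member only a rank-1 multiplicative factor. A hypothetical NumPy sketch (shapes and names are my own) of that layer:

```python
import numpy as np

def batch_ensemble_linear(x, w, r, s):
    """BatchEnsemble-style linear layer: member i implicitly uses
    weights w * outer(r[i], s[i]), but only the shared matrix w and
    the rank-1 factors r, s are ever stored.

    x: (members, d_in), w: (d_out, d_in),
    r: (members, d_out), s: (members, d_in)."""
    # Elementwise trick: ((x * s) @ w.T) * r == (w * r s^T) @ x per member.
    return ((x * s) @ w.T) * r
```

All ensemble members can thus be evaluated in a single batched matrix multiply, which is what keeps the memory and compute costs close to a single network.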

Efficient Low Rank Gaussian Variational Inference for Neural Networks

It is found that adding low-rank terms to a parametrized diagonal covariance does not improve predictive performance except on small networks, but adding low-rank terms to a constant diagonal covariance improves performance on both small and large-scale network architectures.

Deep Ensembles: A Loss Landscape Perspective

Developing the concept of the diversity--accuracy plane, it is shown that the decorrelation power of random initializations is unmatched by popular subspace sampling methods and the experimental results validate the hypothesis that deep ensembles work well under dataset shift.

Learning Sparse Networks Using Targeted Dropout

Targeted dropout is introduced, a method for training a neural network so that it is robust to subsequent pruning; it improves upon more complicated sparsifying regularisers while being simple to implement and easy to tune.
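The idea of targeting the dropout at likely pruning victims can be sketched as follows (a simplified weight-level illustration of my own; the paper also covers unit-level variants and different hyperparameter settings): take the lowest-magnitude fraction of weights as candidates, then drop each candidate stochastically.

```python
import numpy as np

def targeted_weight_dropout(w, drop_rate, targ_frac, rng):
    """Targeted dropout (sketch): mark the targ_frac lowest-magnitude
    weights as pruning candidates, then zero each candidate with
    probability drop_rate; non-candidate weights are untouched."""
    k = int(targ_frac * w.size)
    flat = w.reshape(-1).copy()
    candidates = np.argsort(np.abs(flat))[:k]        # smallest |w| first
    dropped = candidates[rng.random(k) < drop_rate]  # stochastic drop
    flat[dropped] = 0.0
    return flat.reshape(w.shape)
```

Training under this noise makes the network rely less on the low-magnitude weights, so pruning them afterwards costs little accuracy.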

HWA: Hyperparameters Weight Averaging in Bayesian Neural Networks

The HWA (Hyperparameters Weight Averaging) algorithm is proposed, which exploits a simple averaging heuristic over hyperparameters and weights in order to optimize faster and achieve better accuracy.

AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly

This paper proposes an efficient method, AutoLRS, which automatically optimizes the learning rate (LR) schedule for each training stage by modeling training dynamics, and demonstrates the advantages and the generality of this method through extensive experiments of training DNNs for tasks from diverse domains using different optimizers.

Flattening Sharpness for Dynamic Gradient Projection Memory Benefits Continual Learning

A soft weight is introduced in GPM to represent the importance of each basis spanning past tasks; it is adaptively learned during training, so that less important bases can be dynamically released to improve sensitivity to learning new skills.

Simple, Distributed, and Accelerated Probabilistic Programming

A simple, low-level approach for embedding probabilistic programming in a deep learning ecosystem, which achieves an optimal linear speedup from 1 to 256 TPUv2 chips.

Variational Dropout and the Local Reparameterization Trick

The variational dropout method is proposed: a generalization of Gaussian dropout with a more flexibly parameterized posterior, often leading to better generalization in stochastic gradient variational Bayes.
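The local reparameterization trick underlying this method can be sketched in a few lines (a hypothetical NumPy snippet of my own): instead of sampling a weight matrix and sharing it across the batch, sample each example's pre-activations directly, which are Gaussian with means and variances computable from the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_reparam_linear(x, w_mu, w_sigma):
    """Local reparameterization: for a factorized Gaussian weight
    posterior, the pre-activations y = W x are themselves Gaussian,
    so sample them per example instead of sampling W."""
    act_mu = x @ w_mu.T                      # E[y]
    act_var = (x ** 2) @ (w_sigma ** 2).T    # Var[y]
    return act_mu + np.sqrt(act_var) * rng.standard_normal(act_mu.shape)
```

Because each example draws its own pre-activation noise, the gradient variance shrinks with batch size, which is exactly the property flipout extends to non-Gaussian and non-fully-connected settings.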

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

Dropout: a simple way to prevent neural networks from overfitting

It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
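Dropout is simple enough to state in code; a minimal "inverted dropout" sketch (the common formulation that rescales at train time so no change is needed at test time):

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: zero each unit with probability p at train
    time and rescale survivors by 1/(1-p), so the expected activation
    matches the test-time (identity) behaviour."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```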

Neural Variational Inference and Learning in Belief Networks

This work proposes a fast, non-iterative approximate inference method that uses a feedforward network to implement efficient exact sampling from the variational posterior, and shows that it outperforms the wake-sleep algorithm on MNIST and achieves state-of-the-art results on the Reuters RCV1 document dataset.

Regularizing and Optimizing LSTM Language Models

This paper proposes the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization and introduces NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.
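The DropConnect operation used on the hidden-to-hidden weights can be sketched as follows (a generic weight-masking snippet of my own, not the paper's LSTM-specific code):

```python
import numpy as np

def drop_connect(w, p, rng):
    """DropConnect (sketch): drop individual *weights* rather than
    activations, rescaling survivors by 1/(1-p) so the expected
    weight value is unchanged."""
    mask = rng.random(w.shape) >= p
    return w * mask / (1.0 - p)
```

In the weight-dropped LSTM this mask is applied to the recurrent weight matrices once per sequence, which regularizes the recurrent connections without disrupting the hidden state between timesteps.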

Training Recurrent Networks by Evolino

It is shown that Evolino-based LSTM can solve tasks that Echo State networks cannot, and achieves higher accuracy on certain continuous function generation tasks than conventional gradient-descent RNNs, including gradient-based LSTM.

Auto-Encoding Variational Bayes

A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.
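The reparameterization trick at the heart of this algorithm is a one-liner; a minimal sketch for a diagonal Gaussian posterior (names are my own):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: express z ~ N(mu, sigma^2) as
    z = mu + sigma * eps with eps ~ N(0, I), so gradients flow
    through mu and log_var instead of through the sampling step."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

Flipout relies on the same differentiable-sampling idea, but additionally decorrelates the samples seen by different examples in a mini-batch.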

Reducing Reparameterization Gradient Variance

This work views the noisy gradient as a random variable and forms an inexpensive approximation of the procedure that generates the gradient sample, making it a useful control variate for variance reduction.

Bayesian Compression for Deep Learning

This work argues that the most principled and effective way to attack the problem of compression and computational efficiency in deep learning is to adopt a Bayesian point of view, where sparsity-inducing priors allow the authors to prune large parts of the network.

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

This work applies a new variational-inference-based dropout technique to LSTM and GRU models, which outperforms existing techniques and, to the best of the authors' knowledge, improves on the single-model state of the art in language modelling on the Penn Treebank.