Corpus ID: 18201582

Variational Dropout Sparsifies Deep Neural Networks

@inproceedings{Molchanov2017VariationalDS,
  title={Variational Dropout Sparsifies Deep Neural Networks},
  author={Dmitry Molchanov and Arsenii Ashukha and Dmitry P. Vetrov},
  booktitle={ICML},
  year={2017}
}
We explore a recently proposed Variational Dropout technique that provided an elegant Bayesian interpretation of Gaussian Dropout. We extend Variational Dropout to the case when dropout rates are unbounded, propose a way to reduce the variance of the gradient estimator, and report the first experimental results with individual dropout rates per weight. Interestingly, this leads to extremely sparse solutions in both fully-connected and convolutional layers. The effect is similar to automatic relevance determination… 
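As a reading aid, here is a minimal sketch of the per-weight variational dropout idea summarized above, assuming a PyTorch-style fully-connected layer. The class name, initialization, pruning threshold, and the constants in the KL approximation are illustrative choices, not details taken from the excerpt.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseVDLinear(nn.Module):
    """Fully-connected layer with a per-weight Gaussian posterior N(theta, sigma^2)."""
    def __init__(self, in_features, out_features, threshold=3.0):
        super().__init__()
        self.theta = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.threshold = threshold  # weights with log_alpha above this are pruned

    @property
    def log_alpha(self):
        # alpha = sigma^2 / theta^2 is the (unbounded) per-weight dropout rate
        return torch.clamp(self.log_sigma2 - torch.log(self.theta ** 2 + 1e-8), -10.0, 10.0)

    def forward(self, x):
        if self.training:
            # local reparameterization: sample noisy pre-activations instead of weights
            mu = F.linear(x, self.theta, self.bias)
            var = F.linear(x ** 2, torch.exp(self.log_sigma2)) + 1e-8
            return mu + var.sqrt() * torch.randn_like(mu)
        # at test time, zero out weights whose dropout rate is effectively infinite
        mask = (self.log_alpha < self.threshold).float()
        return F.linear(x, self.theta * mask, self.bias)

    def kl(self):
        # sigmoid/softplus approximation of KL(q || log-uniform prior), per weight
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()

In a full training loop, the kl() terms of all such layers would be summed and added to the data loss to form the variational objective; weights with large log_alpha can then be dropped with little loss of accuracy, which is the sparsification effect the abstract describes.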

Citations

Variational Dropout via Empirical Bayes
TLDR
It is shown that ARD applied to Bayesian DNNs with Gaussian approximate posterior distributions leads to a variational bound similar to that of variational dropout, and that, in the case of a fixed dropout rate, the objectives are exactly the same.
Joint Inference for Neural Network Depth and Dropout Regularization
TLDR
This work proposes a unified Bayesian model selection method to jointly infer the most plausible network depth warranted by the data and perform dropout regularization simultaneously, defining a beta process over the number of hidden layers that allows the depth to go to infinity.
Unifying the Dropout Family Through Structured Shrinkage Priors
TLDR
It is shown that multiplicative noise induces structured shrinkage priors on a network's weights, and it is derived that dropout's usual Monte Carlo training objective approximates marginal MAP estimation.
Improving Bayesian Inference in Deep Neural Networks with Variational Structured Dropout
TLDR
This work addresses the restrictive factorized structure of the Dropout posterior, which is too inflexible to capture rich correlations among the weight parameters of the true posterior, and proposes a novel method called Variational Structured Dropout (VSD) to overcome this limitation.
Dropout as a Structured Shrinkage Prior
TLDR
It is shown that multiplicative noise induces structured shrinkage priors on a network's weights, and this insight is leveraged to propose a novel shrinkage framework for ResNets, terming the prior 'automatic depth determination' as the natural analog of automatic relevance determination for network depth.
Adaptive Network Sparsification via Dependent Variational Beta-Bernoulli Dropout
TLDR
Adaptive variational dropout, whose probabilities are drawn from a sparsity-inducing beta-Bernoulli prior, allows the resulting network to tolerate a larger degree of sparsity without losing its expressive power, by removing redundancies among features.
Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee
TLDR
The empirical results demonstrate that this variational procedure provides uncertainty quantification in terms of the Bayesian predictive distribution and is also capable of consistent variable selection by training a sparse multi-layer neural network.
Learning Sparse Neural Networks via Sensitivity-Driven Regularization
TLDR
This work quantifies the output sensitivity to the parameters and introduces a regularization term that gradually lowers the absolute value of parameters with low sensitivity, so that a very large fraction of the parameters approach zero and are eventually set to zero by simple thresholding.
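A loose, illustrative sketch of the sensitivity-driven idea in the entry above: parameters to which the output (approximated here by the training loss) is insensitive are pushed toward zero by an extra penalty. The sensitivity proxy, the normalization, and the coefficient lam are assumptions for illustration; the paper's actual sensitivity measure and update rule may differ.

import torch

def insensitivity_penalty(loss, params, lam=1e-4):
    # gradient magnitude as a crude, detached sensitivity proxy for each parameter
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    penalty = 0.0
    for w, g in zip(params, grads):
        s = g.abs().detach()
        s = s / (s.max() + 1e-12)                        # normalize sensitivities to [0, 1]
        penalty = penalty + ((1.0 - s) * w.abs()).sum()  # shrink low-sensitivity weights
    return lam * penalty

The returned term would be added to the training loss before backpropagation; weights driven close to zero could then be removed by simple thresholding, as the summary describes.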
Radial and Directional Posteriors for Bayesian Deep Learning
We propose a new variational family for Bayesian neural networks. We decompose the variational posterior into two components, where the radial component captures the strength of each neuron in terms… 

References

Showing 1-10 of 56 references
Generalized Dropout
TLDR
A rich family of regularizers called Generalized Dropout is introduced; one set of methods in this family is a version of Dropout with trainable parameters, and classical Dropout emerges as a special case of this method.
How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks
TLDR
This work proposes three advances in training algorithms for variational autoencoders, for the first time allowing deep models of up to five stochastic layers to be trained, using a structure similar to the Ladder network as the inference model, and shows state-of-the-art log-likelihood results for generative modeling on several benchmark datasets.
Variational Dropout and the Local Reparameterization Trick
TLDR
Variational dropout is proposed as a generalization of Gaussian dropout, but with a more flexibly parameterized posterior, often leading to better generalization in stochastic gradient variational Bayes.
Information Dropout: Learning Optimal Representations Through Noisy Computation
TLDR
It is proved that Information Dropout achieves a comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite sample expressivity.
Auto-Encoding Variational Bayes
TLDR
A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.
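Since several of the works above build on this stochastic variational inference machinery, a small sketch of the reparameterization trick it introduces may help; the Gaussian posterior, standard-normal prior, and function names are assumptions for illustration.

import torch

def sample_gaussian(mu, log_var):
    # z = mu + sigma * eps keeps the sample differentiable w.r.t. mu and log_var
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def gaussian_kl(mu, log_var):
    # closed-form KL(N(mu, sigma^2) || N(0, I)), summed over the last dimension
    return 0.5 * torch.sum(torch.exp(log_var) + mu ** 2 - 1.0 - log_var, dim=-1)

A single-sample estimate of the evidence lower bound is then the reconstruction log-likelihood of a decoded sample minus this KL term.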
Information Dropout: learning optimal representations through noise
TLDR
Information Dropout is introduced, a generalization of dropout motivated by the Information Bottleneck principle, and it is found that Information Dropout achieves comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network.
Dropout as a Bayesian Approximation: Insights and Applications
TLDR
It is shown that a multilayer perceptron (MLP) with arbitrary depth and non-linearities, with dropout applied after every weight layer, is mathematically equivalent to an approximation to a well known Bayesian model.
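A small sketch of the test-time procedure this equivalence motivates (often called Monte Carlo dropout): keep dropout active at prediction time and average several stochastic forward passes. The model is assumed to be any network containing dropout layers; the function name and sample count are placeholders.

import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=50):
    model.train()   # leave dropout layers stochastic at test time
    preds = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()
    # predictive mean and a simple variance-based uncertainty estimate
    return preds.mean(dim=0), preds.var(dim=0)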
The Power of Sparsity in Convolutional Neural Networks
TLDR
2D convolution is generalized to use a channel-wise sparse connection structure and it is shown that this leads to significantly better results than the baseline approach for large networks including VGG and Inception V3.
Group sparse regularization for deep neural networks