Averaging Weights Leads to Wider Optima and Better Generalization
It is shown that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training; the resulting procedure, Stochastic Weight Averaging (SWA), is extremely easy to implement and has almost no computational overhead.
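The averaging step at the heart of SWA can be sketched in a few lines (a minimal illustration, not the authors' implementation; the `snapshots` list and toy arrays are assumptions):

```python
import numpy as np

def swa_average(snapshots):
    """Running average of parameter vectors collected along the SGD
    trajectory, e.g. one snapshot at the end of each learning-rate cycle."""
    w_swa = np.zeros_like(snapshots[0], dtype=float)
    for i, w in enumerate(snapshots, start=1):
        # incremental mean: w_swa <- w_swa + (w - w_swa) / i
        w_swa += (w - w_swa) / i
    return w_swa

snaps = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
print(swa_average(snaps))  # -> [3. 4.]
```

In practice the snapshots are the full flattened network weights, and batch-norm statistics are recomputed for the averaged model.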
Variational Dropout Sparsifies Deep Neural Networks
Variational Dropout is extended to the case when dropout rates are unbounded, a way to reduce the variance of the gradient estimator is proposed, and the first experimental results with individual dropout rates per weight are reported.
Tensorizing Neural Networks
This paper converts the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a huge factor while the expressive power of the layer is preserved.
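The source of the compression is easy to see from a parameter count: a TT-matrix replaces one big dense matrix with a chain of small cores. A rough sketch under assumed mode sizes and ranks (not the paper's exact figures):

```python
def tt_matrix_params(in_modes, out_modes, rank):
    """Parameter count of a TT-matrix: core k has shape
    r_{k-1} x in_k x out_k x r_k, with boundary ranks fixed to 1."""
    ranks = [1] + [rank] * (len(in_modes) - 1) + [1]
    return sum(ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
               for k in range(len(in_modes)))

# hypothetical example: factor a 1024 x 1024 layer with
# mode sizes (4,4,4,4,4) x (4,4,4,4,4) and all TT-ranks equal to 8
dense_params = 1024 * 1024          # 1048576
tt_params = tt_matrix_params([4] * 5, [4] * 5, rank=8)  # 3328
print(dense_params // tt_params)    # roughly 315x fewer parameters
```

The compression ratio is controlled by the TT-ranks; larger ranks trade parameters back for expressive power.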
A Simple Baseline for Bayesian Uncertainty in Deep Learning
It is demonstrated that SWAG performs well on a wide variety of tasks, including out-of-sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, SGLD, and temperature scaling.
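The core idea is to fit a Gaussian over the weights from SGD iterates: the SWA mean plus a variance estimated from the second moment. A minimal diagonal-only sketch (the full method also keeps a low-rank covariance term; names here are illustrative):

```python
import numpy as np

def swag_diag(snapshots):
    """Fit a diagonal Gaussian over weights from SGD snapshots."""
    W = np.stack(snapshots).astype(float)
    mean = W.mean(axis=0)                      # SWA solution
    second = (W ** 2).mean(axis=0)             # running second moment
    var = np.maximum(second - mean ** 2, 0.0)  # clamp numerical negatives
    return mean, var

def swag_sample(mean, var, rng):
    # draw one weight sample from the fitted posterior approximation
    return mean + rng.standard_normal(mean.shape) * np.sqrt(var)

mean, var = swag_diag([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
print(mean, var)  # -> [2. 3.] [1. 1.]
```

At test time, predictions are averaged over several such weight samples (Bayesian model averaging).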
Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
It is shown that the optima of the complex loss functions of deep neural networks are in fact connected by simple curves over which training and test accuracy are nearly constant, and a training procedure is introduced to discover these high-accuracy pathways between modes.
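One of the simple curve families used is a quadratic Bezier curve whose endpoints are two independently trained solutions and whose single bend point is learned. A sketch of the parametrization (variable names are assumptions):

```python
import numpy as np

def bezier_path(w1, w2, theta, t):
    """Quadratic Bezier curve connecting trained solutions w1 and w2
    through a bend point theta; hits w1 at t=0 and w2 at t=1.
    The curve-finding procedure optimizes theta so that loss stays
    low along the whole path."""
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2

w1 = np.array([0.0, 0.0])
w2 = np.array([1.0, 1.0])
theta = np.array([1.0, 0.0])
print(bezier_path(w1, w2, theta, 0.5))  # -> [0.75 0.25]
```

Training minimizes the expected loss at points sampled uniformly along `t`, which is what makes the discovered pathway high-accuracy everywhere, not just at the endpoints.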
Spatially Adaptive Computation Time for Residual Networks
Experimental results are presented showing that this model improves the computational efficiency of Residual Networks on the challenging ImageNet classification and COCO object detection datasets, and that the computation time maps on the visual saliency dataset CAT2000 correlate surprisingly well with human eye fixation positions.
Variational Autoencoder with Arbitrary Conditioning
We propose a single neural probabilistic model based on a variational autoencoder that can be conditioned on an arbitrary subset of observed features and then sample the remaining features in "one shot".
Structured Bayesian Pruning via Log-Normal Multiplicative Noise
A new Bayesian model is proposed that takes into account the computational structure of neural networks and provides structured sparsity, e.g., removing neurons and/or convolutional channels in CNNs, yielding significant acceleration on a number of deep neural architectures.
Ultimate tensorization: compressing convolutional and FC layers alike
This paper combines the proposed approach with previous work to compress both the convolutional and fully-connected layers of a network, achieving an 80x network compression rate with a 1.1% accuracy drop on the CIFAR-10 dataset.
Breaking Sticks and Ambiguities with Adaptive Skip-gram
The Adaptive Skip-gram model is proposed: a nonparametric Bayesian extension of Skip-gram capable of automatically learning the required number of representations for all words at the desired semantic resolution. An efficient online variational learning algorithm is derived for the model, and its efficiency is demonstrated empirically on the word-sense induction task.