# SGB: Stochastic Gradient Bound Method for Optimizing Partition Functions

@article{Wang2020SGBSG, title={SGB: Stochastic Gradient Bound Method for Optimizing Partition Functions}, author={Junchang Wang and Anna Choromańska}, journal={ArXiv}, year={2020}, volume={abs/2011.01474} }

This paper addresses the problem of optimizing partition functions in a stochastic learning setting. We propose a stochastic variant of the bound majorization algorithm that relies on upper-bounding the partition function with a quadratic surrogate. The update of the proposed method, which we refer to as Stochastic Partition Function Bound (SPFB), resembles scaled stochastic gradient descent, where the scaling factor relies on a second-order term that is, however, different from the Hessian…
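As a rough illustration of "scaled stochastic gradient descent with a bound-derived scaling term", the sketch below builds a quadratic upper bound on a log-linear partition function and uses its curvature matrix to scale a single-sample gradient step. It is a minimal sketch under stated assumptions, not the paper's SPFB update: the recursion for `z`, `mu`, and `Sigma` follows the quadratic bound from "Majorization for CRFs and Latent Likelihoods" as recalled here, while the function names, the `ridge` term, and the step rule `theta - lr * solve(Sigma, grad)` are hypothetical choices made for illustration.

```python
import numpy as np

def log_partition(theta, feats):
    """log Z(theta) = log sum_y exp(theta . f(y)); rows of feats are f(y)."""
    s = feats @ theta
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

def partition_bound(theta, feats):
    """Return (z, mu, Sigma) such that, for any delta,
    log Z(theta + delta) <= log z + delta.mu + 0.5 * delta.Sigma.delta.
    Assumed recursion (quadratic bound of the majorization reference)."""
    d = feats.shape[1]
    z = 1e-12                      # stands in for the limit z -> 0+
    mu = np.zeros(d)
    Sigma = z * np.eye(d)
    for f_y in feats:
        alpha = np.exp(theta @ f_y)
        l = f_y - mu
        r = np.log(alpha / z)
        # tanh(r/2) / (2r) tends to 1/4 as r -> 0
        coef = 0.25 if abs(r) < 1e-8 else np.tanh(0.5 * r) / (2.0 * r)
        Sigma += coef * np.outer(l, l)
        mu += alpha / (z + alpha) * l
        z += alpha
    return z, mu, Sigma

def nll(theta, feats, y_idx):
    """Negative log-likelihood of outcome y_idx under the log-linear model."""
    return log_partition(theta, feats) - theta @ feats[y_idx]

def bound_scaled_step(theta, feats, y_idx, lr=1.0, ridge=1e-6):
    """One hypothetical scaled-gradient step on a single sample.
    mu - f(y) is the gradient of the sample NLL at theta; Sigma, which comes
    from the bound and is not the Hessian, scales it. ridge is an assumed
    regularizer that keeps the linear solve well conditioned."""
    _, mu, Sigma = partition_bound(theta, feats)
    grad = mu - feats[y_idx]
    scaled = np.linalg.solve(Sigma + ridge * np.eye(theta.size), grad)
    return theta - lr * scaled
```

With `lr=1.0` the step minimizes the quadratic surrogate (up to the ridge term), so the sample loss should not increase; smaller values of `lr` would be the natural choice when steps are averaged over noisy samples.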

## References


### Majorization for CRFs and Latent Likelihoods

- Computer Science
- NIPS
- 2012

A quadratic variational upper bound is introduced to optimize partition functions and facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems.

### Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

- Computer Science
- ICLR
- 2017

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.

### On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

- Computer Science
- ICLR
- 2017

This work investigates the cause of the generalization drop in the large-batch regime and presents numerical evidence that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization.

### A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks

- Computer Science
- ICML
- 2019

It is argued that the Gaussianity assumption might fail to hold in deep learning settings, rendering Brownian motion-based analyses inappropriate; the results open up a different perspective and shed more light on the belief that SGD prefers wide minima.

### How neural networks find generalizable solutions: Self-tuned annealing in deep learning

- Computer Science
- ArXiv
- 2020

This study indicates that SGD attains a self-tuned landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape.

### SmoothOut: Smoothing Out Sharp Minima for Generalization in Large-Batch Deep Learning

- Computer Science
- ArXiv
- 2018

It is proved that Stochastic SmoothOut is an unbiased approximation of the original SmoothOut and can eliminate sharp minima in deep neural networks (DNNs), thereby closing the generalization gap.

### Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2012

The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise; the new method is shown to strike a competitive trade-off in comparison to other estimation methods for unnormalized models.

### On Contrastive Divergence Learning

- Computer Science
- AISTATS
- 2005

The properties of CD learning are studied and it is shown that it provides biased estimates in general, but that the bias is typically very small.

### Statistical guarantees for the EM algorithm: From population to sample-based analysis

- Computer Science, Mathematics
- ArXiv
- 2014

A general framework is developed for proving rigorous guarantees on the performance of the EM algorithm and a variant known as gradient EM, along with consequences of the general theory for three canonical examples of incomplete-data problems.

### Annealed importance sampling

- Mathematics
- Stat. Comput.
- 2001

It is shown how one can use the Markov chain transitions for such an annealing sequence to define an importance sampler, which can be seen as a generalization of a recently-proposed variant of sequential importance sampling.