# Dual Training of Energy-Based Models with Overparametrized Shallow Neural Networks

```bibtex
@article{DomingoEnrich2021DualTO,
  title   = {Dual Training of Energy-Based Models with Overparametrized Shallow Neural Networks},
  author  = {Carles Domingo-Enrich and Alberto Bietti and Marylou Gabri{\'e} and Joan Bruna and Eric Vanden-Eijnden},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2107.05134}
}
```

Energy-based models (EBMs) are generative models that are usually trained via maximum likelihood estimation. This approach becomes challenging in generic situations where the trained energy is non-convex, due to the need to sample the Gibbs distribution associated with this energy. Using general Fenchel duality results, we derive variational principles dual to maximum likelihood EBMs with shallow overparametrized neural network energies, both in the feature-learning and lazy linearised regimes…
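The sampling bottleneck described above can be made concrete with unadjusted Langevin dynamics, the standard MCMC workhorse for drawing approximate samples from a Gibbs density p(x) ∝ exp(−E(x)). Below is a minimal sketch with a toy quadratic energy, not the paper's neural-network energy; the function name and step size are illustrative choices:

```python
import numpy as np

def langevin_sample(grad_energy, x0, step=1e-2, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics targeting the Gibbs density p(x) ∝ exp(-E(x))."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Euler–Maruyama step: drift down the energy gradient plus Gaussian noise.
        x = x - step * grad_energy(x) + np.sqrt(2.0 * step) * noise
    return x

# Toy energy E(x) = ||x||^2 / 2, whose Gibbs distribution is a standard normal.
grad_E = lambda x: x
samples = np.stack([langevin_sample(grad_E, np.zeros(2), rng=i) for i in range(200)])
```

With a non-convex neural-network energy, the chain can mix arbitrarily slowly between modes, which is precisely the difficulty the dual variational formulation seeks to sidestep.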

## One Citation

### Simultaneous Transport Evolution for Minimax Equilibria on Measures

- Computer Science · ArXiv
- 2022

This work establishes global convergence towards the global equilibrium by using simultaneous gradient ascent-descent with respect to the Wasserstein metric, a dynamics that admits efficient particle discretization in high dimensions, as opposed to entropic mirror descent.

## References

Showing 1–10 of 68 references.

### Implicit Generation and Generalization in Energy-Based Models

- Computer Science · ArXiv
- 2019

This work presents techniques to scale MCMC based EBM training on continuous neural networks, and shows its success on the high-dimensional data domains of ImageNet32x32, ImageNet128x128, CIFAR-10, and robotic hand trajectories, achieving better samples than other likelihood models and nearing the performance of contemporary GAN approaches.

### How to Train Your Energy-Based Models

- Computer Science · ArXiv
- 2021

This tutorial starts by explaining maximum likelihood training with Markov chain Monte Carlo (MCMC), and proceeds to elaborate on MCMC-free approaches, including Score Matching and Noise Contrastive Estimation, to highlight theoretical connections among these three approaches.

### On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

- Computer Science · NeurIPS
- 2018

It is shown that, when initialized correctly and in the many-particle limit, this gradient flow converges to global minimizers despite the non-convexity of the objective; the analysis involves Wasserstein gradient flows, a tool from optimal transport theory.
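The particle picture behind this line of work can be sketched numerically: each neuron of a 1/m-scaled two-layer ReLU network is a "particle", and full-batch gradient descent on the network is a discretization of a Wasserstein gradient flow over the empirical distribution of neurons. The target function, width, and learning rate below are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 50                              # neurons ("particles"), data points
X = np.linspace(-1.0, 1.0, n)[:, None]      # inputs, shape (n, 1)
y = np.sin(np.pi * X[:, 0])                 # smooth 1D regression target

a = rng.standard_normal(m)                  # output weights, one per particle
w = rng.standard_normal(m)                  # input weights
b = rng.standard_normal(m)                  # biases

def predict():
    # Mean-field scaling: f(x) = (1/m) * sum_i a_i ReLU(w_i x + b_i)
    return np.maximum(X * w + b, 0.0) @ a / m

lr = 0.2 * m                                # per-particle gradients are O(1/m), so scale lr by m
loss_init = float(np.mean((predict() - y) ** 2))
for _ in range(500):
    pre = X * w + b                         # pre-activations, (n, m)
    act = np.maximum(pre, 0.0)
    resid = act @ a / m - y                 # residuals, (n,)
    mask = (pre > 0.0).astype(float)        # ReLU subgradient
    ga = 2.0 * act.T @ resid / (n * m)
    gw = 2.0 * a * ((mask * X).T @ resid) / (n * m)
    gb = 2.0 * a * (mask.T @ resid) / (n * m)
    a, w, b = a - lr * ga, w - lr * gw, b - lr * gb
loss_final = float(np.mean((predict() - y) ** 2))
```

In the many-particle limit m → ∞, this discrete dynamics converges to the gradient flow on measures studied in the reference above.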

### Generative Modeling by Estimating Gradients of the Data Distribution

- Computer Science · NeurIPS
- 2019

A new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching, which allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons.

### A Dynamical Central Limit Theorem for Shallow Neural Networks

- Mathematics · NeurIPS
- 2020

It is proved that the deviations from the mean-field limit scaled by the width, in the width-asymptotic limit, remain bounded throughout training and eventually vanish in the CLT scaling if the mean-field dynamics converges to a measure that interpolates the training data.

### Generative Moment Matching Networks

- Computer Science · ICML
- 2015

This work formulates a method that generates an independent sample via a single feedforward pass through a multilayer perceptron, as in the recently proposed generative adversarial networks, using maximum mean discrepancy (MMD) to learn to generate codes that can then be decoded to produce samples.
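The MMD criterion used as the training signal in that work can be estimated directly from two samples. A short sketch of the (biased) squared-MMD estimator with a Gaussian kernel; the bandwidth and sample sizes are illustrative:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise squared distances, then the Gaussian (RBF) kernel matrix.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimator of squared MMD between sample sets X and Y."""
    kxx = gaussian_kernel(X, X, sigma).mean()
    kyy = gaussian_kernel(Y, Y, sigma).mean()
    kxy = gaussian_kernel(X, Y, sigma).mean()
    return float(kxx + kyy - 2.0 * kxy)

rng = np.random.default_rng(0)
# Near zero when both samples come from the same distribution...
same = mmd2(rng.standard_normal((200, 2)), rng.standard_normal((200, 2)))
# ...and clearly positive when one sample is shifted.
diff = mmd2(rng.standard_normal((200, 2)), rng.standard_normal((200, 2)) + 3.0)
```

In generative moment matching, this quantity is minimized over the generator's parameters, with gradients flowing through the generated sample Y.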

### GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

- Computer Science · NIPS
- 2017

This work proposes a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions and introduces the "Fréchet Inception Distance" (FID), which captures the similarity of generated images to real ones better than the Inception Score.
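The FID reduces to the Fréchet distance between two Gaussians fitted to Inception features: d² = ‖μ₁ − μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). A sketch of that Gaussian formula on toy 2-D statistics (the feature-extraction step is omitted; the function name is an illustrative choice), using the identity Tr((Σ₁Σ₂)^{1/2}) = Tr((Σ₁^{1/2} Σ₂ Σ₁^{1/2})^{1/2}) so only symmetric PSD square roots are needed:

```python
import numpy as np

def frechet_distance(mu1, C1, mu2, C2):
    """Squared Fréchet distance between N(mu1, C1) and N(mu2, C2)."""
    def sqrtm_psd(A):
        # Matrix square root of a symmetric PSD matrix via eigendecomposition.
        vals, vecs = np.linalg.eigh(A)
        return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    s1 = sqrtm_psd(C1)
    covmean = sqrtm_psd(s1 @ C2 @ s1)  # symmetric stand-in for (C1 C2)^{1/2}
    return float(((mu1 - mu2) ** 2).sum() + np.trace(C1 + C2 - 2.0 * covmean))

d_same = frechet_distance(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2))
d_shift = frechet_distance(np.zeros(2), np.eye(2), np.ones(2), np.eye(2))
```

Identical Gaussians give distance 0; a unit shift in each of the two coordinates with identity covariances gives exactly 2.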

### Efficient Learning of Sparse Representations with an Energy-Based Model

- Computer Science · NIPS
- 2006

A novel unsupervised method for learning sparse, overcomplete features is presented, using a linear encoder and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector.

### A mean field view of the landscape of two-layer neural networks

- Computer Science · Proceedings of the National Academy of Sciences
- 2018

A compact description of the SGD dynamics is derived in terms of a limiting partial differential equation that allows for “averaging out” some of the complexities of the landscape of neural networks and can be used to prove a general convergence result for noisy SGD.

### Neural Networks as Interacting Particle Systems: Asymptotic Convexity of the Loss Landscape and Universal Scaling of the Approximation Error

- Computer Science · ArXiv
- 2018

A Law of Large Numbers and a Central Limit Theorem for the empirical distribution are established, which together show that the approximation error of the network universally scales as O(n⁻¹) and the scale and nature of the noise introduced by stochastic gradient descent are quantified.