• Corpus ID: 235795741

Dual Training of Energy-Based Models with Overparametrized Shallow Neural Networks

Carles Domingo-Enrich, Alberto Bietti, Marylou Gabrié, Joan Bruna, Eric Vanden-Eijnden

Energy-based models (EBMs) are generative models that are usually trained via maximum likelihood estimation. This approach becomes challenging in generic situations where the trained energy is non-convex, due to the need to sample the Gibbs distribution associated with this energy. Using general Fenchel duality results, we derive variational principles dual to maximum likelihood EBMs with shallow overparametrized neural network energies, both in the feature-learning and lazy linearised regimes… 

Simultaneous Transport Evolution for Minimax Equilibria on Measures

This work establishes global convergence towards the global equilibrium by using simultaneous gradient ascent-descent with respect to the Wasserstein metric – a dynamics that admits efficient particle discretization in high dimensions, as opposed to entropic mirror descent.



On Energy-Based Models with Overparametrized Shallow Neural Networks

This work shows that models trained in the so-called ‘active’ regime provide a statistical advantage over their associated ‘lazy’ or kernel regime, leading to improved adaptivity to hidden low-dimensional structure in the data distribution, as already observed in supervised learning.

Implicit Generation and Generalization in Energy-Based Models

This work presents techniques to scale MCMC based EBM training on continuous neural networks, and shows its success on the high-dimensional data domains of ImageNet32x32, ImageNet128x128, CIFAR-10, and robotic hand trajectories, achieving better samples than other likelihood models and nearing the performance of contemporary GAN approaches.

How to Train Your Energy-Based Models

This tutorial starts by explaining maximum likelihood training with Markov chain Monte Carlo (MCMC), and proceeds to elaborate on MCMC-free approaches, including Score Matching and Noise Contrastive Estimation, to highlight theoretical connections among these three approaches.

On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

It is shown that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers and involves Wasserstein gradient flows, a by-product of optimal transport theory.

Generative Modeling by Estimating Gradients of the Data Distribution

A new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching, which allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons.

A Dynamical Central Limit Theorem for Shallow Neural Networks

It is proved that the deviations from the mean-field limit scaled by the width, in the width-asymptotic limit, remain bounded throughout training and eventually vanish in the CLT scaling if the mean-field dynamics converges to a measure that interpolates the training data.

Generative Moment Matching Networks

This work formulates a method that generates an independent sample via a single feedforward pass through a multilayer perceptron, as in the recently proposed generative adversarial networks, using MMD to learn to generate codes that can then be decoded to produce samples.

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

This work proposes a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions and introduces the "Fréchet Inception Distance" (FID), which captures the similarity of generated images to real ones better than the Inception Score.

Efficient Learning of Sparse Representations with an Energy-Based Model

A novel unsupervised method for learning sparse, overcomplete features using a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector.

A mean field view of the landscape of two-layer neural networks

A compact description of the SGD dynamics is derived in terms of a limiting partial differential equation that allows for “averaging out” some of the complexities of the landscape of neural networks and can be used to prove a general convergence result for noisy SGD.