• Corpus ID: 231855281

Oops I Took A Gradient: Scalable Sampling for Discrete Distributions

@article{Grathwohl2021OopsIT,
  title={Oops I Took A Gradient: Scalable Sampling for Discrete Distributions},
  author={Will Grathwohl and Kevin Swersky and Milad Hashemi and David Kristjanson Duvenaud and Chris J. Maddison},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.04509}
}
We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our…
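To make the idea concrete, the following is a minimal sketch of one such gradient-informed Metropolis-Hastings step on a binary vector, assuming a differentiable unnormalized log-probability f. It is not the authors' released code; the name gwg_step, the PyTorch dependency, and the softmax temperature of 2 are illustrative choices.

import torch

def gwg_step(x, log_prob, temperature=2.0):
    # x: (d,) float tensor with entries in {0., 1.}
    # log_prob: callable mapping x to a scalar tensor f(x) (unnormalized log-probability)
    x = x.detach().requires_grad_(True)
    fx = log_prob(x)
    grad, = torch.autograd.grad(fx, x)
    # First-order estimate of f(x with bit i flipped) - f(x): -(2 x_i - 1) * grad_i
    delta = -(2.0 * x - 1.0) * grad
    # Propose which bit to flip, favoring flips predicted to increase f
    q_fwd = torch.softmax(delta / temperature, dim=0)
    i = torch.multinomial(q_fwd, 1).item()
    x_new = x.detach().clone()
    x_new[i] = 1.0 - x_new[i]
    # Reverse-proposal probability for the Metropolis-Hastings correction
    x_new.requires_grad_(True)
    f_new = log_prob(x_new)
    grad_new, = torch.autograd.grad(f_new, x_new)
    delta_new = -(2.0 * x_new - 1.0) * grad_new
    q_rev = torch.softmax(delta_new / temperature, dim=0)
    # Accept or reject with the usual asymmetric-proposal ratio
    log_accept = (f_new - fx) + torch.log(q_rev[i]) - torch.log(q_fwd[i])
    if torch.rand(()) < log_accept.exp():
        return x_new.detach()
    return x.detach()

With an Ising-style log_prob such as lambda x: x @ (W @ x) + b @ x, repeatedly applying gwg_step yields an approximate sample from p(x) proportional to exp(log_prob(x)).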
Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis-Hastings
TLDR
This paper interprets MLMs as energy-based sequence models and proposes two energy parametrizations derivable from the trained MLMs, and develops a tractable sampling scheme based on the Metropolis–Hastings Monte Carlo algorithm.
Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation
TLDR
GFlowNet is proposed, based on a view of the generative process as a flow network, making it possible to handle the tricky case where different trajectories can yield the same final state, e.g., there are many ways to sequentially add atoms to generate some molecular graph.
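As a rough sketch of the flow-network view (standard flow-matching form, assuming rewards are attached only to terminal objects; not quoted from the paper's abstract): the learned edge flows F are constrained so that inflow equals outflow at every non-terminal state, while the flow terminating at an object x equals its reward, so sampling actions in proportion to F draws x with probability proportional to R(x):

\sum_{s' \in \mathrm{Parents}(s)} F(s' \to s) \;=\; \sum_{s'' \in \mathrm{Children}(s)} F(s \to s''), \qquad F(x \to \mathrm{stop}) = R(x).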
Path Auxiliary Proposal for MCMC in Discrete Space
TLDR
A path auxiliary algorithm is proposed that uses a composition of local moves to explore large neighborhoods and considerably outperforms other generic samplers on various discrete models for sampling, inference, and learning.
No Conditional Models for Me: Training Joint EBMs on Mixed Continuous and Discrete Data
TLDR
Experimental results are presented demonstrating that the proposed approach can successfully train joint energy-based models on high-dimensional data with structured supervision capable of both accurate prediction and conditional sampling.
A Langevin-like Sampler for Discrete Distributions
We propose discrete Langevin proposal (DLP), a simple and scalable gradient-based proposal for sampling complex high-dimensional discrete distributions. In contrast to Gibbs sampling-based methods, …
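The proposal itself is not reproduced in the truncated abstract above; a Langevin-style construction of this kind typically factorizes over coordinates and scores each candidate value by a first-order expansion of the log-probability f plus a locality penalty with step size \alpha, roughly

q(x' \mid x) \;=\; \prod_i \mathrm{Categorical}\!\left(x'_i \,;\; \mathrm{softmax}\Big(\tfrac{1}{2}\,\nabla_i f(x)\,(x'_i - x_i) - \tfrac{(x'_i - x_i)^2}{2\alpha}\Big)\right),

optionally followed by a Metropolis-Hastings correction. This should be read as an illustrative reconstruction rather than the paper's exact statement.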
Gradient Estimation with Discrete Stein Operators
TLDR
In benchmark generative modeling tasks such as training binary variational autoencoders, the gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.
Sampling from Discrete Energy-Based Models with Quality/Efficiency Trade-offs
TLDR
This work proposes a new approximate sampling technique, Quasi Rejection Sampling (QRS), that allows for a trade-off between sampling efficiency and sampling quality, while providing explicit convergence bounds and diagnostics and shows that it can sample from such EBMs with arbitrary precision at the cost of sampling efficiency.
Perturb-and-max-product: Sampling and learning in discrete energy-based models
TLDR
This work presents perturb-and-max-product (PMP), a parallel and scalable mechanism for sampling and learning in discrete EBMs, and shows that for Ising models, PMP is orders of magnitude faster than Gibbs and Gibbs-With-Gradients (GWG) at learning and generating samples of similar or better quality.
Adaptive random neighbourhood informed Markov chain Monte Carlo for high-dimensional Bayesian variable selection
TLDR
A framework is introduced for efficient Markov chain Monte Carlo algorithms targeting discrete-valued high-dimensional distributions, such as posterior distributions in Bayesian variable selection (BVS) problems, and a novel algorithm, the Adaptive Random Neighbourhood Informed sampler (ARNI), is described.
Statistical applications of contrastive learning
TLDR
An introduction to contrastive learning is provided, showing how it can be used to derive methods for diverse statistical problems, namely parameter estimation for energy-based models, Bayesian inference for simulator-based models, and experimental design.
...

References

SHOWING 1-10 OF 50 REFERENCES
Auto-Encoding Variational Bayes
TLDR
A stochastic variational inference and learning algorithm is introduced that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case.
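For reference, the objective being optimized is the evidence lower bound, with the expectation made differentiable through the reparameterization trick (standard form, not quoted from this summary):

\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big), \qquad z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,\; \epsilon \sim \mathcal{N}(0, I).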
Stein Variational Inference for Discrete Distributions
TLDR
The proposed framework transforms discrete distributions into equivalent piecewise-continuous distributions, on which gradient-free SVGD is applied to perform efficient approximate inference; it outperforms existing goodness-of-fit test methods for intractable discrete distributions.
No MCMC for me: Amortized sampling for fast and stable training of energy-based models
TLDR
This work presents a simple method for training EBMs at scale which uses an entropy-regularized generator to amortize the MCMC sampling typically used in EBM training, and improves upon prior MCMC-based entropy regularization methods with a fast variational approximation.
Implicit Generation and Generalization in Energy-Based Models
TLDR
This work presents techniques to scale MCMC based EBM training on continuous neural networks, and shows its success on the high-dimensional data domains of ImageNet32x32, ImageNet128x128, CIFAR-10, and robotic hand trajectories, achieving better samples than other likelihood models and nearing the performance of contemporary GAN approaches.
Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One
TLDR
This approach is the first to achieve performance rivaling the state-of-the-art in both generative and discriminative learning within one hybrid model.
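The reinterpretation behind the title is the standard construction in which a classifier's logits f_\theta(x)[y] are read as negative energies of a joint model (shown here for context, not quoted from the summary):

p_\theta(x, y) \;\propto\; \exp\!\big(f_\theta(x)[y]\big), \qquad p_\theta(x) \;\propto\; \sum_y \exp\!\big(f_\theta(x)[y]\big), \qquad E_\theta(x) \;=\; -\log \sum_y \exp\!\big(f_\theta(x)[y]\big).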
Learning the Stein Discrepancy for Training and Evaluating Energy-Based Models without Sampling
TLDR
A novel goodness-of-fit test is presented which outperforms existing methods on high-dimensional data, along with a novel method for training unnormalized models which scales more gracefully than existing methods.
Annealed importance sampling
TLDR
It is shown how one can use the Markov chain transitions for such an annealing sequence to define an importance sampler, which can be seen as a generalization of a recently-proposed variant of sequential importance sampling.
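In the usual formulation (sketched here for reference), one interpolates between a tractable start p_0 and the target p_K through intermediate unnormalized densities \tilde p_k, applies at each stage a transition kernel T_k that leaves p_k invariant, and accumulates the importance weight

w \;=\; \prod_{k=1}^{K} \frac{\tilde p_k(x_{k-1})}{\tilde p_{k-1}(x_{k-1})}, \qquad x_0 \sim p_0, \quad x_k \sim T_k(\cdot \mid x_{k-1}),

whose expectation is the ratio of normalizing constants Z_K / Z_0.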
Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
We propose a general purpose variational inference algorithm that forms a natural counterpart of gradient descent for optimization. Our method iteratively transports a set of particles to match the …
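The particle update from that line of work takes the standard form (shown for context)

x_i \;\leftarrow\; x_i + \epsilon\,\hat\phi(x_i), \qquad \hat\phi(x) \;=\; \frac{1}{n}\sum_{j=1}^{n}\Big[k(x_j, x)\,\nabla_{x_j}\log p(x_j) + \nabla_{x_j} k(x_j, x)\Big],

where p is the target density, k is a positive-definite kernel such as an RBF kernel, and \epsilon is a step size; the first term pulls particles toward high-density regions while the second repels them from one another.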
Stein Variational Gradient Descent Without Gradient
TLDR
A gradient-free variant of SVGD (GF-SVGD) is proposed, which replaces the true gradient with a surrogate gradient and corrects the induced bias by re-weighting the gradients in a proper form, shedding insight on the empirical choice of the surrogate gradient.
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
TLDR
A new estimation principle is presented to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity, which leads to a consistent (convergent) estimator of the parameters.
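Concretely, with noise distribution p_n and model log-density \log p_\theta (whose normalizing constant is treated as a learnable parameter), the objective is the logistic-regression log-likelihood (standard form, stated here for reference):

J(\theta) \;=\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log \sigma\big(G_\theta(x)\big)\big] + \mathbb{E}_{x \sim p_n}\big[\log\big(1 - \sigma(G_\theta(x))\big)\big], \qquad G_\theta(x) = \log p_\theta(x) - \log p_n(x),

where \sigma is the logistic sigmoid.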
...