Training Products of Experts by Minimizing Contrastive Divergence
@article{Hinton2002TrainingPO, title={Training Products of Experts by Minimizing Contrastive Divergence}, author={Geoffrey E. Hinton}, journal={Neural Computation}, year={2002}, volume={14}, pages={1771-1800} }
It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual expert models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is…
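For reference, the combination rule summarized above and the contrastive divergence update of the title can be written out as follows. The notation is a sketch in the spirit of the paper: $Q^0$ denotes the data distribution and $Q^1$ the distribution over one-step Gibbs reconstructions $\hat{\mathbf{d}}$.

\[
p(\mathbf{d} \mid \theta_1, \ldots, \theta_n) \;=\; \frac{\prod_m p_m(\mathbf{d} \mid \theta_m)}{\sum_{\mathbf{c}} \prod_m p_m(\mathbf{c} \mid \theta_m)},
\qquad
\Delta\theta_m \;\propto\; \Big\langle \frac{\partial \log p_m(\mathbf{d} \mid \theta_m)}{\partial \theta_m} \Big\rangle_{Q^0} \;-\; \Big\langle \frac{\partial \log p_m(\hat{\mathbf{d}} \mid \theta_m)}{\partial \theta_m} \Big\rangle_{Q^1}.
\]

The second expression is the contrastive divergence approximation: the intractable equilibrium statistics of the product model are replaced by statistics gathered after a single step of Gibbs sampling started from the data.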
4,547 Citations
Products of Experts
- Mathematics
- 2007
Note that Mixture of Expert Models are usually associated with conditional models where the experts are of the form p(y|x) and the mixture coefficients (known as gating functions) may depend on x as…
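As a point of contrast with the product rule above, the conditional mixture-of-experts form referred to here is, schematically (a textbook-style statement, not quoted from the cited note):

\[
p_{\mathrm{MoE}}(y \mid x) \;=\; \sum_k g_k(x)\, p_k(y \mid x), \quad g_k(x) \ge 0, \;\; \sum_k g_k(x) = 1,
\qquad\text{versus}\qquad
p_{\mathrm{PoE}}(y \mid x) \;\propto\; \prod_k p_k(y \mid x).
\]

The mixture averages the experts' predictions with input-dependent gating weights, while the product sharpens them, which is what makes the PoE renormalization (and hence sampling) hard.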
Multimodal Variational Autoencoders for Semi-Supervised Learning: In Defense of Product-of-Experts
- Computer Science, ArXiv
- 2021
A novel product-of-experts (PoE) based variational autoencoder with these desired properties is proposed, and an empirical evaluation shows that the PoE-based models can outperform the models they are contrasted with.
Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions
- Computer Science, ArXiv
- 2014
This work identifies four desirable properties that are important for scalability, expressiveness, and robustness when learning and inferring with a combination of multiple models, and shows that a gPoE of Gaussian processes has these qualities, while no other existing combination scheme satisfies all of them at the same time.
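As a rough sketch of the kind of fusion rule being evaluated, a generalized product of experts raises each expert to a weight $\beta_i$; for Gaussian predictive distributions $p_i(y \mid x) = \mathcal{N}(\mu_i, \sigma_i^2)$ this gives a closed-form fused prediction (standard Gaussian-product algebra; the particular choice of weights used in the paper is not reproduced here):

\[
p_{\mathrm{gPoE}}(y \mid x) \;\propto\; \prod_i p_i(y \mid x)^{\beta_i},
\qquad
\sigma_\ast^{-2} = \sum_i \beta_i \sigma_i^{-2},
\qquad
\mu_\ast = \sigma_\ast^{2} \sum_i \beta_i \sigma_i^{-2} \mu_i.
\]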
Combining Classifiers and Learning Mixture-of-Experts
- Computer Science, Encyclopedia of Artificial Intelligence
- 2009
The article gives a general sketch of two streams of studies, not only re-elaborating essential tasks, basic ingredients, and typical combining rules, but also suggesting a general combination framework to unify a number of typical classifier combination rules and several mixture-based learning models.
Efficient training methods for conditional random fields
- Computer Science
- 2008
This thesis investigates efficient training methods for conditional random fields with complex graphical structure, focusing on local methods which avoid propagating information globally along the graph, and proposes piecewise pseudolikelihood, a hybrid procedure which "pseudolikelihood-izes" the piecewise likelihood and is therefore more efficient if the variables have large cardinality.
Sequential Local Learning for Latent Graphical Models
- Computer Science, ArXiv
- 2017
This paper introduces two novel concepts, coined marginalization and conditioning, which can reduce the problem of learning a larger GM to that of a smaller one, leading to a sequential learning framework that repeatedly increases the learned portion of a given latent GM.
The Role of Mutual Information in Variational Classifiers
- Computer Science, ArXiv
- 2020
Bounds on the generalization error of classifiers relying on stochastic encodings trained on the cross-entropy loss are derived, providing an information-theoretic understanding of generalization in the so-called class of variational classifiers, which are regularized by a Kullback-Leibler (KL) divergence term.
Latent regression Bayesian network for data representation
- Computer Science, 2016 23rd International Conference on Pattern Recognition (ICPR)
- 2016
This work proposes a counterpart of RBMs, namely latent regression Bayesian networks (LRBNs), which have a directed structure, and employs the hard Expectation Maximization algorithm, which avoids the intractability of traditional EM by using max-out instead of sum-out to compute the data likelihood.
Multi-Conditional Learning for Joint Probability Models with Latent Variables
- Computer Science
- 2006
We introduce Multi-Conditional Learning, a framework for optimizing graphical models based not on joint likelihood or conditional likelihood, but on a product of several marginal…
Variational Noise-Contrastive Estimation
- Computer Science, AISTATS
- 2019
It is proved that VNCE can be used both for parameter estimation of unnormalised models and for posterior inference of latent variables, and that it has the same level of generality as standard VI, meaning that advances made there can be directly imported to the unnormalised setting.
References
Showing 1-10 of 34 references
Recognizing Hand-written Digits Using Hierarchical Products of Experts
- Computer Science, NIPS
- 2000
On the MNIST database, the system is comparable with current state-of-the-art discriminative methods, demonstrating that the product of experts learning procedure can produce effective generative models of high-dimensional data.
Rate-coded Restricted Boltzmann Machines for Face Recognition
- Computer Science, NIPS
- 2000
We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by…
Unsupervised Learning of Distributions of Binary Vectors Using 2-Layer Networks
- Computer Science, NIPS
- 1991
It is shown that arbitrary distributions of binary vectors can be approximated by the combination model, how the weight vectors in the model can be interpreted as high-order correlation patterns among the input bits, and how the combination machine can be used as a mechanism for detecting these patterns.
Using Generative Models for Handwritten Digit Recognition
- Computer Science, IEEE Trans. Pattern Anal. Mach. Intell.
- 1996
A method of recognizing handwritten digits by fitting generative models built from deformable B-splines with Gaussian "ink generators" spaced along the length of the spline, using a novel elastic matching procedure based on the expectation-maximization algorithm.
A Gradient-Based Boosting Algorithm for Regression Problems
- Computer Science, NIPS
- 2000
This work proposes an analogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm, proves that this method reduces training error, and compares it to other regression methods.
Products of Hidden Markov Models
- Computer Science, AISTATS
- 2001
A way of combining HMMs to form a distributed-state time-series model that can capture longer-range structure than a single HMM is presented, along with results on modelling character strings, a simple language task, and the symbolic family trees problem.
MAXIMUM ENTROPY
- Computer Science
- 2000
This work shows that for all general Bayesian networks the sequential maximum entropy model coincides with the unique joint distribution, and presents a new kind of maximum entropy model, computed sequentially.
Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images
- Physics, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 1984
The analogy between images and statistical mechanics systems is made, and the analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations, creating a highly parallel "relaxation" algorithm for MAP estimation.
Unsupervised learning of distributions
- Computer Science
- 1997
For asymptotically high dimensions N of the pattern space the distribution can be inferred exactly from p = O(N) examples up to a well-known remaining uncertainty in the preferential direction.