Training Products of Experts by Minimizing Contrastive Divergence

@article{Hinton2002TrainingPO,
  title={Training Products of Experts by Minimizing Contrastive Divergence},
  author={Geoffrey E. Hinton},
  journal={Neural Computation},
  year={2002},
  volume={14},
  pages={1771-1800}
}
It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual expert models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is… 
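The learning rule at the heart of the paper is the one-step contrastive divergence (CD-1) update. Below is a minimal sketch of CD-1 for a binary restricted Boltzmann machine, one concrete product of experts in which each hidden unit acts as an expert; the function names, batching, and learning rate are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of one-step contrastive divergence (CD-1) for a binary RBM.
# All names and hyperparameters are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
    """One CD-1 update on a batch of binary visible vectors v0 (shape n x V)."""
    # Positive phase: infer hidden units given the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # One Gibbs step: reconstruct visibles, then re-infer hiddens.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_hid)

    # Approximate gradient: <v h>_data - <v h>_one-step-reconstruction.
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```

The point of the rule is that the expensive equilibrium statistics required by maximum-likelihood learning are replaced by statistics gathered after a single Gibbs step away from the data.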
Products of Experts
Note that Mixture of Experts models are usually associated with conditional models where the experts are of the form p(y|x) and the mixture coefficients (known as gating functions) may depend on x…
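For contrast with the product rule described above, here is a sketch of the two combination schemes (the symbols g_m and f_m are assumed notation): a mixture of experts takes a gated convex combination of conditional experts, while a product of experts multiplies expert distributions and renormalizes.

```latex
% Mixture of experts: gated convex combination of conditional experts
p(y \mid x) = \sum_{m} g_m(x)\, p_m(y \mid x), \qquad \sum_{m} g_m(x) = 1
% Product of experts: multiply expert densities and renormalize over all data vectors c
p(d \mid \theta_1,\dots,\theta_n) =
  \frac{\prod_{m} f_m(d \mid \theta_m)}{\sum_{c} \prod_{m} f_m(c \mid \theta_m)}
```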
Multimodal Variational Autoencoders for Semi-Supervised Learning: In Defense of Product-of-Experts
TLDR
A novel product-of-experts (PoE) based variational autoencoder that has these desired properties is proposed and an empirical evaluation shows that the PoE based models can outperform the contrasted models.
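As a hedged illustration of why the PoE form is convenient in this multimodal setting (a standard construction, not necessarily this paper's exact parameterization): when each modality's encoder outputs a Gaussian over a shared latent code, the product of those Gaussians is again Gaussian with precision-weighted parameters, so any subset of modalities can be fused in closed form and missing modalities are handled by dropping their expert from the product.

```python
# Minimal sketch: product of Gaussian "experts" over a latent code, as used in
# PoE-style multimodal VAEs. Variable names are illustrative assumptions.
import numpy as np

def poe_gaussian(mus, logvars):
    """Fuse per-modality Gaussians N(mu_i, var_i) into one Gaussian.

    mus, logvars: arrays of shape (n_experts, latent_dim).
    Returns the mean and log-variance of the (renormalized) product.
    """
    precisions = np.exp(-np.asarray(logvars))      # 1 / var_i
    var = 1.0 / precisions.sum(axis=0)             # precisions add under the product
    mu = var * (np.asarray(mus) * precisions).sum(axis=0)
    return mu, np.log(var)

# Example: two modalities disagree; the fused estimate leans toward the
# more confident (lower-variance) expert.
mu, logvar = poe_gaussian(mus=[[0.0], [2.0]], logvars=[[0.0], [2.0]])
```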
Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions
TLDR
This work identifies four desirable properties that are important for scalability, expressiveness and robustness, when learning and inferring with a combination of multiple models and shows that gPoE of Gaussian processes has these qualities, while no other existing combination schemes satisfy all of them at the same time.
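A sketch of the generalized PoE combination rule the summary refers to, in assumed notation: each expert's predictive density is raised to a point-dependent weight before multiplying, which for Gaussian GP predictions reduces to weighting the precisions.

```latex
% Generalized product of experts: tempered product of predictive densities
p(y \mid x) \;\propto\; \prod_{i} p_i(y \mid x)^{\beta_i(x)}
% For Gaussian experts p_i = \mathcal{N}\!\left(\mu_i(x), \sigma_i^2(x)\right):
\sigma^{-2}(x) = \sum_i \beta_i(x)\, \sigma_i^{-2}(x), \qquad
\mu(x) = \sigma^{2}(x) \sum_i \beta_i(x)\, \sigma_i^{-2}(x)\, \mu_i(x)
```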
Combining Classifiers and Learning Mixture-of-Experts
  • Lei Xu, S. Amari
  • Computer Science
    Encyclopedia of Artificial Intelligence
  • 2009
TLDR
The article aims at a general sketch of two streams of studies, not only with a re-elaboration of essential tasks, basic ingredients, and typical combining rules, but also with a general combination framework suggested to unify a number of typical classifier combination rules and several mixture based learning models.
Efficient training methods for conditional random fields
TLDR
This thesis investigates efficient training methods for conditional random fields with complex graphical structure, focusing on local methods which avoid propagating information globally along the graph, and proposes piecewise pseudolikelihood, a hybrid procedure which "pseudolikelihood-izes" the piecewise likelihood and is therefore more efficient if the variables have large cardinality.
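For reference, a sketch of the standard pseudolikelihood objective that the "pseudolikelihood-ized" piecewise procedure builds on (notation assumed); the piecewise variant further restricts each conditional to a local piece of the graph.

```latex
% Pseudolikelihood: replace the intractable joint likelihood by a product of
% full conditionals, each of which is cheap to normalize
\ell_{\mathrm{PL}}(\theta) \;=\; \sum_{i} \log p_\theta\!\left(x_i \mid x_{\setminus i}\right)
```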
Sequential Local Learning for Latent Graphical Models
TLDR
This paper introduces two novel concepts, coined marginalization and conditioning, which can reduce the problem of learning a larger GM to that of a smaller one and leads to a sequential learning framework that repeatedly increases the learning portion of given latent GM.
The Role of Mutual Information in Variational Classifiers
TLDR
Bounds on the generalization error of classifiers relying on stochastic encodings trained with the cross-entropy loss are derived, providing an information-theoretic understanding of generalization in the so-called class of variational classifiers, which are regularized by a Kullback-Leibler (KL) divergence term.
Latent regression Bayesian network for data representation
  • S. Nie, Yue Zhao, Q. Ji
  • Computer Science
    2016 23rd International Conference on Pattern Recognition (ICPR)
  • 2016
TLDR
This work proposes a counterpart of RBMs, namely latent regression Bayesian networks (LRBNs), which have a directed structure and are trained with a hard Expectation-Maximization algorithm that avoids the intractability of traditional EM by maxing out, rather than summing out, the latent variables when computing the data likelihood.
Multi-Conditional Learning for Joint Probability Models with Latent Variables
We introduce Multi-Conditional Learning, a framework for optimizing graphical models based not on the joint likelihood or the conditional likelihood, but on a product of several marginal…
Variational Noise-Contrastive Estimation
TLDR
It is proved that VNCE can be used for both parameter estimation of unnormalised models and posterior inference of latent variables, and has the same level of generality as standard VI, meaning that advances made there can be directly imported to the un normalised setting.

References

Showing 1-10 of 34 references
Recognizing Hand-written Digits Using Hierarchical Products of Experts
TLDR
On the MNIST database, the system is comparable with current state-of-the-art discriminative methods, demonstrating that the product of experts learning procedure can produce effective generative models of high-dimensional data.
Connectionist Learning of Belief Networks
Rate-coded Restricted Boltzmann Machines for Face Recognition
We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by…
Unsupervised Learning of Distributions of Binary Vectors Using 2-Layer Networks
TLDR
It is shown that arbitrary distributions of binary vectors can be approximated by the combination model, how the weight vectors in the model can be interpreted as high-order correlation patterns among the input bits, and how the combination machine can be used as a mechanism for detecting these patterns.
Using Generative Models for Handwritten Digit Recognition
TLDR
A method of recognizing handwritten digits is presented that fits generative models built from deformable B-splines with Gaussian "ink generators" spaced along the length of the spline, using a novel elastic matching procedure based on the expectation-maximization algorithm.
A Gradient-Based Boosting Algorithm for Regression Problems
TLDR
This work proposes an analogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm, proves that the method reduces training error, and compares it to other regression methods.
Products of Hidden Markov Models
TLDR
A way of combining HMMs to form a distributed-state time series model that can capture longer-range structure than a single HMM is presented, along with results on modelling character strings, a simple language task, and the symbolic family trees problem.
Maximum Entropy
TLDR
This work shows that for all general Bayesian networks the sequential maximum entropy model coincides with the unique joint distribution, and presents a new kind of maximum entropy model, which is computed sequentially.
Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images
  • S. Geman, D. Geman
  • Physics
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 1984
TLDR
The analogy between images and statistical-mechanics systems is made; the analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations, creating a highly parallel "relaxation" algorithm for MAP estimation.
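To make the "relaxation" concrete, here is a minimal sketch (an assumption-laden illustration, not the paper's exact formulation) of one Gibbs sweep for a binary image prior with pairwise smoothness coupling and a noisy-observation term; lowering the temperature T over repeated sweeps anneals toward the MAP estimate.

```python
# Minimal sketch: one Gibbs ("stochastic relaxation") sweep over a binary image
# under an Ising-style smoothness prior plus a data term from a noisy copy.
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(x, y, beta=1.0, noise_weight=2.0, T=1.0):
    """x: current {-1,+1} image, y: noisy {-1,+1} observation. Updates x in place."""
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            # Sum of the 4-neighbourhood (free boundary conditions).
            nb = 0.0
            if i > 0:     nb += x[i - 1, j]
            if i < H - 1: nb += x[i + 1, j]
            if j > 0:     nb += x[i, j - 1]
            if j < W - 1: nb += x[i, j + 1]
            # Local field = prior coupling + evidence from the observation.
            field = beta * nb + noise_weight * y[i, j]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field / T))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x
```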
Unsupervised learning of distributions
TLDR
For asymptotically high dimensions N of the pattern space the distribution can be inferred exactly from p = O(N) examples up to a well-known remaining uncertainty in the preferential direction.