Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio, Pascal Lamblin, Dan Popovici, H. Larochelle

Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and…
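
The greedy layer-wise idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it swaps the DBN's Restricted Boltzmann Machines for tied-weight sigmoid autoencoders trained by plain batch gradient descent, and the function names (`train_autoencoder_layer`, `greedy_pretrain`), layer sizes, learning rate, and toy data are all assumptions chosen for illustration. What it shows is only the control flow: each layer is trained unsupervised on the output of the layers below it, which are held fixed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder_layer(data, n_hidden, epochs=50, lr=0.5, seed=0):
    """Train one tied-weight sigmoid autoencoder (hypothetical helper).

    Minimizes squared reconstruction error with full-batch gradient descent.
    Returns the encoder weights, the hidden bias, and the per-epoch loss.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_visible = data.shape
    W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
    b_hid = np.zeros(n_hidden)
    b_vis = np.zeros(n_visible)
    losses = []
    for _ in range(epochs):
        h = sigmoid(data @ W + b_hid)      # encode
        r = sigmoid(h @ W.T + b_vis)       # decode with the same (tied) weights
        err = r - data
        losses.append(float(np.mean(err ** 2)))
        d_r = err * r * (1.0 - r)          # gradient at decoder pre-activation
        d_h = (d_r @ W) * h * (1.0 - h)    # gradient at encoder pre-activation
        # W appears in both encoder and decoder, so the two gradients add.
        grad_W = data.T @ d_h + d_r.T @ h
        W -= lr * grad_W / n_samples
        b_hid -= lr * d_h.mean(axis=0)
        b_vis -= lr * d_r.mean(axis=0)
    return W, b_hid, losses

def greedy_pretrain(data, layer_sizes, **kwargs):
    """Pretrain a stack one layer at a time, bottom-up (hypothetical helper)."""
    layers = []
    rep = data
    for n_hidden in layer_sizes:
        W, b, _ = train_autoencoder_layer(rep, n_hidden, **kwargs)
        layers.append((W, b))
        # The trained layer is held fixed; its output is the next layer's input.
        rep = sigmoid(rep @ W + b)
    return layers, rep

# Toy usage on random binary data (an assumption, not a dataset from the paper).
toy = (np.random.default_rng(1).random((64, 20)) > 0.5).astype(float)
layers, top_rep = greedy_pretrain(toy, [12, 6], epochs=20)
```

In a full pipeline the pretrained weights in `layers` would typically initialize a supervised network that is then fine-tuned end to end; that step is omitted here.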

Exploring Strategies for Training Deep Neural Networks

These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input.

Learning Deep Architectures for AI

The motivations and principles regarding learning algorithms for deep architectures are discussed, in particular those that use unsupervised learning of single-layer models, such as Restricted Boltzmann Machines, as building blocks for deeper models such as Deep Belief Networks.

How do We Train Deep Architectures?

The motivations and principles regarding learning algorithms for deep architectures are discussed, in particular those that use unsupervised learning of single-layer models, such as Restricted Boltzmann Machines, as building blocks for deeper models such as Deep Belief Networks.

How deep is deep enough? - Optimizing deep neural network architecture

This work introduces a new measure, called the generalized discrimination value (GDV), which quantifies how well different object classes separate in each layer of a Deep Belief Network that was trained unsupervised on the MNIST data set.

Understanding the difficulty of training deep feedforward neural networks

The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.

Training Larger Networks for Deep Reinforcement Learning

This paper proposes a novel method that consists of 1) wider networks with DenseNet connection, 2) decoupling representation learning from training of RL, 3) a distributed training method to mitigate overfitting problems and shows that it can train very large networks that result in significant performance gains.

A Greedy Algorithm for Building Compact Binary Activated Neural Networks

This work studies binary activated neural networks in the context of regression tasks, provides guarantees on the expressiveness of these particular networks and proposes a greedy algorithm for building such networks for predictors having small resources needs, which provides compact and sparse predictors.

The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training

The experiments confirm and clarify the advantage of unsupervised pre-training, and empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.

Deep Complex Networks

This work relies on complex convolutions, presents algorithms for complex batch-normalization and complex weight initialization strategies for complex-valued neural nets, uses them in experiments with end-to-end training schemes, and demonstrates that such complex-valued models are competitive with their real-valued counterparts.

Conditional Computation in Deep and Recurrent Neural Networks

Two cases of conditional computation are explored – in the feed forward case, a technique is developed that trades off accuracy for potential computational benefits, and in the recurrent case, techniques that yield practical speed benefits on a language modeling task are demonstrated.



A Fast Learning Algorithm for Deep Belief Nets

A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.

Scaling learning algorithms towards AI

It is argued that deep architectures have the potential to generalize in non-local ways, i.e., beyond immediate neighbors, and that this is crucial in order to make progress on the kind of complex tasks required for artificial intelligence.

Convex Neural Networks

Training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem, which involves an infinite number of variables but can be solved by incrementally inserting a hidden unit at a time.

The Curse of Highly Variable Functions for Local Kernel Machines

We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smoothness prior - with similarity between examples

A Monte Carlo EM Approach for Partially Observable Diffusion Processes: Theory and Applications to Neural Networks

It is proposed that the main reason that recurrent neural networks have not worked well in engineering applications is that they implicitly rely on a very simplistic likelihood model, and the diffusion network approach proposed here is much richer and may open new avenues for applications of recurrent neural networks.

The Cascade-Correlation Learning Architecture

The Cascade-Correlation architecture has several advantages over existing algorithms: it learns very quickly, the network determines its own size and topology, it retains the structures it has built even if the training set changes, and it requires no back-propagation of error signals through the connections of the network.

Reducing the Dimensionality of Data with Neural Networks

This work describes an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

Practical issues in temporal difference learning

It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs, and which surpasses comparable networks trained on a massive human expert data set.

Training Products of Experts by Minimizing Contrastive Divergence

A product of experts (PoE) is an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary because it is hard even to approximate the derivatives of the renormalization term in the combination rule.