• Corpus ID: 5575601

# Understanding the difficulty of training deep feedforward neural networks

@inproceedings{Glorot2010UnderstandingTD,
title={Understanding the difficulty of training deep feedforward neural networks},
author={Xavier Glorot and Yoshua Bengio},
booktitle={AISTATS},
year={2010}
}
• Published in AISTATS 31 March 2010
• Computer Science
Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper versus less deep architectures. Most of the recent experimental results with deep architectures are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural…
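The initialization scheme this paper goes on to propose, now commonly called Xavier or Glorot initialization, draws weights uniformly with a range scaled by the layer's fan-in and fan-out. A minimal NumPy sketch (function name and layer sizes are illustrative):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Normalized initialization: W ~ U(-a, a) with
    a = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(256, 128)
# Var(W) = limit**2 / 3 = 2 / (fan_in + fan_out), which keeps the
# variance of activations and of back-propagated gradients roughly
# constant from layer to layer.
```

The scaling is chosen so that neither the forward signal nor the backward gradient grows or shrinks systematically with depth, which is the failure mode the paper analyzes.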
12,482 Citations

## Citations

On the importance of initialization and momentum in deep learning
• Computer Science
ICML
• 2013
It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
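A minimal sketch of classical momentum together with a slowly increasing momentum schedule in the spirit of that paper (the schedule constants, learning rate, and toy objective are illustrative, not the paper's exact recipe):

```python
import numpy as np

def momentum_sgd_step(w, v, grad, lr=0.01, mu=0.9):
    """Classical momentum: v <- mu*v - lr*grad; w <- w + v."""
    v = mu * v - lr * grad
    return w + v, v

def mu_schedule(t, mu_max=0.99):
    """Slowly increasing momentum: ramps toward mu_max as t grows
    (illustrative form of a 'slowly increasing schedule')."""
    return min(1.0 - 2.0 ** (-1 - np.log2(np.floor(t / 250.0) + 1)), mu_max)

# Toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is w.
w, v = np.array([5.0, -3.0]), np.zeros(2)
for t in range(1, 501):
    w, v = momentum_sgd_step(w, v, grad=w, lr=0.1, mu=mu_schedule(t))
```

Starting with small momentum and raising it later avoids the early instability that a large fixed momentum causes with a poor initialization.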
COMPRESSION IN DEEP NEURAL NETWORKS
• Computer Science
• 2019
More robust mutual information estimation techniques are developed that adapt to the hidden activity of neural networks and produce more sensitive measurements of activations from all functions, especially unbounded ones; these are used to explore compression in networks with a range of different activation functions.
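For concreteness, the simple histogram (binning) mutual-information estimator that such work improves upon can be sketched as follows (bin count is illustrative):

```python
import numpy as np

def binned_mutual_information(x, y, bins=16):
    """Histogram MI estimator in nats: discretize (x, y) into a 2-D
    histogram and compute sum p * log(p / (px * py))."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal of x
    py = p.sum(axis=0, keepdims=True)   # marginal of y
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

x = np.random.default_rng(1).normal(size=5000)
mi_self = binned_mutual_information(x, x)  # high: x determines itself
```

The known weakness of this baseline is its sensitivity to the binning, which is especially problematic for unbounded activations; that is the gap the more robust estimators target.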
The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
• Computer Science
ICML
• 2020
It is observed that the combination of batch normalization and skip connections reduces gradient confusion, which helps reduce the training burden of very deep networks with Gaussian initializations.
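The quantity being reduced can be made concrete: in that line of work, gradient confusion is governed by the most negative pairwise inner product between per-example gradients. A toy sketch (the vectors are illustrative):

```python
import numpy as np

def gradient_confusion(grads):
    """Most negative pairwise inner product among per-example
    gradients; strongly negative values mean examples 'fight'
    each other under SGD."""
    worst = 0.0
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            worst = min(worst, float(np.dot(grads[i], grads[j])))
    return worst

g = [np.array([1.0, 0.0]), np.array([0.8, 0.6]), np.array([-0.5, 0.2])]
c = gradient_confusion(g)
```

When all pairwise inner products are non-negative, a step on one example does not increase the loss on the others, which is why lower confusion eases training.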
Deep neural networks
This memoir provides a brief summary of the models and techniques considered classical in neural networks, including different architectures and training techniques as well as some related statistical methods, and presents the current state of the new deep networks.
New architectures for very deep learning
This thesis develops new architectures that, for the first time, allow very deep networks to be optimized efficiently and reliably and addresses two key issues that hamper credit assignment in neural networks: cross-pattern interference and vanishing gradients.
Effects of Sparse Initialization in Deep Belief Networks
• Computer Science
Comput. Sci.
• 2015
The motivation behind this research is the observation that sparse initialization (SI) has an impact on the features learned by a DBN during pretraining, together with the observation that networks achieve lower classification error after fine-tuning when pretraining starts from sparsely initialized weight matrices.
Optimization-Based Separations for Neural Networks
• Computer Science
ArXiv
• 2021
It is proved that when the data are generated by a distribution with radial symmetry which satisfies some mild assumptions, gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations, and where the hidden layer is held fixed throughout training.
Convergence Analysis of Two-layer Neural Networks with ReLU Activation
• Computer Science
NIPS
• 2017
A convergence analysis for SGD is provided on a rich subset of two-layer feedforward networks with ReLU activations characterized by a special structure called "identity mapping"; it proves that, if the input follows a Gaussian distribution, then with standard $O(1/\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps.
Training of Deep Neural Networks based on Distance Measures using RMSProp
• Computer Science
ArXiv
• 2017
This paper revisits neural networks built from layers based on distance measures and Gaussian activation functions, and shows that by using Root Mean Square Propagation (RMSProp) it is possible to efficiently learn such multi-layer neural networks.
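The RMSProp update referred to here divides each coordinate's step by a running root-mean-square of recent gradients. A minimal sketch (learning rate, decay, and toy objective are illustrative):

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: keep an exponential average s of squared gradients
    and normalize the step by sqrt(s)."""
    s = rho * s + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(s) + eps)
    return w, s

# Toy objective f(w) = 0.5 * (10*w0^2 + w1^2): badly scaled axes,
# exactly where per-coordinate step normalization helps.
w, s = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(2000):
    grad = np.array([10.0 * w[0], w[1]])
    w, s = rmsprop_step(w, s, grad, lr=0.01)
```

Because the effective step is roughly `lr` in every coordinate regardless of gradient scale, the steep and shallow directions make comparable progress.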
On the Power and Limitations of Random Features for Understanding Neural Networks
• Computer Science
NeurIPS
• 2019
This paper rigorously shows that random features cannot be used to learn even a single ReLU neuron with standard Gaussian inputs unless the network size is exponentially large, whereas a single neuron is learnable with gradient-based methods.

## References

Showing 1-10 of 26 references
Exploring Strategies for Training Deep Neural Networks
• Computer Science
J. Mach. Learn. Res.
• 2009
These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input.
Greedy Layer-Wise Training of Deep Networks
• Computer Science
NIPS
• 2006
These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.
The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training
• Computer Science
AISTATS
• 2009
The experiments confirm and clarify the advantage of unsupervised pre-training, and empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.
A Fast Learning Algorithm for Deep Belief Nets
• Computer Science
Neural Computation
• 2006
A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
Learning Multiple Layers of Features from Tiny Images
It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.
Learning Deep Architectures for AI
The motivations and principles regarding learning algorithms for deep architectures are discussed, in particular those exploiting as building blocks the unsupervised learning of single-layer models such as Restricted Boltzmann Machines, which are used to construct deeper models such as Deep Belief Networks.
Extracting and composing robust features with denoising autoencoders
• Computer Science
ICML '08
• 2008
This work introduces and motivates a new training principle for unsupervised learning of a representation, based on the idea of making the learned representations robust to partial corruption of the input pattern.
An empirical evaluation of deep architectures on problems with many factors of variation
• Computer Science
ICML '07
• 2007
A series of experiments indicate that these models with deep architectures show promise in solving harder learning problems that exhibit many factors of variation.
Learning long-term dependencies with gradient descent is difficult
• Computer Science
IEEE Trans. Neural Networks
• 1994
This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
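The mechanism can be seen with a few lines of arithmetic: backpropagating through T recurrent steps multiplies T per-step factors, and with sigmoid units each factor is at most 0.25. A minimal illustration (not the paper's full analysis):

```python
import numpy as np

def sigmoid_deriv(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# For a 1-unit recurrent net h_t = sigmoid(w * h_{t-1}), the gradient
# through T steps is a product of T factors w * sigma'(.). Even in the
# best case (sigma' = 0.25 at its steepest point) with w = 1.0, the
# gradient shrinks by 4x per step.
w = 1.0
factor = w * sigmoid_deriv(0.0)           # exactly 0.25
grads = [factor ** T for T in (1, 10, 50)]
```

At T = 50 the surviving gradient is below 1e-29, which is the "increasingly difficult problem" as dependency duration grows: the learning signal from distant time steps is numerically negligible.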
Learning representations by back-propagating errors
• Computer Science
Nature
• 1986
Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.
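A minimal instance of that rule, for a single linear layer under squared error (sizes, target, and learning rate are illustrative):

```python
import numpy as np

# For y = W @ x and loss E = 0.5 * ||y - t||^2, backpropagation gives
# dE/dW = (y - t) x^T; repeatedly stepping down this gradient shrinks
# the difference between the actual and desired output vectors.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x = np.array([0.5, -1.0, 2.0])
t = np.array([1.0, 0.0])

for _ in range(100):
    y = W @ x
    W -= 0.05 * np.outer(y - t, x)   # gradient step on W

err = 0.5 * np.sum((W @ x - t) ** 2)
```

Each step scales the output error by the fixed factor (1 - lr * ||x||^2), so for a sufficiently small learning rate the error decays geometrically toward zero.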