• Corpus ID: 5808102

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

@article{Ioffe2015BatchNA,
  title={Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift},
  author={Sergey Ioffe and Christian Szegedy},
  journal={ArXiv},
  year={2015},
  volume={abs/1502.03167}
}
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. [...] Key Method: Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a…
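For concreteness, a minimal NumPy sketch of the transform described in the abstract: standardize each feature with its mini-batch statistics, then apply a learned scale and shift. It omits the running averages used at inference time; the function name is ours.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization over a mini-batch (training mode).

    x     : (N, D) activations for a mini-batch of N examples
    gamma : (D,) learned scale
    beta  : (D,) learned shift
    """
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

# toy usage: strongly shifted/scaled inputs come out standardized
x = np.random.randn(32, 8) * 3.0 + 1.5
y = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1
```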
Batch Normalization: Is Learning An Adaptive Gain and Bias Necessary?
TLDR
The effects of the learnable parameters, gain and bias, on the training of various typical deep neural nets, including ALL-CNNs, Network In Network (NIN), and ResNets, are investigated, showing that there is little difference in either training convergence or final test accuracy when the BN layer following the final convolutional layer is removed from a convolutional neural network (CNN).
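As a minimal illustration of the question studied here, the sketch below makes the affine (gain/bias) step of BN optional; the function name and the NumPy setup are ours, not the paper's.

```python
import numpy as np

def bn(x, gamma=None, beta=None, eps=1e-5):
    """Batch norm with the affine step made optional."""
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    if gamma is None:              # no learnable gain/bias: output is just the standardized input
        return x_hat
    return gamma * x_hat + beta

x = np.random.randn(64, 16)
out_plain  = bn(x)                               # BN without gain/bias
out_affine = bn(x, np.ones(16), np.zeros(16))    # standard BN at its usual initialization
print(np.allclose(out_plain, out_affine))        # True: the two only diverge once gamma/beta are trained
```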
Accelerating Training of Deep Neural Networks with a Standardization Loss
TLDR
A standardization loss is proposed to replace existing normalization methods with a simple, secondary objective loss that accelerates training on both small- and large-scale image classification experiments, works with a variety of architectures, and is largely robust to training across different batch sizes.
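An illustrative guess at what such a secondary objective could look like: an auxiliary loss that penalizes per-feature deviation from zero mean and unit variance, added to the task loss. The exact formulation in the paper may differ.

```python
import numpy as np

def standardization_loss(activations, eps=1e-5):
    """Auxiliary loss pushing per-feature statistics toward (mean=0, std=1).

    Illustrative stand-in for the paper's secondary objective, not its exact form.
    """
    mu = activations.mean(axis=0)
    std = np.sqrt(activations.var(axis=0) + eps)
    return np.mean(mu ** 2) + np.mean((std - 1.0) ** 2)

h = np.random.randn(128, 32) * 2.0 + 0.5     # unstandardized hidden activations
print(standardization_loss(h))                # would be added to the task loss with a small weight
```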
Mean Shift Rejection: Training Deep Neural Networks Without Minibatch Statistics or Normalization
TLDR
This work introduces a new understanding of the cause of training instability and provides a technique that is independent of normalization and minibatch statistics, and for the first time shows that minibatches and normalization are unnecessary for state-of-the-art training.
Internal Covariate Shift Reduction in Encoder-Decoder Convolutional Neural Networks
TLDR
It is found that batch normalization increased the learning performance by 18% but also increased the training time in each epoch (iteration) by 26%.
Training Deep Neural Networks Without Batch Normalization
TLDR
The main purpose of this work is to determine if it is possible to train networks effectively when batch normalization is removed through adaption of the training process.
Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks
TLDR
This work exploits the observation that pre-activations before Rectified Linear Units follow a Gaussian distribution in deep networks, and that once the first- and second-order statistics of any given dataset are normalized, this normalization can be forward-propagated without the need to recalculate approximate statistics for the hidden layers.
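A rough sketch of that idea under the stated Gaussian assumption: with unit-norm weight rows and the closed-form moments of a ReLU applied to a standard normal, activations can be re-standardized analytically, with no mini-batch statistics. Function names and details here are ours, not the paper's.

```python
import numpy as np

RELU_MEAN = 1.0 / np.sqrt(2.0 * np.pi)          # E[max(z, 0)] for z ~ N(0, 1)
RELU_STD  = np.sqrt(0.5 - 1.0 / (2.0 * np.pi))  # std of max(z, 0) for z ~ N(0, 1)

def norm_prop_layer(x, W, b):
    """One layer of (assumed) normalization propagation: keep activations roughly
    standardized using closed-form Gaussian/ReLU moments instead of batch statistics."""
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm rows keep pre-activations ~N(0,1)
    pre = x @ W_hat.T + b
    post = np.maximum(pre, 0.0)                           # ReLU
    return (post - RELU_MEAN) / RELU_STD                  # re-standardize analytically

x = np.random.randn(256, 64)                  # assume standardized input
W, b = np.random.randn(64, 64), np.zeros(64)
h = norm_prop_layer(x, W, b)
print(h.mean().round(2), h.std().round(2))    # roughly 0 and 1, with no batch statistics used
```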
Training Faster by Separating Modes of Variation in Batch-Normalized Models
  • M. Kalayeh, M. Shah · IEEE Transactions on Pattern Analysis and Machine Intelligence · 2020
TLDR
This work studies BN from the viewpoint of Fisher kernels that arise from generative probability models, and proposes a mixture of Gaussian densities for batch normalization, which reduces the number of gradient updates required to reach the maximum test accuracy of the batch-normalized model.
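One illustrative reading of "a mixture of Gaussian densities for batch normalization": normalize each sample with the statistics of the mixture component it falls under. The sketch below uses hard assignments and fixed components for brevity; the paper's Fisher-kernel formulation is more involved.

```python
import numpy as np

def mixture_normalize(x, means, variances, eps=1e-5):
    """Normalize each sample with the statistics of its nearest diagonal-Gaussian component.

    Illustrative only: component assignment here is hard and the components are given,
    rather than estimated from the data as in the paper.
    """
    d = np.stack([((x - m) ** 2 / (v + eps)).sum(axis=1)
                  for m, v in zip(means, variances)], axis=1)
    k = d.argmin(axis=1)                         # hard-assign each sample to a component
    out = np.empty_like(x)
    for j, (m, v) in enumerate(zip(means, variances)):
        out[k == j] = (x[k == j] - m) / np.sqrt(v + eps)
    return out

# activations with two clear modes of variation
x = np.vstack([np.random.randn(100, 4) + 5.0, np.random.randn(100, 4) - 5.0])
means     = [np.full(4, 5.0), np.full(4, -5.0)]
variances = [np.ones(4), np.ones(4)]
print(mixture_normalize(x, means, variances).std().round(2))   # ~1 within each mode
```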
Layer Normalization
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization normalizes these activities using statistics computed over a mini-batch of training cases.
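For contrast with batch normalization, a minimal layer-normalization sketch: statistics are computed per example over its features, so the operation does not depend on the mini-batch at all.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Layer Normalization: normalize each example over its own features."""
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    return gain * (x - mu) / (sigma + eps) + bias

x = np.random.randn(4, 10) * 2.0 + 3.0        # works even with a batch of one example
print(layer_norm(x, np.ones(10), np.zeros(10)).mean(axis=1).round(3))  # ~0 per example
```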
Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks
TLDR
The Filter Response Normalization (FRN) layer is proposed, a novel combination of a normalization and an activation function that can be used as a replacement for other normalizations and activations, and outperforms BN and other alternatives in a variety of settings for all batch sizes.
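A sketch of the FRN-plus-TLU computation as commonly described: per-channel statistics over the spatial dimensions only (no mean subtraction, no batch statistics), an affine transform, then a learned threshold in place of ReLU's zero. Parameter shapes below are assumptions for the illustration.

```python
import numpy as np

def frn_tlu(x, gamma, beta, tau, eps=1e-6):
    """Filter Response Normalization followed by a Thresholded Linear Unit.

    x is (N, H, W, C); statistics are per example and per channel over H and W,
    so no batch dependence is introduced. gamma, beta, tau are per-channel parameters.
    """
    nu2 = np.mean(x ** 2, axis=(1, 2), keepdims=True)   # mean squared response per (example, channel)
    y = gamma * x / np.sqrt(nu2 + eps) + beta            # normalize, then affine transform
    return np.maximum(y, tau)                            # TLU: learned threshold replaces ReLU's zero

x = np.random.randn(2, 8, 8, 16)
out = frn_tlu(x, gamma=np.ones(16), beta=np.zeros(16), tau=np.zeros(16))
print(out.shape)
```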
Batch-normalized Mlpconv-wise supervised pre-training network in network
TLDR
A new deep architecture with enhanced model discrimination ability that is referred to as mlpconv-wise supervised pre-training network in network (MPNIN) is proposed, which may contribute to overcoming the difficulties of training deep networks by better initializing the weights in all the layers.

References

SHOWING 1-10 OF 33 REFERENCES
Deep Learning Made Easier by Linear Transformations in Perceptrons
TLDR
The usefulness of the transformations is confirmed: they make basic stochastic gradient learning competitive in speed with state-of-the-art learning algorithms, and they also seem to help find solutions that generalize better.
On the importance of initialization and momentum in deep learning
TLDR
It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
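As a toy illustration of momentum SGD with a slowly increasing momentum coefficient (the specific ramp below is a generic placeholder, not the schedule from the paper):

```python
import numpy as np

def sgd_momentum_schedule(grad_fn, w0, steps=500, lr=0.01, mu_max=0.99):
    """SGD with classical momentum whose coefficient is slowly raised toward mu_max."""
    w, v = w0.copy(), np.zeros_like(w0)
    for t in range(1, steps + 1):
        mu = min(mu_max, 1.0 - 1.0 / (t / 50.0 + 2.0))  # slowly increasing momentum
        v = mu * v - lr * grad_fn(w)
        w = w + v
    return w

# toy quadratic: minimize ||w||^2 (gradient 2w) from a unit start
w_final = sgd_momentum_schedule(lambda w: 2.0 * w, w0=np.ones(5))
print(float(np.abs(w_final).max()))  # close to the optimum at 0
```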
Dropout: a simple way to prevent neural networks from overfitting
TLDR
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Understanding the difficulty of training deep feedforward neural networks
TLDR
The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
Parallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging
TLDR
Another method is described, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow the periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.
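A toy sketch of the periodic parameter-averaging side of this setup; the NG-SGD preconditioner itself is not reproduced, and the worker loop is simulated serially.

```python
import numpy as np

def parallel_sgd_with_averaging(grad_fn, w0, workers=4, rounds=20, local_steps=10, lr=0.05):
    """Data-parallel SGD with periodic model averaging: each worker runs a few local
    SGD steps, then all worker models are averaged into the new global model."""
    rng = np.random.default_rng(0)
    w = w0.copy()
    for _ in range(rounds):
        local = []
        for _ in range(workers):
            wk = w.copy()
            for _ in range(local_steps):
                wk -= lr * grad_fn(wk, rng)      # local update on this worker's data
            local.append(wk)
        w = np.mean(local, axis=0)               # periodic parameter averaging
    return w

# toy objective: E||w - 1||^2 with noisy gradients
grad = lambda w, rng: 2.0 * (w - 1.0) + 0.1 * rng.standard_normal(w.shape)
print(parallel_sgd_with_averaging(grad, np.zeros(3)).round(2))   # ~[1, 1, 1]
```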
Mean-normalized stochastic gradient for large-scale deep learning
TLDR
This work proposes a novel second-order stochastic optimization algorithm based on analytic results showing that a non-zero mean of features is harmful for the optimization, and proves convergence of the algorithm in a convex setting.
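A small illustration of the underlying observation rather than of the MN-SGD algorithm itself: centering the features entering a linear layer, here with an assumed running mean. The second-order part of the method is not reproduced.

```python
import numpy as np

def mean_centered_linear(x, W, b, running_mean, momentum=0.99):
    """Subtract a running estimate of the feature mean before the linear transform.

    Only meant to illustrate the 'non-zero feature mean is harmful' observation;
    the helper name and running-mean bookkeeping are assumptions of this sketch.
    """
    running_mean = momentum * running_mean + (1 - momentum) * x.mean(axis=0)
    return (x - running_mean) @ W.T + b, running_mean

x = np.random.randn(32, 8) + 10.0            # strongly non-zero-mean features
W, b = np.random.randn(4, 8), np.zeros(4)
y, rm = mean_centered_linear(x, W, b, running_mean=np.zeros(8))
print(y.shape, rm.round(1))
```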
Natural Neural Networks
TLDR
A specific example is presented that employs a simple and efficient reparametrization of the neural network weights, implicitly whitening the representation obtained at each layer while preserving the feed-forward computation of the network.
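A sketch of the whitening idea behind such a reparametrization, using a ZCA-style transform estimated from a batch of activations; the paper folds a transform like this into the weights and re-estimates it periodically so the network's function is preserved.

```python
import numpy as np

def whitening_transform(h, eps=1e-5):
    """ZCA-style whitening matrix estimated from a batch of activations h of shape (N, D)."""
    c = h - h.mean(axis=0)
    cov = c.T @ c / len(h)
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T

h = np.random.randn(512, 6) @ np.random.randn(6, 6)    # correlated activations
U = whitening_transform(h)
white = (h - h.mean(axis=0)) @ U.T                      # whitened representation
print(np.round(white.T @ white / len(white), 1))        # approximately the identity matrix
```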
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
TLDR
This work proposes a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit and derives a robust initialization method that particularly considers the rectifier nonlinearities.
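A minimal sketch of the two ingredients named in this summary: the PReLU nonlinearity and the variance-2/fan_in initialization derived for rectifier networks. The scalar slope and the helper names are simplifications of this sketch.

```python
import numpy as np

def prelu(x, a):
    """Parametric ReLU: identity for positive inputs, learned slope a for negative ones."""
    return np.where(x > 0, x, a * x)

def he_init(fan_in, fan_out, seed=0):
    """Rectifier-aware initialization: weight variance 2 / fan_in, so signal magnitude
    stays roughly stable through stacked rectified layers."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

W = he_init(256, 256)
x = np.random.randn(64, 256)
pre = x @ W.T
h = prelu(pre, a=0.25)                        # a is a learned per-channel parameter in the paper
print(round(float(np.mean(pre ** 2)), 1))     # ~2.0; the rectifier then roughly halves it
```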
Knowledge Matters: Importance of Prior Information for Optimization
We explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed
Large Scale Distributed Deep Networks
TLDR
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
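A serial toy simulation of the asynchronous (Downpour-style) update pattern: workers compute gradients on stale parameter copies and push updates straight to a central copy without synchronization. The real system additionally shards parameters across machines; names and constants below are assumptions of this sketch.

```python
import numpy as np

def downpour_sgd_sim(grad_fn, w0, workers=4, steps=300, lr=0.02, refresh=3):
    """Serial simulation of asynchronous SGD with stale parameter copies."""
    rng = np.random.default_rng(0)
    w = w0.copy()
    copies = [w.copy() for _ in range(workers)]          # each worker's stale parameter copy
    for t in range(steps):
        k = t % workers                                  # which worker acts this tick
        if t % refresh == 0:
            copies[k] = w.copy()                         # occasional fetch of fresh parameters
        g = grad_fn(copies[k], rng)                      # gradient computed on stale parameters
        w -= lr * g                                      # applied directly to the central copy
    return w

# toy objective: E||w - 1||^2 with noisy gradients
grad = lambda w, rng: 2.0 * (w - 1.0) + 0.1 * rng.standard_normal(w.shape)
print(downpour_sgd_sim(grad, np.zeros(3)).round(2))      # close to the optimum at 1
```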