# Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

@article{Ioffe2015BatchNA, title={Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift}, author={Sergey Ioffe and Christian Szegedy}, journal={ArXiv}, year={2015}, volume={abs/1502.03167} }

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. [...] Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a…
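The per-mini-batch normalization the abstract describes can be sketched in NumPy as follows. This is a minimal training-time sketch, not the paper's full method (it omits the running statistics used at inference); `gamma` and `beta` are the learnable scale and shift, and `eps` is a small stabilizing constant whose value here is an assumption.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a (batch, features) array.

    Each feature is normalized to zero mean and unit variance over the
    mini-batch, then scaled by gamma and shifted by beta.
    """
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # learnable affine transform
```

Because `gamma` and `beta` are learned, the layer can recover the identity transform if normalization turns out to be harmful for a particular feature.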

## 27,060 Citations

Batch Normalization: Is Learning An Adaptive Gain and Bias Necessary?

- Computer Science · ICMLC
- 2018

The effects of the learnable parameters, gain and bias, on the training of several typical deep neural networks, including All-CNNs, Network in Network (NIN), and ResNets, are investigated; the results show little difference in either training convergence or final test accuracy when the BN layer following the final convolutional layer of a convolutional neural network (CNN) is removed.

Accelerating Training of Deep Neural Networks with a Standardization Loss

- Computer Science, Mathematics · ArXiv
- 2019

A standardization loss is proposed to replace existing normalization methods with a simple, secondary objective loss that accelerates training on both small- and large-scale image classification experiments, works with a variety of architectures, and is largely robust to training across different batch sizes.

Mean Shift Rejection: Training Deep Neural Networks Without Minibatch Statistics or Normalization

- Computer Science, Mathematics · ArXiv
- 2019

This work introduces a new understanding of the cause of training instability and provides a technique that is independent of normalization and minibatch statistics, showing for the first time that neither minibatch statistics nor normalization is necessary for state-of-the-art training.

Internal Covariate Shift Reduction in Encoder-Decoder Convolutional Neural Networks

- Computer Science · ACM Southeast Regional Conference
- 2017

It is found that batch normalization increased the learning performance by 18% but also increased the training time in each epoch (iteration) by 26%.

Training Deep Neural Networks Without Batch Normalization

- Computer Science, Mathematics · ArXiv
- 2020

The main purpose of this work is to determine whether networks can be trained effectively when batch normalization is removed, through adaptation of the training process.

Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks

- Mathematics, Computer Science · ICML
- 2016

This work exploits the observation that in deep networks the pre-activations before Rectified Linear Units follow a Gaussian distribution, and that once the first- and second-order statistics of a given dataset are normalized, this normalization can be forward-propagated without recalculating approximate statistics for the hidden layers.

Training Faster by Separating Modes of Variation in Batch-Normalized Models

- Computer Science, Mathematics · IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2020

This work studies BN from the viewpoint of Fisher kernels arising from generative probability models and proposes a mixture of Gaussian densities for batch normalization, which reduces the number of gradient updates required to reach the maximum test accuracy of the batch-normalized model.

Layer Normalization

- Computer Science, Mathematics · ArXiv
- 2016

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called…
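Layer normalization, the technique this citation introduces, normalizes across the features of a single example rather than across the mini-batch, which removes the batch dependence entirely. A minimal NumPy sketch, assuming a 2D `(batch, features)` input and the usual learnable scale and shift:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example over its own features (last axis).

    Unlike batch normalization, the statistics are computed per example,
    so the result is identical for any batch size, including batch size 1.
    """
    mu = x.mean(axis=-1, keepdims=True)    # per-example mean
    var = x.var(axis=-1, keepdims=True)    # per-example variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```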

Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks

- Computer Science, Mathematics · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020

The Filter Response Normalization (FRN) layer is proposed, a novel combination of a normalization and an activation function that can be used as a replacement for other normalizations and activations, and outperforms BN and other alternatives in a variety of settings for all batch sizes.
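The combination described above, a batch-independent normalization fused with an activation, can be sketched as follows. This is a simplified NumPy sketch of the FRN idea: each channel is divided by the root of its mean squared activation over the spatial dimensions, then passed through a thresholded linear unit; the parameter names and `eps` value are illustrative, not taken verbatim from the paper.

```python
import numpy as np

def frn_tlu(x, gamma, beta, tau, eps=1e-6):
    """Filter-Response-Normalization-style transform for (N, H, W, C) input.

    Normalizes by the mean squared activation over the spatial extent
    (no batch statistics involved), applies a learnable affine transform,
    then a thresholded linear unit max(., tau) as the activation.
    """
    nu2 = np.mean(x ** 2, axis=(1, 2), keepdims=True)  # per-sample, per-channel
    y = x / np.sqrt(nu2 + eps)                         # filter response normalization
    return np.maximum(gamma * y + beta, tau)           # thresholded linear unit
```

Because the statistics are computed per sample and per channel, the output does not change with batch size, which is the batch-dependence elimination the summary refers to.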

Batch-normalized Mlpconv-wise supervised pre-training network in network

- Computer Science · Applied Intelligence
- 2017

A new deep architecture with enhanced model-discrimination ability, referred to as the mlpconv-wise supervised pre-training network in network (MPNIN), is proposed; it may help overcome the difficulties of training deep networks by better initializing the weights in all layers.

## References

Showing 1–10 of 33 references

Deep Learning Made Easier by Linear Transformations in Perceptrons

- Computer Science · AISTATS
- 2012

The usefulness of the transformations is confirmed: they make basic stochastic gradient learning competitive in speed with state-of-the-art learning algorithms, and they also seem to help find solutions that generalize better.

On the importance of initialization and momentum in deep learning

- Computer Science · ICML
- 2013

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.

Dropout: a simple way to prevent neural networks from overfitting

- Computer Science · J. Mach. Learn. Res.
- 2014

It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

Understanding the difficulty of training deep feedforward neural networks

- Computer Science, Mathematics · AISTATS
- 2010

The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.

Parallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging

- Computer Science, Mathematics · ICLR
- 2015

Another method is described: an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which appears to let the periodic-averaging method work well while also substantially improving the convergence of SGD on a single machine.

Mean-normalized stochastic gradient for large-scale deep learning

- Computer Science · 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2014

This work proposes a novel second-order stochastic optimization algorithm based on analytic results showing that a non-zero mean of features is harmful for optimization, and proves convergence of the algorithm in a convex setting.

Natural Neural Networks

- Computer Science, Mathematics · NIPS
- 2015

A specific example is presented that employs a simple and efficient reparametrization of the neural network weights, implicitly whitening the representation obtained at each layer while preserving the feed-forward computation of the network.

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

- Computer Science · 2015 IEEE International Conference on Computer Vision (ICCV)
- 2015

This work proposes a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit and derives a robust initialization method that particularly considers the rectifier nonlinearities.

Knowledge Matters: Importance of Prior Information for Optimization

- Computer Science, Mathematics · J. Mach. Learn. Res.
- 2016

We explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed…

Large Scale Distributed Deep Networks

- Computer Science · NIPS
- 2012

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.