Corpus ID: 53104146

How Does Batch Normalization Help Optimization?

@inproceedings{Santurkar2018HowDB,
  title={How Does Batch Normalization Help Optimization?},
  author={Shibani Santurkar and Dimitris Tsipras and Andrew Ilyas and Aleksander Madry},
  booktitle={NeurIPS},
  year={2018}
}
Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly understood. The popular belief is that this effectiveness stems from controlling the change of the layers' input distributions during training to reduce the so-called "internal covariate shift". In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.
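
For readers who want to see the operation under discussion, here is a minimal NumPy sketch of the BatchNorm transform (per-feature standardization over the mini-batch, followed by a learned scale and shift); the variable names and toy data are illustrative, not taken from the paper.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift.

    x: (batch, features) pre-activations; gamma, beta: (features,) learned params.
    """
    mu = x.mean(axis=0)                  # per-feature mean over the batch
    var = x.var(axis=0)                  # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(64, 10) * 3.0 + 1.0                     # toy mini-batch
y = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))      # ~0 and ~1 per feature
```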
Batch Normalization Biases Deep Residual Networks Towards Shallow Paths
TLDR: The most important benefit of batch normalization is identified as arising in residual networks, where it dramatically increases the largest trainable depth, and a simple initialization scheme is developed which can train very deep residual networks without normalization.
Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization
TLDR: It is argued that this acceleration is due to the fact that Batch Normalization splits the optimization task into optimizing length and direction of the parameters separately, which allows gradient-based methods to leverage a favourable global structure in the loss landscape.
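
A small numeric check of the decoupling intuition, under the simplifying assumption of a single linear layer followed by BatchNorm: rescaling the incoming weights by any positive constant leaves the normalized output unchanged, so the output depends only on the weight direction while the length is handled separately.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Standardize each feature over the mini-batch (learned scale/shift omitted)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 16))          # mini-batch of inputs
w = rng.normal(size=(16, 8))            # weights of a linear layer

out = batch_norm(x @ w)
out_scaled = batch_norm(x @ (5.0 * w))  # same direction, five times the length

print(np.allclose(out, out_scaled, atol=1e-4))  # True: only the direction matters
```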
Batch Normalization Preconditioning for Neural Network Training
TLDR: A new method called Batch Normalization Preconditioning (BNP) is proposed, which applies normalization by conditioning the parameter gradients directly during training to improve the conditioning of the Hessian of the loss function and hence convergence during training.
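
The BNP algorithm itself is not reproduced here; purely as a hedged illustration of what "conditioning the parameter gradients directly" can look like, the hypothetical snippet below rescales a linear layer's weight gradient by per-feature batch statistics of its inputs before the update. The function name and the exact scaling rule are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def preconditioned_grad_step(w, x, grad_w, lr=0.1, eps=1e-5):
    """Illustrative diagonal preconditioning (not the BNP algorithm itself).

    Scales each row of the weight gradient by the inverse second moment of the
    corresponding input feature over the mini-batch, so differently scaled
    inputs contribute comparably to the update.
    """
    second_moment = (x ** 2).mean(axis=0) + eps       # (in_features,)
    precond_grad = grad_w / second_moment[:, None]    # rescale per input feature
    return w - lr * precond_grad

# toy usage: linear regression gradient dL/dW = x^T (x W - t) / batch
rng = np.random.default_rng(1)
x = rng.normal(size=(256, 4)) * np.array([0.1, 1.0, 10.0, 100.0])  # badly scaled inputs
w = np.zeros((4, 1))
t = x @ np.array([[1.0], [-2.0], [0.5], [0.05]])
grad_w = x.T @ (x @ w - t) / len(x)
w = preconditioned_grad_step(w, x, grad_w)
```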
Beyond BatchNorm: Towards a General Understanding of Normalization in Deep Learning
TLDR: A theoretical approach is taken, generalizing the known beneficial mechanisms of BatchNorm to several recently proposed normalization techniques, revealing a unified set of mechanisms that underpin the success of normalization methods in deep learning.
Accelerating Training of Deep Neural Networks with a Standardization Loss
TLDR: A standardization loss is proposed to replace existing normalization methods with a simple, secondary objective loss that accelerates training on both small- and large-scale image classification experiments, works with a variety of architectures, and is largely robust to training across different batch sizes.
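
A minimal sketch of what such a secondary standardization objective could look like, assuming it penalizes a layer's activations for deviating from zero mean and unit variance over the mini-batch; the exact form and weighting used in the paper may differ.

```python
import numpy as np

def standardization_loss(activations, eps=1e-5):
    """Zero when every feature has zero mean and unit variance over the batch."""
    mu = activations.mean(axis=0)
    var = activations.var(axis=0)
    return np.mean(mu ** 2) + np.mean((np.sqrt(var + eps) - 1.0) ** 2)

h = np.random.randn(32, 64) * 2.0 + 0.5                   # toy hidden activations
task_loss = 0.0                                           # placeholder for the primary objective
total_loss = task_loss + 0.1 * standardization_loss(h)    # lambda = 0.1 is illustrative
```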
Four Things Everyone Should Know to Improve Batch Normalization
TLDR: Four improvements to the generic form of Batch Normalization are identified, yielding performance gains across all batch sizes while requiring no additional computation during training.
Improving Batch Normalization with Skewness Reduction for Deep Neural Networks
TLDR: It is demonstrated that the performance of the network can be improved if the distributions of the output features in the same layer are similar, and a new normalization scheme is proposed: Batch Normalization with Skewness Reduction (BNSR).
Training Deep Neural Networks Without Batch Normalization
TLDR: The main purpose of this work is to determine whether it is possible to train networks effectively when batch normalization is removed, through adaptation of the training process.
Theoretical Understanding of Batch-normalization: A Markov Chain Perspective
TLDR: This work shows that BN has a direct effect on the rank of the pre-activation matrices of a neural network, and shows that the latter quantity is a good predictor for the optimization speed of training.
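
The quantity in question is straightforward to monitor; a small sketch, assuming plain NumPy tensors, of how one might measure the numerical rank of a layer's pre-activation matrix over a mini-batch.

```python
import numpy as np

def preactivation_rank(x, weights, tol=1e-6):
    """Numerical rank of the (batch x units) pre-activation matrix of one layer."""
    z = x @ weights                         # pre-activations before the nonlinearity
    return np.linalg.matrix_rank(z, tol=tol)

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 32))
w = rng.normal(size=(32, 32))
print(preactivation_rank(x, w))             # full rank (32) for a generic random layer
```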
Batch Normalization with Enhanced Linear Transformation
TLDR: This paper proposes to additionally consider each neuron's neighborhood when calculating the output of the linear transformation module of batch normalization, and proves that the resulting BNET accelerates the convergence of network training and enhances spatial information by assigning larger weights to the important neurons.
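
One speculative reading of this idea, sketched in 1-D below: replace BatchNorm's per-channel scale with a small per-channel (depthwise) filter, so each output also depends on the normalized values of neighbouring positions. All names, shapes, and the kernel size are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def enhanced_linear_transform(x_hat, kernels, beta):
    """Per-channel neighbourhood weighting in place of a scalar scale (1-D sketch).

    x_hat:   (batch, channels, length) already-normalized activations
    kernels: (channels, k) neighbourhood weights per channel, k odd
    beta:    (channels,) shift
    """
    b, c, n = x_hat.shape
    k = kernels.shape[1]
    pad = k // 2
    xp = np.pad(x_hat, ((0, 0), (0, 0), (pad, pad)))
    out = np.zeros_like(x_hat)
    for j in range(k):                                   # depthwise 1-D convolution
        out += kernels[:, j][None, :, None] * xp[:, :, j:j + n]
    return out + beta[None, :, None]

x_hat = np.random.randn(8, 4, 16)                        # toy batch-normalized activations
y = enhanced_linear_transform(x_hat, kernels=np.ones((4, 3)) / 3, beta=np.zeros(4))
```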

References

Showing 1-10 of 31 references
Towards a Theoretical Understanding of Batch Normalization
TLDR: This work identifies various problem instances in the realm of machine learning where, under certain assumptions, Batch Normalization can provably accelerate optimization with gradient-based methods and turns Batch Normalization from an effective practical heuristic into a provably converging algorithm for these settings.
Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization
TLDR: It is argued that this acceleration is due to the fact that Batch Normalization splits the optimization task into optimizing length and direction of the parameters separately, which allows gradient-based methods to leverage a favourable global structure in the loss landscape.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
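
For context on this reference, a compact sketch of the train-time versus inference-time behaviour the method prescribes: batch statistics during training, exponential moving averages of them at inference. The momentum value and class name below are illustrative.

```python
import numpy as np

class BatchNorm1D:
    """Minimal BatchNorm: batch statistics in training, running averages at inference."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

bn = BatchNorm1D(10)
_ = bn(np.random.randn(64, 10), training=True)    # updates running statistics
y = bn(np.random.randn(5, 10), training=False)    # uses running statistics
```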
Identity Matters in Deep Learning
TLDR: This work gives a strikingly simple proof that arbitrarily deep linear residual networks have no spurious local optima and shows that residual networks with ReLU activations have universal finite-sample expressivity, in the sense that the network can represent any function of its sample provided that the model has more parameters than the sample size.
On the importance of single directions for generalization
TLDR: It is found that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
TLDR: A reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction is presented, improving the conditioning of the optimization problem and speeding up convergence of stochastic gradient descent.
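
The reparameterization described here is compact enough to sketch directly: for a single weight vector it is w = g * v / ||v||, so the scalar g carries the length and v contributes only a direction.

```python
import numpy as np

def weight_norm(v, g):
    """Weight Normalization reparameterization: w = g * v / ||v||."""
    return g * v / np.linalg.norm(v)

v = np.random.randn(16)    # unconstrained direction parameter
g = 2.5                    # separately learned length
w = weight_norm(v, g)
print(np.linalg.norm(w))   # equals g: the length is controlled by g alone
```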
Layer Normalization
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization…
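
A minimal sketch of the layer normalization operation this entry refers to: statistics are computed across the features of each individual example rather than across the mini-batch, so the transform is independent of batch size. Names and toy shapes are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example over its feature dimension, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)      # per-example mean over features
    var = x.var(axis=-1, keepdims=True)      # per-example variance over features
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(4, 32)                   # works even with a batch of one
y = layer_norm(x, gamma=np.ones(32), beta=np.zeros(32))
```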
The Shattered Gradients Problem: If resnets are the answer, then what is the question?
TLDR: It is shown that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise, whereas the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly.
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
TLDR: This work investigates the cause of this generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization.
Self-Normalizing Neural Networks
TLDR: Self-normalizing neural networks (SNNs) are introduced to enable high-level abstract representations, and it is proved that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance, even in the presence of noise and perturbations.
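
The self-normalizing property rests on the SELU activation with fixed constants (lambda ≈ 1.0507, alpha ≈ 1.6733); a short sketch of it, with an empirical check on standard-normal input.

```python
import numpy as np

def selu(x, lam=1.0507, alpha=1.6733):
    """Scaled exponential linear unit used by self-normalizing networks."""
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

h = selu(np.random.randn(10000))
print(h.mean().round(3), h.std().round(3))   # close to 0 and 1 for standard-normal input
```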