Corpus ID: 220041504

Spherical Perspective on Learning with Batch Norm

Simon Roburin, Yann de Mont-Marin, Andrei Bursuc, Renaud Marlet, P. Pérez, Mathieu Aubry
Batch Normalization (BN) is a prominent deep learning technique. In spite of its apparent simplicity, its implications over optimization are yet to be fully understood. In this paper, we study the optimization of neural networks with BN layers from a geometric perspective. We leverage the radial invariance of groups of parameters, such as neurons for multi-layer perceptrons or filters for convolutional neural networks, and translate several popular optimization schemes on the $L_2$ unit…
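The radial invariance mentioned in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes a single linear neuron followed by batch normalization without affine terms and a squared loss, and the names `bn_neuron_loss` and `numerical_grad` are hypothetical helpers introduced here. Under these assumptions the loss is unchanged when the weights are rescaled, the gradient has no radial component, and the gradient shrinks in proportion to the weight norm.

```python
import numpy as np

def bn_neuron_loss(w, X, y):
    """Squared loss of one linear neuron followed by batch normalization.

    Assumption for illustration: no affine scale/shift and no epsilon in
    the variance, so the invariance to rescaling w is exact.
    """
    z = X @ w
    z_hat = (z - z.mean()) / np.sqrt(z.var())
    return np.mean((z_hat - y) ** 2)

def numerical_grad(f, w, h=1e-6):
    """Central finite-difference gradient of f at w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))   # toy batch of 64 inputs
y = rng.normal(size=64)        # toy targets
w = rng.normal(size=5)         # weights of the neuron

f = lambda v: bn_neuron_loss(v, X, y)

loss_w = f(w)
loss_scaled = f(3.0 * w)                   # same loss after rescaling w
grad_w = numerical_grad(f, w)
grad_scaled = numerical_grad(f, 3.0 * w)   # expected to be grad_w / 3
radial_component = float(grad_w @ w)       # expected to be close to 0

print(loss_w, loss_scaled, radial_component)
```

Since only the direction of `w` affects the loss in this toy setting, it is natural to describe the optimization in terms of the point `w / ||w||` on the unit sphere, which is the viewpoint the abstract refers to.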
Normalization Techniques in Training DNNs: Methodology, Analysis and Application
A unified picture of the main motivation behind different approaches from the perspective of optimization is provided, and a taxonomy for understanding the similarities and differences between them is presented.
Inductive Bias of Gradient Descent for Exponentially Weight Normalized Smooth Homogeneous Neural Nets
This paper shows that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate, and hence causes the weights to be updated in a way that prefers asymptotic relative sparsity, and demonstrates its potential applications in learning prunable neural networks.


Understanding Batch Normalization
It is shown that BN primarily enables training with larger learning rates, which is the cause for faster convergence and better generalization, and contrasts the results against recent findings in random matrix theory, shedding new light on classical initialization schemes and their consequences.
Norm matters: efficient and accurate normalization schemes in deep networks
A novel view is presented on the purpose and function of normalization methods and weight-decay, as tools to decouple weights' norm from the underlying optimized objective, and a modification to weight-normalization, which improves its performance on large-scale tasks.
Riemannian approach to batch normalization
This work proposes intuitive and effective gradient clipping and regularization methods for the proposed algorithm by utilizing the geometry of the Riemannian manifold, which provides a new learning rule that is more efficient and easier to analyze.
Theoretical Analysis of Auto Rate-Tuning by Batch Normalization
It is shown that even if the authors fix the learning rate of scale-invariant parameters to a constant, gradient descent still approaches a stationary point at a rate of $T^{-1/2}$ in the number of iterations $T$, asymptotically matching the best bound for gradient descent with well-tuned learning rates.
L2 Regularization versus Batch and Weight Normalization
It is shown that popular optimization methods such as Adam only partially eliminate the influence of normalization on the learning rate, and this leads to a discussion on other ways to mitigate this issue.
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
A reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction is presented, improving the conditioning of the optimization problem and speeding up convergence of stochastic gradient descent.
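The reparameterization described above is usually written as $w = g \, v / \|v\|$, with a scalar length $g$ and a direction vector $v$. The following is a minimal NumPy sketch of that formula and of the chain-rule gradients with respect to $g$ and $v$; the function names `weight_norm` and `weight_norm_grads` are hypothetical and the numeric values are arbitrary illustration data.

```python
import numpy as np

def weight_norm(v, g):
    """Build the effective weight w = g * v / ||v||."""
    return g * v / np.linalg.norm(v)

def weight_norm_grads(v, g, grad_w):
    """Chain-rule gradients of the loss w.r.t. g and v, given dL/dw."""
    norm_v = np.linalg.norm(v)
    grad_g = grad_w @ v / norm_v
    grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v
    return grad_g, grad_v

v = np.array([3.0, 4.0])        # direction parameter, ||v|| = 5
g = 2.0                         # length parameter
w = weight_norm(v, g)           # effective weight with ||w|| = g

grad_w = np.array([1.0, -1.0])  # arbitrary upstream gradient dL/dw
grad_g, grad_v = weight_norm_grads(v, g, grad_w)

print(w, np.linalg.norm(w), grad_g, grad_v @ v)
```

In this sketch the gradient with respect to `v` is orthogonal to `v`, which reflects the fact that rescaling `v` does not change `w`.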
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Group Normalization
Group Normalization can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks.
Decoupled Weight Decay Regularization
This work proposes a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function, and provides empirical evidence that this modification substantially improves Adam's generalization performance.
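The difference between an L2 penalty and decoupled weight decay can be seen in a single Adam step. The sketch below is an illustration under stated assumptions rather than a reference implementation: `adam_step` is a hypothetical helper, the decay term is scaled by the learning rate (one common convention), and the loss gradient is set to zero so that only the regularization acts on the parameters.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.1, decoupled=True):
    """One Adam update with either coupled (L2) or decoupled weight decay."""
    if not decoupled:
        # L2 regularization: the penalty gradient passes through the
        # adaptive normalization together with the loss gradient.
        grad = grad + weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    update = m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled decay: applied to the weights directly, outside the
        # adaptive normalization.
        update = update + weight_decay * theta
    return theta - lr * update, m, v

theta = np.array([1.0, 10.0])
zero_grad = np.zeros(2)                 # no loss gradient, decay only
state = np.zeros(2)

coupled_theta, _, _ = adam_step(theta, zero_grad, state, state, t=1,
                                decoupled=False)
decoupled_theta, _, _ = adam_step(theta, zero_grad, state, state, t=1,
                                  decoupled=True)

print(coupled_theta, decoupled_theta)
```

In this toy case the coupled variant moves both parameters by nearly the same absolute amount, because Adam's normalization cancels the magnitude of the penalty gradient, while the decoupled variant shrinks each parameter in proportion to its value.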