Spherical perspective on learning with normalization layers

@article{Roburin2022SphericalPO,
  title={Spherical perspective on learning with normalization layers},
  author={Simon Roburin and Yann de Mont-Marin and Andrei Bursuc and Renaud Marlet and Patrick P{\'e}rez and Mathieu Aubry},
  journal={Neurocomputing},
  year={2022},
  volume={487},
  pages={66-74}
}
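
Common normalization layers such as BatchNorm make a layer's output invariant to the scale of its incoming weights, so only the direction w/||w|| matters and training can be viewed as evolving on a hypersphere; this is the general setting the title refers to. A minimal numpy check of that scale invariance (the single pre-normalization neuron, random data, and omitted affine parameters below are simplifying assumptions, not the paper's setup):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 8))      # a batch of 32 inputs with 8 features
    w = rng.standard_normal(8)            # weights of one pre-normalization neuron

    def normalized_neuron(w, x):
        # Linear unit followed by batch normalization (affine parameters omitted).
        z = x @ w
        return (z - z.mean()) / z.std()

    out1 = normalized_neuron(w, x)
    out2 = normalized_neuron(3.7 * w, x)  # rescale the weights
    print(np.allclose(out1, out2))        # True: only the direction of w matters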

References

Showing 1-10 of 34 references

Deep Hyperspherical Learning

TLDR
Deep hyperspherical convolution networks (SphereNets), which are distinct from conventional inner-product-based convolutional networks, are introduced; SphereNet is shown to effectively encode discriminative representations and alleviate training difficulty, leading to easier optimization, faster convergence, and comparable (or even better) classification accuracy relative to its convolutional counterparts.
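
The hyperspherical idea can be sketched as replacing the inner product <w, x> with a function of the angle between kernel and input alone. The snippet below only illustrates that principle, using the linear angular response g(theta) = 1 - 2*theta/pi as one plausible variant; the function name and test vectors are my own:

    import numpy as np

    def sphere_response(w, x, eps=1e-8):
        # Respond only to the angle between kernel w and input patch x,
        # ignoring their magnitudes (linear angular response).
        cos = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x) + eps)
        theta = np.arccos(np.clip(cos, -1.0, 1.0))
        return 1.0 - 2.0 * theta / np.pi   # +1 when aligned, -1 when opposite

    w = np.array([1.0, 0.0])
    print(sphere_response(w, np.array([2.0, 0.0])))   # ~1.0, regardless of input scale
    print(sphere_response(w, np.array([0.0, 5.0])))   # 0.0, orthogonal input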

L2 Regularization versus Batch and Weight Normalization

TLDR
It is shown that popular optimization methods such as Adam only partially eliminate the influence of normalization on the learning rate, which leads to a discussion of other ways to mitigate this issue.
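
The interaction summarized here rests on the fact that a loss computed after a normalization layer is scale-invariant in the weights, so the gradient shrinks as the weight norm grows and L2 regularization acts mainly by keeping that norm (and hence the effective learning rate) under control. A small numerical illustration of the gradient scaling grad L(a*w) = grad L(w)/a, using a toy scale-invariant loss chosen purely for demonstration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 8))
    target = rng.standard_normal(64)

    def loss(w):
        # Toy scale-invariant loss: predictions depend on w only through w/||w||.
        pred = x @ (w / np.linalg.norm(w))
        return np.mean((pred - target) ** 2)

    def num_grad(f, w, h=1e-6):
        # Central finite differences, accurate enough for this check.
        g = np.zeros_like(w)
        for i in range(w.size):
            e = np.zeros_like(w)
            e[i] = h
            g[i] = (f(w + e) - f(w - e)) / (2 * h)
        return g

    w = rng.standard_normal(8)
    g1 = num_grad(loss, w)
    g2 = num_grad(loss, 2.0 * w)
    print(np.allclose(g2, g1 / 2.0, atol=1e-4))  # True: doubling ||w|| halves the gradient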

Norm matters: efficient and accurate normalization schemes in deep networks

TLDR
A novel view is presented of the purpose and function of normalization methods and weight decay as tools to decouple the weights' norm from the underlying optimized objective, together with a modification of weight normalization that improves its performance on large-scale tasks.

Riemannian approach to batch normalization

TLDR
Intuitive and effective gradient-clipping and regularization methods are proposed for an optimization algorithm that exploits the geometry of the Riemannian manifold, providing a new learning rule that is more efficient and easier to analyze.
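
The core recipe can be sketched as ordinary SGD constrained to the unit sphere: project the Euclidean gradient onto the tangent space at the current point, step along it, then retract back onto the sphere by renormalizing. The sketch below is a generic Riemannian-SGD step under those assumptions, not the paper's exact algorithm (which adds clipping and regularization adapted to the manifold):

    import numpy as np

    def riemannian_sgd_step(w, grad, lr):
        # One SGD step constrained to the unit sphere ||w|| = 1:
        # 1. project grad onto the tangent space at w (remove the radial component),
        # 2. move along the projected direction,
        # 3. retract onto the sphere by renormalizing.
        w = w / np.linalg.norm(w)                 # make sure we start on the sphere
        tangent = grad - np.dot(grad, w) * w      # tangent-space projection
        w_new = w - lr * tangent
        return w_new / np.linalg.norm(w_new)      # retraction

    w = np.array([1.0, 0.0, 0.0])
    w = riemannian_sgd_step(w, grad=np.array([0.0, 1.0, 0.0]), lr=0.1)
    print(w, np.linalg.norm(w))                   # stays on the unit sphere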

Understanding Batch Normalization

TLDR
It is shown that BN primarily enables training with larger learning rates, which is the cause of the faster convergence and better generalization; the results are contrasted with recent findings in random matrix theory, shedding new light on classical initialization schemes and their consequences.

Decoupled Networks

  • Weiyang Liu, Z. Liu, Le Song
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018
TLDR
A generic decoupled learning framework is proposed that models intra-class variation and semantic difference independently and learns the operator directly from data.
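
One way to read this decoupling is to split the usual inner product <w, x> = ||w||*||x||*cos(theta) into a magnitude term h(||w||, ||x||) and an angular term g(theta) that can be chosen or learned separately. The sketch below illustrates that reading; the particular choices of h and g are illustrative assumptions, not the operators proposed in the paper:

    import numpy as np

    def decoupled_operator(w, x, h, g, eps=1e-8):
        # Generic decoupled operator f(w, x) = h(||w||, ||x||) * g(theta),
        # separating the magnitudes from the angle between w and x.
        nw, nx = np.linalg.norm(w), np.linalg.norm(x)
        cos = np.dot(w, x) / (nw * nx + eps)
        theta = np.arccos(np.clip(cos, -1.0, 1.0))
        return h(nw, nx) * g(theta)

    w, x = np.array([3.0, 0.0]), np.array([1.0, 1.0])
    # The ordinary inner product is recovered with h = ||w||*||x|| and g = cos.
    print(decoupled_operator(w, x, h=lambda a, b: a * b, g=np.cos))  # ~3.0 == np.dot(w, x)
    # A variant that ignores magnitudes entirely and keeps only the angle:
    print(decoupled_operator(w, x, h=lambda a, b: 1.0, g=np.cos))    # ~0.707 = cos(45 deg)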

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

TLDR
A reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction is presented, improving the conditioning of the optimization problem and speeding up convergence of stochastic gradient descent.
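
The reparameterization in question writes every weight vector as w = g * v / ||v||, so the scalar g carries the length and v only the direction, and the two are optimized separately. A minimal numpy sketch of just the reparameterization (in practice, PyTorch users would typically reach for the built-in torch.nn.utils weight-norm utilities):

    import numpy as np

    def weight_norm(v, g):
        # Weight normalization: w = g * v / ||v||, decoupling length (g) from direction (v).
        return g * v / np.linalg.norm(v)

    v = np.array([3.0, 4.0])     # direction parameter (norm 5 here)
    g = 2.0                      # length parameter
    w = weight_norm(v, g)
    print(w, np.linalg.norm(w))  # [1.2 1.6] 2.0 -- the norm of w equals g exactly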

Adam: A Method for Stochastic Optimization

TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
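
Concretely, Adam keeps exponential moving averages of the gradient (first moment) and of its elementwise square (second moment), corrects their initialization bias, and scales each coordinate's step by the square root of the second-moment estimate. A minimal sketch of one update with the default hyperparameters; the toy quadratic objective in the usage loop is an assumption:

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # One Adam update: adaptive per-coordinate step from moment estimates.
        m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)                # bias correction, t starts at 1
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

    w = np.zeros(3)
    m, v = np.zeros(3), np.zeros(3)
    for t in range(1, 101):                          # minimize ||w - target||^2 / 2
        grad = w - np.array([1.0, -2.0, 0.5])
        w, m, v = adam_step(w, grad, m, v, t)
    print(w)                                         # slowly moving toward the target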

Group Normalization

TLDR
Group Normalization (GN) is presented as a simple alternative to BN that can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks.
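
Group Normalization divides the channels into groups and normalizes over each group's channels and spatial positions independently for every sample, which is why its statistics do not depend on the batch size. A minimal numpy sketch for NCHW tensors (the learnable per-channel affine that usually follows is omitted):

    import numpy as np

    def group_norm(x, num_groups, eps=1e-5):
        # Group Normalization for an NCHW tensor: normalize over (C//G, H, W) per sample.
        n, c, h, w = x.shape
        assert c % num_groups == 0, "channels must divide evenly into groups"
        x = x.reshape(n, num_groups, c // num_groups, h, w)
        mean = x.mean(axis=(2, 3, 4), keepdims=True)
        var = x.var(axis=(2, 3, 4), keepdims=True)
        x = (x - mean) / np.sqrt(var + eps)
        return x.reshape(n, c, h, w)

    x = np.random.default_rng(0).standard_normal((2, 6, 4, 4))
    y = group_norm(x, num_groups=3)
    print(y.shape, round(float(y[0, :2].mean()), 6))  # (2, 6, 4, 4) and ~0.0 per group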

How Does Batch Normalization Help Optimization?

TLDR
It is demonstrated that the distributional stability of layer inputs has little to do with the success of BatchNorm; instead, BatchNorm makes the optimization landscape significantly smoother, and this smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.