Corpus ID: 212644626

ReZero is All You Need: Fast Convergence at Large Depth

@article{Bachlechner2020ReZeroIA,
  title={ReZero is All You Need: Fast Convergence at Large Depth},
  author={Thomas C. Bachlechner and Bodhisattwa Prasad Majumder and H. H. Mao and G. Cottrell and Julian McAuley},
  journal={ArXiv},
  year={2020},
  volume={abs/2003.04887}
}
Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change…
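
Below is a minimal sketch of the zero-initialized residual gating idea the paper's title refers to: each residual branch is scaled by a single learnable scalar initialized to zero, so every block starts as the identity map. This is an illustrative PyTorch-style example, not the authors' reference implementation; the module and parameter names (ReZeroBlock, alpha, branch) are assumptions made here for clarity.

```python
# Sketch of a ReZero-style residual block: x_{i+1} = x_i + alpha_i * F(x_i),
# with alpha_i a learnable scalar initialized to zero. Names are illustrative.
import torch
import torch.nn as nn


class ReZeroBlock(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        # Any residual branch F(x) works; a small feed-forward net is used here.
        self.branch = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )
        # The single gating scalar, initialized to zero so the block starts
        # as the identity and signal propagates unchanged at initialization.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.branch(x)


if __name__ == "__main__":
    # Stacking many blocks still yields the identity map at initialization,
    # which is what lets very deep stacks begin training stably.
    model = nn.Sequential(*[ReZeroBlock(64, 256) for _ in range(128)])
    y = model(torch.randn(8, 64))
    print(y.shape)  # torch.Size([8, 64])
```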
Citations

On the Convergence of Deep Networks with Sample Quadratic Overparameterization
Rethinking Skip Connection with Layer Normalization
Rethinking Residual Connection with Layer Normalization
Evolving Normalization-Activation Layers
Deep Transformers with Latent Depth
Going deeper with Image Transformers
Multi-split Reversible Transformers Can Enhance Neural Machine Translation (Yuekai Zhao, Shuchang Zhou, Zhihua Zhang; EACL 2021)

References

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
Batch Normalization Biases Deep Residual Networks Towards Shallow Paths
Identity Mappings in Deep Residual Networks
Identity Matters in Deep Learning
Understanding the difficulty of training deep feedforward neural networks
The Emergence of Spectral Universality in Deep Networks
Learning Identity Mappings with Residual Gates