Corpus ID: 212644626

ReZero is All You Need: Fast Convergence at Large Depth

@article{Bachlechner2020ReZeroIA,
  title={ReZero is All You Need: Fast Convergence at Large Depth},
  author={Thomas C. Bachlechner and Bodhisattwa Prasad Majumder and Huanru Henry Mao and Garrison W. Cottrell and Julian J. McAuley},
  journal={ArXiv},
  year={2020},
  volume={abs/2003.04887}
}
  • Thomas C. Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, Julian J. McAuley
  • Published on arXiv, 2020
  • Mathematics, Computer Science
  • Deep networks have enabled significant performance gains across domains, but they often suffer from vanishing/exploding gradients. This is especially true for Transformer architectures, where depth beyond 12 layers is difficult to train without large datasets and computational budgets. In general, we find that inefficient signal propagation impedes learning in deep networks. In Transformers, multi-head self-attention is the main cause of this poor signal propagation. To facilitate deep signal propagation, we propose ReZero, a simple change to the architecture that initializes an arbitrary layer as the identity map, using a single additional learned parameter per layer.

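The gating described in the abstract corresponds to the paper's ReZero update, x_{i+1} = x_i + α_i·F(x_i), where α_i is a single trainable scalar per layer initialized to zero so that every layer starts as the identity map. The PyTorch snippet below is a minimal illustrative sketch, not the authors' implementation: the module name ReZeroBlock is hypothetical, and the feed-forward sublayer stands in for whatever residual branch F is used (in the paper, e.g. Transformer self-attention or MLP sublayers).

```python
# Minimal sketch of a ReZero-style residual connection (illustrative only):
# x_{i+1} = x_i + alpha_i * F(x_i), with alpha_i = 0 at initialization.
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # F(x): a small feed-forward sublayer used here as a placeholder for
        # any residual branch (self-attention, MLP, conv block, ...).
        self.f = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim),
        )
        # Single trainable scalar per layer, initialized to zero.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # At initialization alpha == 0, so the block is exactly the identity;
        # training gradually opens the residual branch.
        return x + self.alpha * self.f(x)

# Stacking many such blocks: at initialization the whole stack is the identity,
# so the output equals the input regardless of depth.
model = nn.Sequential(*[ReZeroBlock(dim=128, hidden_dim=512) for _ in range(64)])
x = torch.randn(8, 128)
y = model(x)  # y == x at init, since every alpha is zero
```
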

    References

    Publications referenced by this paper (7 of 27 references shown):

    • Character-Level Language Modeling with Deeper Self-Attention (highly influential)
    • Attention is All you Need (highly influential)
    • Deep Residual Learning for Image Recognition (highly influential)
    • Language Models are Unsupervised Multitask Learners (highly influential)
    • Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (highly influential)
    • On Layer Normalization in the Transformer Architecture
    • Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs