Corpus ID: 212736912

Rethinking Batch Normalization in Transformers

@article{Shen2020RethinkingBN,
  title={Rethinking Batch Normalization in Transformers},
  author={Sheng Shen and Zhewei Yao and Amir Gholami and Michael W. Mahoney and Kurt Keutzer},
  journal={ArXiv},
  year={2020},
  volume={abs/2003.07845}
}
  • Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
  • Published 2020
  • Computer Science
  • ArXiv
  • The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN). This is different from batch normalization (BN), which is widely adopted in Computer Vision. The preferred use of LN in NLP is principally due to the empirical observation that a naive/vanilla use of BN leads to significant performance degradation for NLP tasks; however, a thorough understanding of the underlying reasons for this is not always evident. A minimal sketch contrasting the two normalization axes appears below.
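To make the distinction concrete, here is a minimal PyTorch sketch; it is not taken from the paper, and the tensor shapes, module choices, and variable names are illustrative assumptions. It only shows the axes over which LN and BN compute their statistics on a Transformer-shaped activation.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not from the paper): contrast the axes over which
# layer normalization (LN) and batch normalization (BN) compute statistics.
# Shapes follow a typical Transformer activation: (batch, seq_len, d_model).
batch, seq_len, d_model = 32, 128, 512
x = torch.randn(batch, seq_len, d_model)

# LN: mean/variance are computed per token over the feature dimension,
# so the output for one example never depends on the rest of the mini-batch.
ln = nn.LayerNorm(d_model)
y_ln = ln(x)

# BN: mean/variance are computed per feature over the batch (and, here, the
# sequence) dimension, so each token's output depends on which other
# examples happen to be in the mini-batch.
bn = nn.BatchNorm1d(d_model)           # BatchNorm1d expects (N, C, L)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)

print(y_ln.shape, y_bn.shape)          # both torch.Size([32, 128, 512])
```

The point of the contrast: LN's statistics are per example and batch-independent, whereas BN's outputs depend on the composition of the mini-batch, which is the setting in which the paper examines why a naive use of BN degrades performance on NLP tasks.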
