DeepNet: Scaling Transformers to 1,000 Layers

@article{Wang2022DeepNetST,
  title={DeepNet: Scaling Transformers to 1,000 Layers},
  author={Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.00555}
}
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DEEPNORM) that modifies the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DEEPNORM a…
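As a concrete illustration of the DEEPNORM formulation described above, here is a minimal PyTorch-style sketch of the residual wrapper x_{l+1} = LN(alpha * x_l + G_l(x_l)) together with the depth-dependent constants the paper reports for the encoder-only case (alpha = (2N)^(1/4), beta = (8N)^(-1/4)). The module and helper names are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn


class DeepNormResidual(nn.Module):
    """Wraps a sublayer as x_{l+1} = LayerNorm(alpha * x + sublayer(x))."""

    def __init__(self, sublayer: nn.Module, d_model: int, alpha: float):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = alpha
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))


def deepnorm_coefficients(num_layers: int) -> tuple[float, float]:
    # Encoder-only setting: alpha scales the residual branch, beta rescales the
    # initialization of the feed-forward and value/output projection weights.
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta


def scale_init_(linear: nn.Linear, beta: float) -> None:
    # Xavier init rescaled by beta (illustrative helper, not the authors' code).
    nn.init.xavier_normal_(linear.weight, gain=beta)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)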
FoundationLayerNorm: Scaling BERT and GPT to 1,000 Layers
TLDR
The proposed FoundationLayerNorm method enables efficient training of deep neural networks, is validated at the 1,000-layer scale, and successfully scales up BERT and GPT to 1,000 layers, an order of magnitude deeper than previous BERT or GPT models.
Language Models are General-Purpose Interfaces
TLDR
This work proposes to use language models as a general-purpose interface to various foundation models, jointly pretraining the interface and the modular encoders to subsume the advantages and capabilities of both causal and non-causal modeling.
Deeper vs Wider: A Revisit of Transformer Configuration
TLDR
Bamboo, the idea of using deeper and narrower Transformer configurations for masked autoencoder training, is proposed and shown to be effective in alleviating the over-smoothing issue in deep Transformer training.
Insights into Pre-training via Simpler Synthetic Tasks
TLDR
This work performs three experiments that iteratively simplify pre-training and shows that the simplifications still retain much of its gains, including LIME, the best synthetic pre-training method.
Scaling ResNets in the Large-depth Regime
TLDR
This analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index, and exhibits a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
TLDR
It is shown that rank collapse of the tokens’ representations hinders training by causing the gradients of the queries and keys to vanish at initialization, and it is revealed that architectural hyperparameters affect the gradients of queries and values differently, leading to disproportionate gradient norms.
Entangled Residual Mappings
TLDR
While entangled mappings can preserve the iterative refinement of features across various deep models, they influence the representation learning process in convolutional networks differently than in attention-based models and recurrent neural networks.
VL-BEiT: Generative Vision-Language Pretraining
TLDR
A vision-language foundation model called VL-BEiT, a bidirectional multimodal Transformer learned by generative pretraining, is introduced; it effectively leverages monomodal data such as images and texts as well as multimodal data such as image-text pairs.
Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
TLDR
The Squeezeformer model is proposed, which consistently outperforms state-of-the-art ASR models under the same training schemes; it adopts a Temporal U-Net structure and an efficient depth-wise downsampling layer to sub-sample the input signal.
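For the downsampling mechanism mentioned in the TLDR above, here is a hedged sketch of what a depth-wise downsampling layer can look like: a stride-2 depthwise convolution that roughly halves the temporal length of a (batch, time, channels) feature sequence. The kernel size and layout are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn


class DepthwiseDownsample(nn.Module):
    """Stride-2 depthwise Conv1d that roughly halves the sequence length."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              stride=2, padding=kernel_size // 2, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, T, C) -> (B, C, T) for Conv1d, downsample, then back to (B, T', C).
        return self.conv(x.transpose(1, 2)).transpose(1, 2)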
On Layer Normalizations and Residual Connections in Transformers
TLDR
This study investigates the reason for the discrepant observations empirically and theoretically, proposes a method that provides both higher stability and effective training via a simple modification of Post-LN, and demonstrates that the method outperforms Pre-LN and trains stably regardless of shallow or deep layer settings.
...

References

NormFormer: Improved Transformer Pretraining with Extra Normalization
TLDR
The proposed NormFormer architecture, which adds three normalization operations to each layer: a LayerNorm after self-attention, head-wise scaling of self-attention outputs, and a LayerNorm after the first fully connected layer, improves pretraining perplexity and downstream task performance for both causal and masked language models.
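Since the TLDR above names the three added operations explicitly, a hedged sketch of a NormFormer-style pre-LN block is given below: an extra LayerNorm after self-attention, a learned per-head scaling of the attention outputs, and a LayerNorm after the first fully connected layer. The dimensions, the hand-rolled attention, and the residual arrangement are simplifying assumptions for illustration, not the paper's exact code.

import math
import torch
import torch.nn as nn


class NormFormerBlock(nn.Module):
    """Pre-LN Transformer block with NormFormer's three extra operations (sketch)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.pre_attn_norm = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.head_scale = nn.Parameter(torch.ones(n_heads))  # head-wise scaling
        self.out_proj = nn.Linear(d_model, d_model)
        self.post_attn_norm = nn.LayerNorm(d_model)           # extra LN after attention
        self.pre_ffn_norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_ff)
        self.mid_ffn_norm = nn.LayerNorm(d_ff)                # extra LN after first FC
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(self.pre_attn_norm(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))                        # (B, H, T, Dh)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        heads = (attn @ v) * self.head_scale.view(1, -1, 1, 1)  # scale each head's output
        attn_out = self.out_proj(heads.transpose(1, 2).reshape(b, t, d))
        x = x + self.post_attn_norm(attn_out)
        h = self.fc2(self.mid_ffn_norm(torch.relu(self.fc1(self.pre_ffn_norm(x)))))
        return x + h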
Beyond English-Centric Multilingual Machine Translation
TLDR
This work creates a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages and explores how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models.
Understanding the Difficulty of Training Transformers
TLDR
It is revealed that, for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable since it amplifies small parameter perturbations and results in significant disturbances in the model output, yet a light dependency limits the potential of model training and can lead to an inferior model.
The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
TLDR
The Flores-101 evaluation benchmark is introduced, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of topics and domains; it enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems.
Improving Transformer Optimization Through Better Initialization
TLDR
This work investigates and empirically validates the source of optimization problems in the encoder-decoder Transformer architecture; it proposes a new weight initialization scheme with theoretical justification that enables training without warmup or layer normalization and achieves leading accuracy.
Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation
TLDR
It is argued that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics, and this bottleneck is overcome via language-specific components and deepening NMT architectures.
Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
TLDR
Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU, while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt.
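The TLDR above refers to DS-Init; a minimal, heavily hedged sketch of a depth-scaled initialization in that spirit follows, where layer l's weights start with their scale shrunk by a depth-dependent factor (here a 1/sqrt(l) gain). The exact scaling rule in the paper may differ; both the helper name and the rule are illustrative assumptions.

import math
import torch.nn as nn


def depth_scaled_init_(linear: nn.Linear, depth: int, alpha: float = 1.0) -> None:
    # Deeper layers (larger depth index) start with proportionally smaller weights.
    nn.init.xavier_uniform_(linear.weight, gain=alpha / math.sqrt(depth))
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)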
Findings of the 2019 Conference on Machine Translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any…
Learning Deep Transformer Models for Machine Translation
TLDR
It is claimed that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next.
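Point 2) in the TLDR above, passing a combination of previous layers to the next, can be sketched as follows: each layer l+1 receives a learned weighting over the (normalized) outputs of all earlier layers instead of only the immediately preceding one. The softmax weighting and per-layer LayerNorms are illustrative assumptions, not necessarily the paper's exact formulation.

import torch
import torch.nn as nn


class LayerCombiner(nn.Module):
    """Builds the input of layer l+1 from a learned mix of outputs y_0 .. y_l."""

    def __init__(self, max_layers: int, d_model: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(max_layers, max_layers))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(max_layers)])

    def forward(self, outputs: list[torch.Tensor]) -> torch.Tensor:
        l = len(outputs)                                   # outputs of layers 0 .. l-1
        w = torch.softmax(self.logits[l - 1, :l], dim=-1)  # weights for the next layer
        return sum(w[i] * self.norms[i](y) for i, y in enumerate(outputs))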
Fixup Initialization: Residual Learning Without Normalization
TLDR
This work proposes fixed-update initialization (Fixup), motivated by solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialization; Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.
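Following the TLDR above, here is a hedged sketch of Fixup-style rescaling for the common two-layer residual branch: the first projection keeps a standard init scaled down by L^(-1/2) for an L-block network, and the branch's last projection is zeroed so every block starts near the identity without any normalization layers. The helper name and the restriction to the two-layer (m = 2) case are illustrative assumptions.

import torch.nn as nn


def fixup_init_branch_(fc_in: nn.Linear, fc_out: nn.Linear, num_blocks: int) -> None:
    nn.init.kaiming_normal_(fc_in.weight)
    fc_in.weight.data.mul_(num_blocks ** -0.5)  # depth-dependent rescaling, L^(-1/2)
    nn.init.zeros_(fc_out.weight)               # last layer of the branch starts at zero
    for lin in (fc_in, fc_out):
        if lin.bias is not None:
            nn.init.zeros_(lin.bias)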
...