Corpus ID: 233219888

Lessons on Parameter Sharing across Layers in Transformers

Sho Takase, Shun Kiyono
We propose a novel parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique, which shares the parameters of one layer with all layers, as in Universal Transformers (Dehghani et al., 2019), to improve efficiency. We propose three strategies for assigning parameters to each layer: SEQUENCE, CYCLE, and CYCLE (REV). Experimental results show that the proposed strategies are efficient in terms of the parameter size and computational…
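The abstract names the three assignment strategies but does not define them. The sketch below shows one plausible reading, assuming N layers draw from M independent parameter sets: SEQUENCE gives consecutive layers the same set, CYCLE repeats the M sets in order, and CYCLE (REV) does the same except that the last M layers use the sets in reverse order. The function name `assign_parameters`, its arguments, and the rounding used for SEQUENCE are hypothetical and chosen for illustration; they are not taken from the authors' implementation.

```python
from typing import List


def assign_parameters(num_layers: int, num_param_sets: int, strategy: str) -> List[int]:
    """Return, for each layer, the index of the parameter set it uses.

    Assumed reading of the strategy names (not confirmed by the abstract):
      sequence  -> consecutive layers share one parameter set
      cycle     -> the parameter sets repeat in order
      cycle_rev -> like cycle, but the last M layers run in reverse order
    """
    n, m = num_layers, num_param_sets
    if not 1 <= m <= n:
        raise ValueError("need 1 <= num_param_sets <= num_layers")

    if strategy == "sequence":
        # Assumption: each set covers ceil(n / m) consecutive layers.
        block = -(-n // m)
        return [i // block for i in range(n)]
    if strategy == "cycle":
        return [i % m for i in range(n)]
    if strategy == "cycle_rev":
        head = [i % m for i in range(n - m)]      # ordinary cycle
        tail = [m - 1 - j for j in range(m)]      # final M layers, reversed
        return head + tail
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ("sequence", "cycle", "cycle_rev"):
        print(name, assign_parameters(6, 3, name))
```

With M = 1 every strategy collapses to sharing a single layer's parameters across all layers, which is the widely used technique the abstract says the proposed method relaxes.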


On Compositional Generalization of Neural Machine Translation
This paper quantitatively analyzes the effects of various factors using compound translation error rate, then demonstrates that the NMT model fails badly on compositional generalization, although it performs remarkably well under traditional metrics.


Sharing Attention Weights for Fast Transformer
This paper speeds up the Transformer via a fast and lightweight attention model, sharing attention weights in adjacent layers and enabling the efficient re-use of hidden states in a vertical manner.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Introduces BERT, a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Transformers without Tears: Improving the Normalization of Self-Attention
It is shown that pre-norm residual connections (PreNorm) and smaller initializations enable warmup-free, validation-based training with large learning rates, and a single scale parameter (ScaleNorm) is proposed for faster training and better performance.
Universal Transformers
The Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses issues of parallelizability and global receptive field, is proposed.
Learning Deep Transformer Models for Machine Translation
It is claimed that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next.
Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder
This work considers model-level sharing and ties the encoder and decoder of an NMT model in their entirety, yielding a compact model named Tied Transformer; the results demonstrate that such a simple method works well for both similar and dissimilar language pairs.
Attention is All you Need
Proposes the Transformer, a new simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
Direct Output Connection for a High-Rank Language Model
This paper proposes a state-of-the-art recurrent neural network (RNN) language model that combines probability distributions computed not only from a final RNN layer but also from middle layers, and indicates that the proposed method contributes to application tasks: machine translation and headline generation.
Recurrent Stacking of Layers for Compact Neural Machine Translation Models
It is empirically shown that the translation quality of a model that recurrently stacks a single layer 6 times is comparable to the translation quality of a model that stacks 6 separate layers.