GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation

Jian Yang, Yuwei Yin, Shuming Ma, Haoyang Huang, Dongdong Zhang, Furu Wei, Zhoujun Li

Transformer structure, stacked by a sequence of encoder and decoder network layers, has achieved significant progress in neural machine translation. However, the vanilla Transformer mainly exploits the top-layer representation, assuming the lower layers provide trivial or redundant information, and thus ignores the bottom-layer features that are potentially valuable. In this work, we propose the Group-Transformer model (GTrans) that flexibly divides multi-layer representations of both encoder and…

Exploiting Deep Representations for Neural Machine Translation

This work proposes to simultaneously expose all of the top layers of encoder and decoder with layer aggregation and multi-layer attention mechanisms, and introduces an auxiliary regularization term to encourage different layers to capture diverse information.

Layer-Wise Multi-View Learning for Neural Machine Translation

This work proposes layer-wise multi-view learning for neural machine translation, circumventing the need to change the model structure and maintaining the same inference speed as the original model.

Multi-layer Representation Fusion for Neural Machine Translation

This paper proposes a multi-layer representation fusion (MLRF) approach to fuse stacked layers and designs three fusion functions to learn a better representation from the stack, evaluated on German-English translation.
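The core idea of fusing stacked layer outputs can be illustrated with a minimal sketch. This is one plausible fusion function (a learned softmax-weighted sum over layers), not a reproduction of the paper's three fusion functions; the class name and parameterization are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WeightedLayerFusion(nn.Module):
    """Fuse stacked layer outputs with learned scalar weights.

    A hedged sketch: one learnable weight per layer, normalized with
    softmax, so the fused representation is a convex combination of
    all layer outputs rather than only the top layer.
    """
    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable logit per layer; zeros give uniform weights initially.
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs: list) -> torch.Tensor:
        # layer_outputs: list of [batch, seq_len, d_model] tensors, one per layer
        stacked = torch.stack(layer_outputs, dim=0)   # [L, B, T, D]
        w = torch.softmax(self.logits, dim=0)         # [L], sums to 1
        return torch.einsum("l,lbtd->btd", w, stacked)

# Usage: fuse the outputs of 6 stacked encoder layers
fusion = WeightedLayerFusion(num_layers=6)
layers = [torch.randn(2, 5, 8) for _ in range(6)]
fused = fusion(layers)
print(fused.shape)  # torch.Size([2, 5, 8])
```

At initialization the logits are zero, so the fusion starts as a plain average over layers and learns to emphasize informative ones during training.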

Learning Deep Transformer Models for Machine Translation

It is claimed that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next.

Very Deep Transformers for Neural Machine Translation

It is shown that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers that outperform their baseline 6-layer counterparts by as much as 2.5 BLEU.

Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders

A deep encoder with multiple shallow decoders (DEMSD) where each shallow decoder is responsible for a disjoint subset of target languages is proposed and is able to obtain a 1.8x speedup on average compared to a standard transformer model with no drop in translation quality.

Training Deeper Neural Machine Translation Models with Transparent Attention

A simple modification to the attention mechanism is proposed that eases the optimization of deeper models, and results in consistent gains of 0.7-1.1 BLEU on the benchmark WMT'14 English-German and WMT'15 Czech-English tasks.
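The mechanism described above, transparent attention, lets each decoder layer attend to a learned convex combination of all encoder layer outputs instead of only the top one. The sketch below shows one minimal parameterization of that idea, assuming a per-(decoder layer, encoder layer) weight matrix; the class name and exact form are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TransparentAttention(nn.Module):
    """Build one fused encoder memory per decoder layer (a hedged sketch).

    Each decoder layer k gets its own softmax-normalized weights over
    the encoder layers, so gradients flow directly to every encoder
    layer, which is what eases optimization of deeper stacks.
    """
    def __init__(self, num_enc_layers: int, num_dec_layers: int):
        super().__init__()
        # One logit per (decoder layer, encoder layer) pair.
        self.logits = nn.Parameter(torch.zeros(num_dec_layers, num_enc_layers))

    def forward(self, enc_outputs: list) -> list:
        # enc_outputs: list of [batch, seq_len, d_model] encoder layer outputs
        stacked = torch.stack(enc_outputs, dim=0)           # [Le, B, T, D]
        w = torch.softmax(self.logits, dim=-1)              # [Ld, Le]
        fused = torch.einsum("ke,ebtm->kbtm", w, stacked)   # [Ld, B, T, D]
        return list(fused.unbind(dim=0))  # one memory per decoder layer

# Usage: 6 encoder layers feeding 3 decoder layers
ta = TransparentAttention(num_enc_layers=6, num_dec_layers=3)
enc = [torch.randn(2, 4, 8) for _ in range(6)]
memories = ta(enc)
print(len(memories), memories[0].shape)  # 3 torch.Size([2, 4, 8])
```

With zero-initialized logits, every decoder layer starts from a uniform average of encoder layers and specializes its mixture during training.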

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.

Multiscale Collaborative Deep Models for Neural Machine Translation

This paper presents a MultiScale Collaborative (MSC) framework to ease the training of NMT models that are substantially deeper than those used previously and provides empirical evidence showing that the MSC nets are easy to optimize and can obtain improvements of translation quality from considerably increased depth.

Neural Machine Translation with Deep Attention

A deep attention model (DeepAtt) is proposed that is capable of automatically determining what should be passed or suppressed from the corresponding encoder layer so as to make the distributed representation appropriate for high-level attention and translation.