Corpus ID: 221879173

Multi-Pass Transformer for Machine Translation

Peng Gao, Chiori Hori, Shijie Geng, Takaaki Hori, Jonathan Le Roux
In contrast with previous approaches where information flows only towards deeper layers of a stack, we consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of later layers. To maintain a directed acyclic graph structure, the encoder stack of a transformer is repeated along a new multi-pass dimension, keeping the parameters tied, and information is allowed to proceed unidirectionally both towards deeper layers…
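The abstract is cut off before it specifies how the passes are wired together, so the following is only a toy sketch of the general idea: one layer stack reused with tied parameters across several passes, where each pass reads only from the previous pass so the computation graph stays acyclic. The function names (`make_layer`, `multi_pass_encode`), the elementwise toy layer, and the choice to add the previous pass's top output to the input are all assumptions made for illustration, not the authors' design.

```python
import math


def make_layer(weight, bias):
    """Return a toy layer: elementwise tanh(weight * x + bias) plus a residual.

    Hypothetical stand-in for a transformer encoder layer.
    """
    def layer(x):
        return [xi + math.tanh(weight * xi + bias) for xi in x]
    return layer


def multi_pass_encode(x, layers, num_passes):
    """Apply the same layer stack num_passes times (parameters tied).

    Assumption: the top output of pass p-1 is added to the original input
    to form the input of pass p. Each pass only reads from the previous
    pass, so no cycle is introduced.
    """
    top = [0.0] * len(x)  # no feedback is available before the first pass
    outputs = []
    for _ in range(num_passes):
        # Earlier layers see the input "in light of" the previous top output.
        h = [xi + ti for xi, ti in zip(x, top)]
        for layer in layers:  # same layer objects in every pass: tied weights
            h = layer(h)
        top = h
        outputs.append(top)
    return outputs


if __name__ == "__main__":
    stack = [make_layer(0.5, 0.1), make_layer(-0.3, 0.0)]
    for p, out in enumerate(multi_pass_encode([0.2, -1.0, 0.7], stack, 3), 1):
        print(f"pass {p}:", [round(v, 4) for v in out])
```

With `num_passes=1` this reduces to an ordinary single stack, which makes it easy to compare against a standard encoder in the same toy setting.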

Scalable Transformers for Neural Machine Translation
A three-stage training scheme is proposed to tackle the difficulty of training the Scalable Transformers; it introduces additional supervision from word-level and sequence-level self-distillation.
Oriented Object Detection with Transformer
A first attempt to implement Oriented Object DEtection with TRansformer (ODETR) as an end-to-end network; it could serve as a new benchmark in the field of oriented object detection and achieves up to 3.85 mAP improvement over Faster R-CNN and RetinaNet.
RomeBERT: Robust Training of Multi-Exit BERT
BERT has achieved superior performance on Natural Language Understanding (NLU) tasks. However, BERT possesses a large number of parameters and demands certain resources to deploy. For acceleration, …


Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement
This paper proposes to use routing-by-agreement strategies to aggregate layers dynamically and shows that the proposed approach consistently outperforms the strong baseline model and a representative static aggregation model.
Multiscale Collaborative Deep Models for Neural Machine Translation
This paper presents a MultiScale Collaborative (MSC) framework to ease the training of NMT models that are substantially deeper than those used previously and provides empirical evidence showing that the MSC nets are easy to optimize and can obtain improvements of translation quality from considerably increased depth.
Attention is All you Need
A new simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
The Evolved Transformer
The Progressive Dynamic Hurdles method is developed, which allows us to dynamically allocate more resources to more promising candidate models on the computationally expensive WMT 2014 English-German translation task, and demonstrates consistent improvement over the Transformer on four well-established language tasks.
Exploiting Sentential Context for Neural Machine Translation
It is shown that a shallow sentential context, extracted from the top encoder layer only, can improve translation performance by contextualizing the encoding representations of individual words.
Dual Path Networks
This work reveals the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Convolutional Network (DenseNet) within the HORNN framework, and finds that ResNet enables feature re-usage while DenseNet enables exploration of new features, both of which are important for learning good representations.
Convolutional Sequence to Sequence Learning
This work introduces an architecture based entirely on convolutional neural networks, which outperforms the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
Densely Connected Convolutional Networks
The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion and has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.
Neural Architecture Search with Reinforcement Learning
This paper uses a recurrent network to generate the model descriptions of neural networks and trains this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.