On the Copying Behaviors of Pre-Training for Neural Machine Translation

Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, Shuming Shi, Zhaopeng Tu
Previous studies have shown that initializing neural machine translation (NMT) models with pre-trained language models (LMs) can speed up model training and boost model performance. In this work, we identify a critical side-effect of pre-training for NMT, which is due to the discrepancy between the training objectives of LM-based pre-training and NMT. Since the LM objective learns to reconstruct a few source tokens and copy most of them, the pre-training initialization would affect…
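The objective discrepancy described in the abstract can be illustrated with a small sketch. It assumes a simplified masking scheme (a 15% mask ratio and a `[MASK]` symbol, both chosen here for illustration) and a hypothetical `copy_fraction` helper that counts how many target tokens already appear in the source. This is not the paper's own metric or code, only a rough way to see why a denoising target is mostly copied while a translation target is not.

```python
import random

MASK = "[MASK]"  # assumed mask symbol, for illustration only


def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Replace roughly mask_ratio of the tokens with MASK (simplified denoising input)."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def copy_fraction(source, target):
    """Fraction of target tokens that also occur somewhere in the source."""
    if not target:
        return 0.0
    source_vocab = set(source)
    return sum(tok in source_vocab for tok in target) / len(target)


# Denoising pair: the target is the original sentence, the source is its masked copy.
original = "a b c d e f g h i j k l m n o p q r s t".split()
corrupted = mask_tokens(original)
denoising_copy = copy_fraction(corrupted, original)

# Translation pair: the target shares almost no surface tokens with the source.
translation_copy = copy_fraction("the cat sat".split(), "die Katze saß".split())
```

Under these assumptions most of the denoising target can be produced by copying unmasked source tokens, whereas the translation target shares no tokens with its source.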


On the Complementarity between Pre-Training and Back-Translation for Neural Machine Translation
Experimental results show that PT and BT are nicely complementary to each other, establishing state-of-the-art performances on the WMT16 English-Romanian and English-Russian benchmarks.
Improving Similar Language Translation With Transfer Learning
This work investigates transfer learning based on pretrained neural machine translation models to translate between (low-resource) similar languages to find models that rank top 1 in the official shared task evaluation.


Towards Making the Most of BERT in Neural Machine Translation
This work introduces a concerted training framework that is key to integrating pre-trained LMs into neural machine translation (NMT) and consists of three techniques, including asymptotic distillation to ensure that the NMT model can retain the previously pre-trained knowledge.
CSP: Code-Switching Pre-training for Neural Machine Translation
Experimental results show that CSP achieves significant improvements over baselines without pre-training or with other pre-training methods, and relieves the pretrain-finetune discrepancy caused by artificial symbols like [mask].
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Incorporating Copying Mechanism in Sequence-to-Sequence Learning
This paper incorporates copying into neural network-based Seq2Seq learning and proposes a new model called CopyNet with an encoder-decoder structure, which integrates the regular way of word generation in the decoder with a new copying mechanism that can choose sub-sequences in the input sequence and put them at proper places in the output sequence.
Neural Machine Translation by Jointly Learning to Align and Translate
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Multilingual Denoising Pre-training for Neural Machine Translation
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a…
Analyzing Uncertainty in Neural Machine Translation
This study proposes tools and metrics to assess how uncertainty in the data is captured by the model distribution and how it affects search strategies that generate translations and shows that search works remarkably well but that models tend to spread too much probability mass over the hypothesis space.
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
Attention is All you Need
A new simple network architecture, the Transformer, is proposed, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Incorporating BERT into Neural Machine Translation
A new algorithm named BERT-fused model is proposed, in which BERT is first used to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms.