On the Copying Behaviors of Pre-Training for Neural Machine Translation

Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, Shuming Shi, Zhaopeng Tu
Previous studies have shown that initializing neural machine translation (NMT) models with pre-trained language models (LMs) can speed up training and boost performance. In this work, we identify a critical side effect of pre-training for NMT, which stems from the discrepancy between the training objectives of LM-based pre-training and NMT. Since the LM objective learns to reconstruct only a few source tokens and to copy most of them, the pre-training initialization would affect…
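As a toy illustration of the objective mismatch described above (this sketch is not code from the paper, and the deterministic masking rule is a simplifying assumption standing in for BERT-style random 15% masking), note that in a denoising-LM example the target equals the source everywhere except at the few masked positions, so most supervised positions are pure identity copies:

```python
def mask_tokens(tokens, mask_every=7):
    """Mask every `mask_every`-th token (a deterministic stand-in for
    BERT-style random masking); returns (masked input, target).

    The target is the original sentence, so every unmasked position
    is supervised as an identity copy of the input token.
    """
    masked = ["[MASK]" if i % mask_every == 0 else t
              for i, t in enumerate(tokens)]
    return masked, list(tokens)

src = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = mask_tokens(src)

# Count positions where predicting the target is just copying the input.
copies = sum(a == b for a, b in zip(inp, tgt))
print(f"{copies}/{len(src)} supervised positions are identity copies")
# → 7/9 supervised positions are identity copies
```

A translation objective, by contrast, has essentially no such identity positions, which is the discrepancy the abstract points to.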

On the Complementarity between Pre-Training and Back-Translation for Neural Machine Translation
Experimental results show that pre-training (PT) and back-translation (BT) complement each other nicely, establishing state-of-the-art performance on the WMT16 English-Romanian and English-Russian benchmarks.
Tencent Translation System for the WMT21 News Translation Task
Longyue Wang, Mu Li, +6 authors, Wen Zhang (2021)
The Tencent translation systems for the WMT21 shared task combine different data-augmentation methods, including back-translation, forward-translation, and right-to-left training, to enlarge the training data, and propose a fine-grained "one model, one domain" approach to model the characteristics of different news genres at the fine-tuning and decoding stages.
Improving Similar Language Translation With Transfer Learning
This work investigates transfer learning based on pre-trained neural machine translation models to translate between (low-resource) similar languages to create models for French-Bambara and Portuguese-Spanish pairs.


Towards Making the Most of BERT in Neural Machine Translation
This work introduces a concerted training framework that integrates pre-trained LMs into neural machine translation (NMT) and consists of three techniques, among them asymptotic distillation, which ensures that the NMT model retains the pre-trained knowledge.
CSP: Code-Switching Pre-training for Neural Machine Translation
Experimental results show that CSP achieves significant improvements over baselines without pre-training or with other pre-training methods, and relieves the pretrain-finetune discrepancy caused by artificial symbols like [mask].
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
GNMT, Google's Neural Machine Translation system, is presented; it attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Incorporating Copying Mechanism in Sequence-to-Sequence Learning
This paper incorporates copying into neural-network-based Seq2Seq learning and proposes a new encoder-decoder model called CopyNet, which integrates the regular way of word generation in the decoder with a copying mechanism that selects sub-sequences of the input sequence and places them at appropriate positions in the output sequence.
Neural Machine Translation by Jointly Learning to Align and Translate
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend it by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Multilingual Denoising Pre-training for Neural Machine Translation
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART…
Analyzing Uncertainty in Neural Machine Translation
This study proposes tools and metrics to assess how uncertainty in the data is captured by the model distribution and how it affects the search strategies that generate translations, and shows that search works remarkably well, but that models tend to spread too much probability mass over the hypothesis space.
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised, which relies only on monolingual data, and one supervised, which leverages parallel data with a new cross-lingual language model objective.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
Incorporating BERT into Neural Machine Translation
A new algorithm named BERT-fused model is proposed, in which BERT is first used to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms.