Corpus ID: 146808476

MASS: Masked Sequence to Sequence Pre-training for Language Generation

@inproceedings{song2019mass,
  title={MASS: Masked Sequence to Sequence Pre-training for Language Generation},
  author={Kaitao Song and Xu Tan and Tao Qin and Jianfeng Lu and Tie-Yan Liu},
  booktitle={International Conference on Machine Learning},
  year={2019}
}
Pre-training and fine-tuning, e.g., BERT, have achieved great success in language understanding by transferring knowledge from rich-resource pre-training tasks to low/zero-resource downstream tasks. MASS adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence: its encoder takes a sentence with a randomly masked fragment (several consecutive tokens) as input, and its decoder tries to predict this masked fragment.
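The masking scheme described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation; the 50% fragment fraction and the `[MASK]` symbol follow the paper's setup, and `mass_mask` is a hypothetical helper name:

```python
import random

MASK = "[MASK]"

def mass_mask(tokens, frac=0.5, seed=None):
    """Mask one random consecutive fragment covering ~frac of the sentence.

    Returns (encoder_input, decoder_target): the encoder sees the sentence
    with the fragment replaced by [MASK] tokens, and the decoder is trained
    to predict exactly that fragment.
    """
    rng = random.Random(seed)
    n = len(tokens)
    k = max(1, int(n * frac))          # fragment length
    start = rng.randrange(n - k + 1)   # fragment start position
    encoder_input = tokens[:start] + [MASK] * k + tokens[start + k:]
    decoder_target = tokens[start:start + k]
    return encoder_input, decoder_target

# For a 6-token sentence with frac=0.5, 3 consecutive tokens are masked.
enc, dec = mass_mask("the cat sat on the mat".split(), seed=0)
```

Splicing `decoder_target` back over the masked positions recovers the original sentence, which is what makes the objective a reconstruction task.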


ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation

An enhanced multi-flow sequence to sequence pre-training and fine-tuning framework named ERNIE-GEN, which bridges the discrepancy between training and inference with an infilling generation mechanism and a noise-aware generation method to make generation closer to human writing patterns.

PRINCE: Prefix-Masked Decoding for Knowledge Enhanced Sequence-to-Sequence Pre-Training

A simple yet effective pre-training paradigm, equipped with a knowledge-enhanced decoder that predicts the next entity token with noises in the prefix, explicitly strengthening the representation learning of entities that span over multiple input tokens.

Cross-Lingual Natural Language Generation via Pre-Training

Experimental results on question generation and abstractive summarization show that the model outperforms the machine-translation-based pipeline methods for zero-shot cross-lingual generation and improves NLG performance of low-resource languages by leveraging rich-resource language data.

Distilling Knowledge Learned in BERT for Text Generation

A novel approach, Conditional Masked Language Modeling (C-MLM), is presented to enable the finetuning of BERT on target generation tasks, which significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization.

SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations

Three pre-training tasks are introduced that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstream tasks.

Deep Fusing Pre-trained Models into Neural Machine Translation

A novel framework is proposed that deeply fuses pre-trained representations into NMT, fully exploring the potential of PTMs in NMT; it outperforms previous work on both autoregressive and non-autoregressive NMT models.

Unified Language Model Pre-training for Natural Language Understanding and Generation

A new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks; it compares favorably with BERT on the GLUE benchmark and on the SQuAD 2.0 and CoQA question answering tasks.

DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders

DeltaLM (∆LM) is introduced, a pretrained multilingual encoder-decoder model that regards the decoder as the task layer of off-the-shelf pretrained encoders and outperforms various strong baselines on both natural language generation and translation tasks.

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.

MPNet: Masked and Permuted Pre-training for Language Understanding

This paper proposes MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
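For contrast with MASS's consecutive-fragment masking, BERT's masked language modeling selects positions independently at random across the sentence. A minimal sketch follows (illustrative only; it omits BERT's 80/10/10 mask/random/keep replacement rule, and `bert_mask` is a hypothetical helper name):

```python
import random

MASK = "[MASK]"

def bert_mask(tokens, prob=0.15, seed=None):
    """Mask ~prob of positions, chosen independently at random.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the encoder must predict there.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < prob:
            masked.append(MASK)
            targets[i] = tok          # position -> token to predict
        else:
            masked.append(tok)
    return masked, targets
```

Because the masked positions are scattered rather than consecutive, the predictions are made by the encoder alone, with no decoder-side reconstruction of a fragment as in MASS.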

Transfer Learning for Low-Resource Neural Machine Translation

A transfer learning method is presented that significantly improves BLEU scores across a range of low-resource languages by first training a high-resource language pair, then transferring some of the learned parameters to the low-resource pair to initialize and constrain training.

Unsupervised Pretraining for Sequence to Sequence Learning

This work presents a general unsupervised learning method that improves the accuracy of sequence to sequence (seq2seq) models by initializing the weights of the encoder and decoder with the pretrained weights of two language models, then fine-tuning with labeled data.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

MaskGAN: Better Text Generation via Filling in the ______

This work introduces an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context and shows qualitative and quantitative evidence that this produces more realistic conditional and unconditional text samples than a maximum likelihood trained model.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
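The Transformer's core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A compact pure-Python sketch (illustrative and unbatched, not the reference implementation; real systems use batched matrix operations):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:                                  # one output row per query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)                      # attention weights over keys
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])  # weighted sum of values
    return out

# Two identical keys receive equal weight, so the output averages the values.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]],
                   [[2.0, 0.0], [0.0, 2.0]])
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions of vanishing gradient.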

Neural Machine Translation by Jointly Learning to Align and Translate

It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

Unsupervised Neural Machine Translation with Weight Sharing

This work introduces an extension by utilizing two independent encoders but sharing some partial weights which are responsible for extracting high-level representations of the input sentences, which achieves significant improvements on English-German, English-French and Chinese-to-English translation tasks.

Phrase-Based & Neural Unsupervised Machine Translation

This work investigates how to learn to translate when having access to only large monolingual corpora in each language, and proposes two model variants, a neural and a phrase-based model, which are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters.

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.