BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, Luke Zettlemoyer
Annual Meeting of the Association for Computational Linguistics

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and other recent pretraining schemes. We evaluate a number of… 
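
The two-step recipe in the abstract (corrupt the text, then learn to reconstruct it) can be illustrated by how a single training pair might be built. The following is only a minimal sketch under stated assumptions: the abstract says "an arbitrary noising function" without naming one here, so random token masking and random token deletion are used purely as illustrations, and the names `MASK`, `mask_tokens`, `delete_tokens`, and `make_denoising_pair` are hypothetical. Whitespace splitting stands in for a real tokenizer.

```python
import random

MASK = "<mask>"  # hypothetical placeholder symbol, not taken from the source


def mask_tokens(tokens, prob, rng):
    """Replace each token with MASK independently with probability `prob`."""
    return [MASK if rng.random() < prob else tok for tok in tokens]


def delete_tokens(tokens, prob, rng):
    """Drop each token independently with probability `prob`."""
    return [tok for tok in tokens if rng.random() >= prob]


def make_denoising_pair(text, noise_fn, prob=0.3, seed=0):
    """Build one (corrupted input, original target) pair.

    Step (1): corrupt the whitespace-tokenized text with `noise_fn`.
    Step (2) is the model's job: given the corrupted tokens as encoder
    input, it is trained to generate the original tokens.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    original = text.split()
    corrupted = noise_fn(original, prob, rng)
    return corrupted, original


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog"
    for noise_fn in (mask_tokens, delete_tokens):
        source, target = make_denoising_pair(sentence, noise_fn)
        print(noise_fn.__name__)
        print("  encoder input :", " ".join(source))
        print("  decoder target:", " ".join(target))
```

Any function that maps a token list to a corrupted token list can be passed as `noise_fn`, which is the sense in which the noising step is arbitrary; the reconstruction target is always the uncorrupted sequence.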

Rethinking Denoised Auto-Encoding in Language Pre-Training

The proposed ContrAstive Pre-Training (CAPT) encourages the consistency between representations of the original sequence and its corrupted version via unsupervised instance-wise training signals, and aids the pre-trained model in better capturing global semantics of the input via more effective sentence-level supervision.

Pre-training via Paraphrasing

It is shown that fine-tuning gives strong performance on a range of discriminative and generative tasks in many languages, making MARGE the most generally applicable pre-training method to date.

CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations

Comprehensive empirical evidence on 11 natural language understanding and cross-modal tasks illustrates that CAPT is applicable for both language and vision-language tasks, and obtains surprisingly consistent improvement, including 0.6% absolute gain on GLUE benchmarks and 0.8% absolute increment on NLVR.

Sequence-to-Sequence Lexical Normalization with Multilingual Transformers

This work proposes a sentence-level sequence-to-sequence model based on mBART, which frames the problem as a machine translation problem, and improves performance on extrinsic, downstream tasks through normalization compared to models operating on raw, unprocessed, social media text.

DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization

This work proposes to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model, and presents the performance-efficiency trade-off for generative tasks using pre-trained models.

Universal Conditional Masked Language Pre-training for Neural Machine Translation

This paper proposes CeMAT, a conditional masked language model pre-trained on large-scale bilingual and monolingual corpora in many languages, and is the first work to pre-train a unified model for fine-tuning on both autoregressive and non-autoregressive NMT tasks.

BARThez: a Skilled Pretrained French Sequence-to-Sequence Model

BARThez is introduced, the first large-scale pretrained seq2seq model for French, which is particularly well-suited for generative tasks and is shown to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT.

Understanding and Improving Sequence-to-Sequence Pretraining for Neural Machine Translation

Experimental results show that the in-domain pretraining and input adaptation approach can consistently improve both translation performance and model robustness upon Seq2Seq pretraining.

Text Simplification by Tagging

TST is presented, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds.

ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation

An enhanced multi-flow sequence to sequence pre-training and fine-tuning framework named ERNIE-GEN, which bridges the discrepancy between training and inference with an infilling generation mechanism and a noise-aware generation method to make generation closer to human writing patterns.



MASS: Masked Sequence to Sequence Pre-training for Language Generation

This work proposes MAsked Sequence to Sequence pre-training (MASS) for the encoder-decoder based language generation tasks, which achieves the state-of-the-art accuracy on the unsupervised English-French translation, even beating the early attention-based supervised model.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Unified Language Model Pre-training for Natural Language Understanding and Generation

A new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks that compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks.

Text Summarization with Pretrained Encoders

This paper introduces a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences and proposes a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two.

Attention is All you Need

A new, simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Pre-trained language model representations for language generation

This paper examines different strategies to integrate pre-trained representations into sequence-to-sequence models, applies them to neural machine translation and abstractive summarization, and finds that pre-trained representations are most effective when added to the encoder network, which slows inference by only 14%.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Get To The Point: Summarization with Pointer-Generator Networks

A novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways, using a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.