On the Role of Bidirectionality in Language Model Pre-Training

Mikel Artetxe, Jingfei Du, Naman Goyal, Luke Zettlemoyer, Ves Stoyanov

Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully…
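The distinction the study above draws, causal next-token prediction versus bidirectional conditioning, with prefix (semi-causal) modeling in between, comes down to the attention mask. A minimal NumPy sketch of the three mask shapes, with illustrative function names (not from any of the papers listed):

```python
import numpy as np

def causal_mask(n):
    # Token i may attend only to positions <= i (left-to-right decoding).
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    # Every token attends to every position (BERT-style encoding).
    return np.ones((n, n), dtype=bool)

def prefix_lm_mask(n, prefix_len):
    # Prefix tokens attend bidirectionally among themselves;
    # the remaining tokens are generated causally.
    m = causal_mask(n)
    m[:prefix_len, :prefix_len] = True
    return m
```

Applying `prefix_lm_mask(5, 3)` gives a mask whose first three rows and columns form a full block while the tail keeps the lower-triangular causal pattern.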

Language Models are General-Purpose Interfaces

This work proposes a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders, subsuming the advantages and capabilities of both causal and non-causal modeling, thereby combining the best of both worlds.

Efficient Training of Language Models to Fill in the Middle

There is extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales.
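The fill-in-the-middle transform described above is a data-level rearrangement: split a document into (prefix, middle, suffix) and move the middle to the end so a left-to-right model learns to generate it conditioned on both surrounding spans. A sketch of that transform; the sentinel strings are illustrative placeholders, not the paper's actual special tokens:

```python
import random

# Illustrative sentinels; the paper uses dedicated vocabulary tokens instead.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(doc, rng=random):
    # Pick two random split points, yielding (prefix, middle, suffix),
    # then emit prefix and suffix first so the middle is generated last.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```

In practice this transform is applied to a fraction of the training documents, leaving the rest as ordinary left-to-right text.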

FCM: Forgetful Causal Masking Makes Causal Language Models Better Zero-Shot Learners

Experimental results show that the proposed technique improves PaLM’s zero and few-shot performance on a diverse suite of tasks, including commonsense reasoning, natural language inference and cloze completion, and also helps representation learning.
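Forgetful causal masking keeps the standard left-to-right objective but randomly forbids attention to some past tokens during training. A simplified per-position sketch under assumed granularity (the function name and masking scheme here are illustrative, not PaLM's actual implementation):

```python
import numpy as np

def fcm_mask(n, mask_ratio, rng):
    # Start from a standard causal mask, then randomly drop attention
    # to a fraction of strictly-past positions (never the token itself),
    # forcing the model not to over-rely on any single context token.
    causal = np.tril(np.ones((n, n), dtype=bool))
    past = np.tril(np.ones((n, n), dtype=bool), k=-1)
    drop = past & (rng.random((n, n)) < mask_ratio)
    return causal & ~drop
```

At evaluation time the full causal mask is used; the random dropping applies only during training.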

Nonparametric Masked Language Modeling

It is shown that NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval, and is particularly effective at handling rare patterns (word senses or facts) and at predicting rare or nearly unseen words.

Politics as Usual? Measuring Populism, Nationalism, and Authoritarianism in U.S. Presidential Campaigns (1952–2020) with Neural Language Models

Radical-right campaigns commonly employ three discursive elements: anti-elite populism, exclusionary and declinist nationalism, and authoritarianism. Recent scholarship has explored whether these

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
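BERT's masked-language-modeling corruption is commonly described as: select roughly 15% of input positions; of those, replace 80% with a mask token, 10% with a random token, and leave 10% unchanged, then train the model to recover the original token at every selected position. A sketch of that corruption step (function and variable names are illustrative):

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, p=0.15, rng=random):
    # Returns the corrupted sequence and a map from selected positions
    # to their original tokens (the prediction targets).
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK        # 80%: mask token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: token left unchanged, still predicted
    return corrupted, targets
```

Because some selected tokens are left unchanged or randomized rather than masked, the pre-training input distribution stays closer to what the model sees at fine-tuning time, where no mask token appears.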

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

A large-scale evaluation of modeling choices and their impact on zero-shot generalization of large pretrained Transformer language models focuses on text-to-text models and shows that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely self-supervised pretraining.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

A unified framework named ERNIE 3.0 is proposed for pre-training large-scale knowledge enhanced models that fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks with zero-shot learning, few-shot learning or fine-tuning.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.

Efficient Large Scale Language Modeling with Mixtures of Experts

This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning.
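A mixture-of-experts layer of the kind studied above routes each token through a small subset of expert networks chosen by a learned gate, so total parameters grow while per-token compute stays roughly constant. A minimal top-k routing sketch; all names are illustrative and this is not the paper's implementation:

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    # x: (tokens, dim); experts: list of callables mapping a vector
    # to a vector; gate_w: (dim, n_experts) learned gating weights.
    scores = x @ gate_w                          # gate logits per token
    top = np.argsort(scores, axis=-1)[:, -k:]    # top-k expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = top[t]
        w = np.exp(scores[t, sel])
        w /= w.sum()                             # renormalized gate probs
        for e, wi in zip(sel, w):                # weighted expert mixture
            out[t] += wi * experts[e](x[t])
    return out
```

Real systems add load-balancing losses and capacity limits so tokens spread evenly across experts; those are omitted here for brevity.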

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning

This work proposes a method that incorporates large-scale distributed training performance into model architecture design, achieving excellent performance on thousands of GPUs during training and state-of-the-art results on NLP tasks.

PaLM: Scaling Language Modeling with Pathways

A 540-billion-parameter, densely activated Transformer language model called PaLM achieves breakthrough performance, outperforming the state of the art on a suite of multi-step reasoning tasks and surpassing average human performance on the recently released BIG-bench benchmark.