• Corpus ID: 195069387

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Venue: Neural Information Processing Systems

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. […] In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its…
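
The phrase "expected likelihood over all permutations of the factorization order" can be illustrated with a small sketch. The truncated abstract does not describe an implementation, so everything below is an assumption for illustration: the helper names `permutation_mask`, `expected_log_likelihood`, and `uniform_log_prob` are hypothetical, the model is a toy uniform distribution, and all orders are enumerated exhaustively, which is only feasible for very short sequences.

```python
import itertools
import math


def permutation_mask(order):
    """Return mask[i][j] == True when position i may condition on position j,
    i.e. when j comes strictly earlier than i in the factorization order."""
    n = len(order)
    rank = {pos: step for step, pos in enumerate(order)}
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]


def expected_log_likelihood(log_prob, seq):
    """Average over every factorization order z of
    sum_t log p(x[z_t] | x[z_1], ..., x[z_{t-1}])."""
    n = len(seq)
    total, count = 0.0, 0
    for order in itertools.permutations(range(n)):
        for t, pos in enumerate(order):
            # Context: tokens already predicted under this order, keyed by position.
            context = {p: seq[p] for p in order[:t]}
            total += log_prob(pos, seq[pos], context)
        count += 1
    return total / count


def uniform_log_prob(pos, token, context, vocab_size=4):
    """Toy stand-in for a model: ignores the context entirely (assumption)."""
    return -math.log(vocab_size)


if __name__ == "__main__":
    # Under the order (2, 0, 1): position 2 sees nothing, position 0 sees
    # position 2, and position 1 sees positions 2 and 0.
    for row in permutation_mask([2, 0, 1]):
        print(row)
    print(expected_log_likelihood(uniform_log_prob, ["a", "b", "c"]))
```

Because every position appears both early and late across different orders, each token is eventually predicted from context on both sides, which is how an autoregressive objective can capture bidirectional context. A practical system would sample orders rather than enumerate all of them.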

VarMAE: Pre-training of Variational Masked Autoencoder for Domain-adaptive Language Understanding

Experiments on science- and finance-domain NLU tasks demonstrate that VarMAE can be efficiently adapted to new domains with limited resources.

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

Inspired by the linearization exploration work of Elman, BERT is extended to a new model, StructBERT, by incorporating language structures into pre-training, and the new model is adapted to different levels of language understanding required by downstream tasks.

UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

The experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks across several widely used benchmarks.

Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space

This paper proposes the first large-scale language VAE model, Optimus, a universal latent embedding space for sentences that is first pre-trained on a large text corpus and then fine-tuned for various language generation and understanding tasks.

MPNet: Masked and Permuted Pre-training for Language Understanding

This paper proposes MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods.

Pretraining Deep Learning Models for Natural Language Understanding

This project conducted research to fully understand BERT and XLNet and applied their pretrained models to two language tasks: reading comprehension and part-of-speech tagging.

Transformers as Neural Augmentors: Class Conditional Sentence Generation via Variational Bayes

A neural data augmentation method that combines a Conditional Variational Autoencoder with an encoder-decoder Transformer model and improves the performance of current models, compared to other data augmentation techniques, with a small amount of computation power.

Instance Regularization for Discriminative Language Model Pre-training

This work proposes to estimate the complexity of restoring the original sentences from corrupted ones in language model pre-training by estimating the corruption degree in the ennoising data construction process and the prediction confidence in the denoising counterpart.

Pseudolikelihood Reranking with Masked Language Models

These log-pseudolikelihood scores (LPLs) can outperform large autoregressive language models (GPT-2) in out-of-the-box scoring, suggesting that LPLs capture sentence fluency better than autoregressive scores.

Diformer: Directional Transformer for Neural Machine Translation

The Directional Transformer (Diformer) is proposed by jointly modelling AR and NAR into three generation directions with a newly introduced direction variable, which works by controlling the prediction of each token to have specific dependencies under that direction.



BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

MADE: Masked Autoencoder for Distribution Estimation

This work introduces a simple modification for autoencoder neural networks that yields powerful generative models and proves that this approach is competitive with state-of-the-art tractable distribution estimators.

Character-Level Language Modeling with Deeper Self-Attention

This paper shows that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8.

Multi-Task Deep Neural Networks for Natural Language Understanding

A Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks that allows domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations.

Revisiting LSTM Networks for Semi-Supervised Text Classification via Mixed Objective Function

This paper develops a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results compared with more complex approaches, and shows the generality of the mixed objective function by improving performance on the relation extraction task.

MaskGAN: Better Text Generation via Filling in the ______

This work introduces an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context and shows, qualitatively and quantitatively, evidence that this produces more realistic conditional and unconditional text samples compared to a maximum likelihood trained model.

Learned in Translation: Contextualized Word Vectors

Adding context vectors to a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks.

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

It is shown that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck, and a simple and effective method is proposed to address this issue.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.