Corpus ID: 198953378

RoBERTa: A Robustly Optimized BERT Pretraining Approach

@article{Liu2019RoBERTaAR,
  title={RoBERTa: A Robustly Optimized BERT Pretraining Approach},
  author={Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov},
  journal={ArXiv},
  year={2019},
  volume={abs/1907.11692}
}
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. […] These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
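
The closing line notes that the trained models and code are released. As a minimal sketch of using a released checkpoint for feature extraction, assuming the Hugging Face transformers package and the publicly hosted "roberta-base" weights (the official release itself is built on fairseq):

from transformers import AutoTokenizer, AutoModel

# Load the pretrained RoBERTa-base tokenizer and encoder
# (weights are downloaded on first use).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# Encode a sentence and extract contextual token representations.
inputs = tokenizer("Language model pretraining has led to significant performance gains.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])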

Citations

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks
TLDR
It is shown that as the number of training examples grows into the millions, the accuracy gap between fine-tuning a BERT-based model and training a vanilla LSTM from scratch narrows to within 1%.
METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals
TLDR
This work conducts a comprehensive empirical study, and proposes a recipe, namely “Model generated dEnoising TRaining Objective” (METRO), which incorporates some of the best modeling techniques developed recently to speed up, stabilize, and enhance pretrained language models without compromising model effectiveness.
AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models
TLDR
This paper carefully designs one-shot learning techniques and the search space to provide an adaptive and efficient way of developing tiny PLMs for various latency constraints, and proposes a development method that is even faster than developing a single PLM.
bert2BERT: Towards Reusable Pretrained Language Models
In recent years, researchers have tended to pre-train ever-larger language models to explore the upper limit of deep models. However, pre-training large language models costs intensive computational resources.
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
TLDR
This work proposes a simple but effective method, DeeBERT, to accelerate BERT inference, which allows samples to exit earlier without passing through the entire model, and provides new ideas to efficiently apply deep transformer-based models to downstream tasks.
FastBERT: a Self-distilling BERT with Adaptive Inference Time
TLDR
A novel speed-tunable FastBERT with adaptive inference time that can run 1 to 12 times faster than BERT, depending on the chosen speedup threshold, trading off speed against performance.
Meta-Learning to Improve Pre-Training
TLDR
An efficient, gradient-based algorithm to meta-learn pre-training (PT) hyperparameters by combining implicit differentiation and backpropagation through unrolled optimization is proposed, and it is demonstrated that the method improves predictive performance on two real-world domains.
Debiasing Pre-Trained Language Models via Efficient Fine-Tuning
An explosion in the popularity of transformer-based language models (such as GPT-3, BERT, RoBERTa, and ALBERT) has opened the doors to new machine learning applications involving language modeling.
Poor Man's BERT: Smaller and Faster Transformer Models
TLDR
A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.

References

Showing 1-10 of 58 references
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
TLDR
A new benchmark styled after GLUE is presented, comprising a set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
Cloze-driven Pretraining of Self-attention Networks
TLDR
A new approach for pretraining a bi-directional transformer model with a cloze-style word reconstruction task that provides significant performance gains across a variety of language understanding problems, along with a detailed analysis of the factors that contribute to effective pretraining.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Reducing BERT Pre-Training Time from 3 Days to 76 Minutes
TLDR
The LAMB optimizer is proposed, which helps to scale the batch size to 65536 without losing accuracy, and is a general optimizer that works for both small and large batch sizes and does not need hyper-parameter tuning besides the learning rate.
SpanBERT: Improving Pre-training by Representing and Predicting Spans
TLDR
The approach extends BERT by masking contiguous random spans, rather than random tokens, and training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it (a simplified sketch of span masking follows this reference list).
Automatic differentiation in PyTorch
TLDR
An automatic differentiation module of PyTorch is described: a library designed to enable rapid research on machine learning models, performing differentiation of purely imperative programs with a focus on extensibility and low overhead.
Scaling Neural Machine Translation
TLDR
This paper shows that reduced precision and large batch training can speedup training by nearly 5x on a single 8-GPU machine with careful tuning and implementation.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework (a minimal sketch of the update rule follows this reference list).
A Surprisingly Robust Trick for the Winograd Schema Challenge
TLDR
This paper shows that the performance of three language models on WSC273 strongly improves when they are fine-tuned on a similar pronoun disambiguation problem dataset (denoted WSCR), and it generates a large unsupervised WSC-like dataset.
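
To make the span-masking idea from the SpanBERT entry above concrete, here is a simplified sketch. It assumes a hypothetical helper mask_spans, a 15% mask budget, and uniformly sampled span lengths of at most 3 tokens; the paper's actual recipe samples span lengths from a geometric distribution and adds a span-boundary objective on top of the masking shown here.

import random

def mask_spans(tokens, mask_token="<mask>", budget=0.15, max_span=3):
    # Simplified span masking: repeatedly mask short contiguous spans of
    # tokens (rather than independent tokens) until roughly `budget` of
    # the sequence has been masked.
    tokens = list(tokens)
    target = max(1, int(len(tokens) * budget))
    masked = 0
    while masked < target:
        span_len = random.randint(1, max_span)
        start = random.randrange(len(tokens))
        for i in range(start, min(start + span_len, len(tokens))):
            if tokens[i] != mask_token and masked < target:
                tokens[i] = mask_token
                masked += 1
    return tokens

print(mask_spans("the quick brown fox jumps over the lazy dog".split()))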
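
Similarly, a minimal NumPy sketch of the Adam update referenced above, applied to a single parameter vector (adam_step is a hypothetical helper; the default hyperparameters are the values suggested in the paper):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient (first moment) and the
    # squared gradient (second moment), with bias correction for step t >= 1.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = x^2 starting from x = 5 (a larger step size
# than the default is passed so the toy problem converges quickly).
theta = np.array([5.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 501):
    grad = 2 * theta                     # gradient of f(x) = x^2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)                             # approaches 0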