Corpus ID: 198953378

RoBERTa: A Robustly Optimized BERT Pretraining Approach

@article{Liu2019RoBERTaAR,
  title={RoBERTa: A Robustly Optimized BERT Pretraining Approach},
  author={Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov},
  journal={ArXiv},
  year={2019},
  volume={abs/1907.11692}
}
Language model pretraining has led to significant performance gains, but careful comparison between different approaches is challenging. [...] Key Result: These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
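For concreteness, here is a minimal sketch of one of ALBERT's two parameter-reduction techniques, cross-layer parameter sharing (the other being factorized embedding parameterization); the module below is an illustrative toy, not the released implementation, and its sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Toy encoder that reuses one Transformer layer at every depth (cross-layer sharing)."""

    def __init__(self, hidden=256, heads=4, num_layers=12):
        super().__init__()
        # A single layer's parameters are applied repeatedly, so the
        # parameter count does not grow with num_layers.
        self.layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)  # same weights at each step
        return x

x = torch.randn(2, 16, 256)           # (batch, seq_len, hidden)
print(SharedLayerEncoder()(x).shape)  # torch.Size([2, 16, 256])
```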
To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks
TLDR
It is shown that as the number of training examples grows into the millions, the accuracy gap between fine-tuning a BERT-based model and training a vanilla LSTM from scratch narrows to within 1%.
AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models
TLDR
This paper carefully designs one-shot learning techniques and the search space to provide an adaptive and efficient way of developing tiny PLMs for various latency constraints, and proposes a more efficient development method that is even faster than developing a single PLM.
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
TLDR
This work proposes a simple but effective method, DeeBERT, to accelerate BERT inference by allowing samples to exit earlier without passing through the entire model, and provides new ideas for efficiently applying deep transformer-based models to downstream tasks.
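A minimal sketch of the early-exit idea, assuming an entropy threshold as the exit criterion (as in DeeBERT); the layer sizes, names, and threshold below are illustrative and not the authors' code.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Toy early-exit encoder: an 'off-ramp' classifier after every layer."""

    def __init__(self, hidden=256, heads=4, num_layers=6, num_classes=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.exits = nn.ModuleList(nn.Linear(hidden, num_classes) for _ in range(num_layers))

    def forward(self, x, entropy_threshold=0.1):
        for i, (layer, exit_head) in enumerate(zip(self.layers, self.exits)):
            x = layer(x)
            logits = exit_head(x[:, 0])                       # classify from the first token
            probs = torch.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
            if entropy < entropy_threshold:                   # confident enough: stop early
                return logits, i + 1
        return logits, len(self.layers)                       # fell through all layers

model = EarlyExitEncoder()
logits, layers_used = model(torch.randn(1, 16, 256))
print(layers_used)
```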
bert2BERT: Towards Reusable Pretrained Language Models
Cheng Chen, Yichun Yin, +7 authors, Qun Liu · Computer Science · 2021
In recent years, researchers tend to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training costs intensive computational resources. [...]
Benchmarking down-scaled (not so large) pre-trained language models
TLDR
This work pre-trains down-scaled versions of several popular Transformer-based architectures on a common pre-training corpus and benchmarks them on a subset of the GLUE tasks, finding that additional compute should mainly be allocated to increased model size, while training for more steps is inefficient.
FastBERT: a Self-distilling BERT with Adaptive Inference Time
TLDR
A novel speed-tunable FastBERT with adaptive inference time is presented, which achieves speedups ranging from 1x to 12x over BERT under different speedup thresholds, trading speed against performance.
Poor Man's BERT: Smaller and Faster Transformer Models
TLDR
A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.
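One of the explored strategies is dropping the top encoder layers before fine-tuning; the sketch below assumes the Hugging Face transformers package and illustrates the idea rather than reproducing the paper's exact recipe.

```python
# Top-layer dropping: keep only the bottom `keep` encoder layers, then fine-tune.
# Attribute names follow Hugging Face's BertModel; no pre-training from scratch is needed.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")  # 12 encoder layers

keep = 8  # drop the top 4 layers (roughly a third of the encoder parameters)
model.encoder.layer = model.encoder.layer[:keep]
model.config.num_hidden_layers = keep

# The truncated model is fine-tuned directly on the downstream task.
print(sum(p.numel() for p in model.parameters()))
```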
Improving BERT with Self-Supervised Attention
TLDR
This paper proposes a novel technique, called Self-Supervised Attention (SSA), which automatically generates weak, token-level attention labels iteratively by "probing" the fine-tuned model from the previous iteration.
MML: Maximal Multiverse Learning for Robust Fine-Tuning of Language Models
TLDR
This work presents a method that leverages BERT's fine-tuning phase to its fullest by applying an extensive number of parallel classifier heads, which are enforced to be orthogonal, while adaptively eliminating the weaker heads during training.
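A rough sketch of the parallel-heads idea with a hypothetical orthogonality penalty over head weights; the head count, loss weighting, and the adaptive elimination of weak heads (omitted here) are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class MultiHeadClassifier(nn.Module):
    """Toy sketch: many parallel classifier heads trained with an orthogonality penalty."""

    def __init__(self, hidden=768, num_classes=2, num_heads=16):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, num_classes) for _ in range(num_heads))

    def forward(self, pooled):
        # Each head makes its own prediction from the shared [CLS] representation.
        return torch.stack([h(pooled) for h in self.heads], dim=1)  # (batch, heads, classes)

    def orthogonality_penalty(self):
        # Push head weight vectors toward pairwise orthogonality: penalize
        # off-diagonal entries of the normalized Gram matrix.
        w = torch.stack([h.weight.flatten() for h in self.heads])
        w = nn.functional.normalize(w, dim=-1)
        off_diag = w @ w.t() - torch.eye(len(self.heads))
        return (off_diag ** 2).sum()

clf = MultiHeadClassifier()
logits = clf(torch.randn(4, 768))
# Placeholder task loss for illustration; the 0.01 weighting is arbitrary.
loss = logits.mean() + 0.01 * clf.orthogonality_penalty()
```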

References

Showing 1-10 of 54 references.
Cloze-driven Pretraining of Self-attention Networks
TLDR
A new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems is presented, along with a detailed analysis of the factors that contribute to effective pretraining, including data domain and size, model capacity, and variations on the cloze objective.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
TLDR
A new benchmark styled after GLUE is presented, comprising a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and which overcomes the limitations of BERT thanks to its autoregressive formulation.
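The permutation-language-modeling objective can be written compactly as follows, where Z_T denotes the set of all permutations z of the sequence indices 1..T and z_<t the first t-1 elements of a sampled order:

```latex
% XLNet objective: expected autoregressive log-likelihood over factorization orders
\max_{\theta} \;\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```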
Reducing BERT Pre-Training Time from 3 Days to 76 Minutes
TLDR
The LAMB optimizer is proposed, which helps to scale the batch size to 65536 without losing accuracy; it is a general optimizer that works for both small and large batch sizes and does not need hyper-parameter tuning besides the learning rate.
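For reference, the layer-wise LAMB step can be sketched as below, where φ is a norm-scaling function (identity in practice), λ a weight-decay coefficient, and the superscript (i) indexes a layer; this is a high-level paraphrase of the update, not a verbatim transcription from the paper.

```latex
% Per-layer LAMB step for layer i at iteration t
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}
r_t^{(i)} = \frac{\hat{m}_t^{(i)}}{\sqrt{\hat{v}_t^{(i)}} + \epsilon} + \lambda\, w_{t-1}^{(i)}
w_t^{(i)} = w_{t-1}^{(i)} - \eta_t \,
  \frac{\phi\!\left( \lVert w_{t-1}^{(i)} \rVert \right)}{\lVert r_t^{(i)} \rVert}\; r_t^{(i)}
```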
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
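A minimal sketch of BERT's masked-language-modeling corruption scheme (15% of tokens selected as prediction targets; of those, 80% replaced with [MASK], 10% with a random token, 10% left unchanged); the function below is a toy illustration over word-level tokens, not the original preprocessing code.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15):
    """Toy BERT-style masking; returns corrupted inputs and per-position labels."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mlm_prob:
            labels[i] = tok                       # model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: replace with a random token
            # else: 10%: keep the original token
    return inputs, labels

print(mask_tokens("the quick brown fox jumps over the lazy dog".split(),
                  vocab=["cat", "runs", "blue"]))
```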
SpanBERT: Improving Pre-training by Representing and Predicting Spans
TLDR
The approach extends BERT by masking contiguous random spans, rather than random tokens, and training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it.
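A toy sketch of the span-masking step, assuming span lengths drawn from a geometric distribution clipped at 10, as described in the paper; the masking budget is a parameter here, and the span-boundary objective itself is omitted.

```python
import random

def sample_span_mask(seq_len, mask_budget_ratio=0.15, p=0.2, max_span=10):
    """Toy SpanBERT-style masking: mask contiguous spans until ~15% of positions are covered."""
    masked = set()
    budget = int(seq_len * mask_budget_ratio)
    while len(masked) < budget:
        # Geometric span length: extend with probability (1 - p), clipped at max_span.
        length = 1
        while length < max_span and random.random() > p:
            length += 1
        start = random.randrange(seq_len)
        masked.update(range(start, min(start + length, seq_len)))
    return sorted(masked)

print(sample_span_mask(seq_len=64))
```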
Automatic differentiation in PyTorch
TLDR
An automatic differentiation module of PyTorch is described: a library designed to enable rapid research on machine learning models, focused on differentiation of purely imperative programs, with an emphasis on extensibility and low overhead.
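A minimal example of the define-by-run differentiation this module provides:

```python
import torch

# Gradients are recorded while ordinary, imperative Python code runs.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = x1^2 + x2^2 + x3^2
y.backward()         # reverse-mode automatic differentiation
print(x.grad)        # tensor([2., 4., 6.]) == dy/dx = 2x
```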
Scaling Neural Machine Translation
TLDR
This paper shows that reduced precision and large batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation.
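A minimal sketch of the two ingredients, reduced-precision forward/backward passes and gradient accumulation to emulate large batches, written against PyTorch's AMP utilities rather than the paper's fairseq implementation; it assumes a CUDA-capable GPU, and the model, learning rate, and accumulation factor are placeholders.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()
accum_steps = 16  # accumulate gradients to emulate a 16x larger batch

for step in range(160):
    x = torch.randn(32, 512, device="cuda")
    with autocast():                       # forward pass in reduced precision
        loss = model(x).pow(2).mean() / accum_steps
    scaler.scale(loss).backward()          # loss scaling avoids fp16 gradient underflow
    if (step + 1) % accum_steps == 0:      # delayed update = effective large batch
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```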
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
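The update rule, for reference (g_t is the gradient at step t, α the learning rate, β1 and β2 the moment decay rates):

```latex
% Adam step: biased first/second moment estimates, bias correction, adaptive update
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}
\hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{t}}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```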
A Surprisingly Robust Trick for the Winograd Schema Challenge
TLDR
This paper shows that the performance of three language models on WSC273 strongly improves when fine-tuned on a similar pronoun disambiguation problem dataset (denoted WSCR), and generates a large unsupervised WSC-like dataset.