• Corpus ID: 202888986

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

  title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
  author={Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better… 

Figures and Tables from this paper

Poor Man's BERT: Smaller and Faster Transformer Models

A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.

MC-BERT: Efficient Language Pre-Training via a Meta Controller

Results over GLUE natural language understanding benchmark demonstrate that the proposed MC-BERT method is both efficient and effective: it outperforms baselines on GLUE semantic tasks given the same computational budget.

Pre-training a BERT with Curriculum Learning by Increasing Block-Size of Input Text

A new CL method is proposed which gradually increases the block-size of input text for training the self-attention mechanism of BERT and its variants using the maximum available batch-size and outperforms the baseline in terms of convergence speed and final performance on downstream tasks.

Rethinking Relational Encoding in Language Model: Pre-Training for General Sequences

It is posited that while LMPT can effectively model pertoken relations, it fails at modeling per-sequence relations in non-natural language domains, and a framework is developed that couples LMPt with deep structure-preserving metric learning to produce richer embeddings than can be obtained from L MPT alone.

Compressing Pre-trained Language Models by Matrix Decomposition

A two-stage model-compression method to reduce a model’s inference time cost by first decomposing the matrices in the model into smaller matrices and then performing feature distillation on the internal representation to recover from the decomposition.

RefBERT: Compressing BERT by Referencing to Pre-computed Representations

RefBERT is proposed to leverage the knowledge learned from the teacher, i.e., facilitating the pre-computed BERT representation on the reference sample and compressing BERT into a smaller student model, which is 7.4x smaller and 9.5x faster on inference than BERTBASE.

When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

This work presents SRU++, a highly-efficient architecture that combines fast recurrence and attention for sequence modeling that exhibits strong modeling capacity and training efficiency and suggests jointly leveragingFast recurrence with little attention as a promising direction for accelerating model training and inference.

Extremely Small BERT Models from Mixed-Vocabulary Training

This method compresses BERT-LARGE to a task-agnostic model with smaller vocabulary and hidden dimensions, which is an order of magnitude smaller than other distilled BERT models and offers a better size-accuracy trade-off on language understanding benchmarks as well as a practical dialogue task.

Undivided Attention: Are Intermediate Layers Necessary for BERT?

This work shows that reducing the number of intermediate layers and modifying the architecture for BERTBASE results in minimal loss in fine-tuning accuracy for downstream tasks while decreasing thenumber of parameters and training time of the model.

Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning

A novel deep bidirectional language model called a Transformer-based Text Autoencoder (T-TA) is proposed, which computes contextual language representations without repetition and displays the benefits of a deepbidirectional architecture, such as that of BERT.



RoBERTa: A Robustly Optimized BERT Pretraining Approach

It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

It is shown that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work.

Efficient Training of BERT by Progressively Stacking

This paper proposes the stacking algorithm to transfer knowledge from a shallow model to a deep model; then it applies stacking progressively to accelerate BERT training, and shows that the models trained by the training strategy achieve similar performance to models trained from scratch, but the algorithm is much faster.

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

Inspired by the linearization exploration work of Elman, BERT is extended to a new model, StructBERT, by incorporating language structures into pre-training, and the new model is adapted to different levels of language understanding required by downstream tasks.

Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation

It is observed that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autore progressive formulation.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

BAM! Born-Again Multi-Task Networks for Natural Language Understanding

This work proposes using knowledge distillation where single- task models teach a multi-task model, and enhances this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi- task model surpass its single-task teachers.

Adaptive Input Representations for Neural Language Modeling

Adapt input representations for neural language modeling which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity are introduced and a systematic comparison of popular choices for a self-attentional architecture is performed.

Patient Knowledge Distillation for BERT Model Compression

This work proposes a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student), which translates into improved results on multiple NLP tasks with a significant gain in training efficiency, without sacrificing model accuracy.