Corpus ID: 202888986

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

@article{Lan2020ALBERTAL,
  title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
  author={Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},
  journal={ArXiv},
  year={2020},
  volume={abs/1909.11942}
}
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better… 
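For context, the two parameter-reduction techniques the paper refers to are factorized embedding parameterization and cross-layer parameter sharing. Below is a minimal plain-Python sketch of the first idea, counting embedding parameters only; the vocabulary, hidden, and embedding sizes are illustrative assumptions (roughly an ALBERT-xxlarge-like configuration), not figures quoted from the paper.

```python
# Sketch: factorized embedding parameterization (parameter counting only).
# Sizes below are illustrative assumptions, not values quoted from the paper.
V = 30_000  # vocabulary size
H = 4_096   # Transformer hidden size
E = 128     # factorized embedding size (E << H)

# BERT ties the token-embedding width to the hidden size: one V x H matrix.
tied_params = V * H

# ALBERT factorizes it into a V x E embedding plus an E x H projection.
factorized_params = V * E + E * H

print(f"tied (V*H):             {tied_params:,}")        # 122,880,000
print(f"factorized (V*E + E*H): {factorized_params:,}")  # 4,364,288
print(f"reduction:              {tied_params / factorized_params:.1f}x")
```

The second technique, cross-layer parameter sharing, reuses a single set of Transformer-layer weights across all layers, so the layer parameter count stays constant as the network gets deeper.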

Citations

Poor Man's BERT: Smaller and Faster Transformer Models
TLDR
A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.
bert2BERT: Towards Reusable Pretrained Language Models
In recent years, researchers tend to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training costs intensive computational resources…
MC-BERT: Efficient Language Pre-Training via a Meta Controller
TLDR
Results on the GLUE natural language understanding benchmark demonstrate that the proposed MC-BERT method is both efficient and effective: it outperforms baselines on GLUE semantic tasks given the same computational budget.
Pre-training a BERT with Curriculum Learning by Increasing Block-Size of Input Text
TLDR
A new curriculum-learning (CL) method is proposed that gradually increases the block size of the input text used to train the self-attention mechanism of BERT and its variants with the maximum available batch size, and it outperforms the baseline in convergence speed and final performance on downstream tasks.
ConvBERT: Improving BERT with Span-based Dynamic Convolution
TLDR
A novel span-based dynamic convolution is proposed to replace a subset of self-attention heads and directly model local dependencies, forming a new mixed attention block that is more efficient at both global and local context learning.
Rethinking Relational Encoding in Language Model: Pre-Training for General Sequences
TLDR
It is posited that while LMPT can effectively model per-token relations, it fails at modeling per-sequence relations in non-natural language domains, and a framework is developed that couples LMPT with deep structure-preserving metric learning to produce richer embeddings than can be obtained from LMPT alone.
Compressing Pre-trained Language Models by Matrix Decomposition
TLDR
A two-stage model-compression method to reduce a model’s inference time cost by first decomposing the matrices in the model into smaller matrices and then performing feature distillation on the internal representation to recover from the decomposition (a generic low-rank factorization of this kind is sketched after this list).
RefBERT: Compressing BERT by Referencing to Pre-computed Representations
TLDR
RefBERT is proposed to leverage the knowledge learned from the teacher, i.e., facilitating the pre-computed BERT representation on the reference sample and compressing BERT into a smaller student model, which is 7.4x smaller and 9.5x faster on inference than BERT-base.
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
TLDR
This work presents SRU++, a highly efficient architecture that combines fast recurrence and attention for sequence modeling, exhibits strong modeling capacity and training efficiency, and suggests jointly leveraging fast recurrence with little attention as a promising direction for accelerating model training and inference.
Extremely Small BERT Models from Mixed-Vocabulary Training
TLDR
This method compresses BERT-LARGE to a task-agnostic model with smaller vocabulary and hidden dimensions, which is an order of magnitude smaller than other distilled BERT models and offers a better size-accuracy trade-off on language understanding benchmarks as well as a practical dialogue task.
…
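Several of the citing papers above shrink BERT by factoring its weight matrices (see the matrix-decomposition entry in the list). The following is a generic truncated-SVD sketch under assumed shapes and rank, not the specific procedure of any cited paper; that work additionally applies feature distillation to recover accuracy lost in the decomposition.

```python
# Generic low-rank compression of one weight matrix via truncated SVD.
# Shapes and rank are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 3072))  # stand-in for a feed-forward weight matrix
rank = 128                            # assumed target rank

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]            # 768 x 128
B = Vt[:rank, :]                      # 128 x 3072

# W (768*3072 = 2,359,296 params) is replaced by A and B (491,520 params);
# the matmul x @ W becomes (x @ A) @ B.
x = rng.standard_normal((4, 768))
rel_err = np.linalg.norm(x @ W - (x @ A) @ B) / np.linalg.norm(x @ W)
print(f"params: {W.size:,} -> {A.size + B.size:,}, relative error: {rel_err:.3f}")
```

A random matrix is far from low rank, so the printed error here is large; pre-trained weight matrices tend to be much better approximated at a given rank, which is what makes this family of methods usable in practice.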

References

SHOWING 1-10 OF 83 REFERENCES
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks (the masked-token corruption this pre-training relies on is sketched after the reference list).
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR
It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
TLDR
It is shown that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work.
Efficient Training of BERT by Progressively Stacking
TLDR
This paper proposes the stacking algorithm to transfer knowledge from a shallow model to a deep model, then applies stacking progressively to accelerate BERT training, and shows that models trained with this strategy achieve performance similar to models trained from scratch while training much faster.
StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
TLDR
Inspired by the linearization exploration work of Elman, BERT is extended to a new model, StructBERT, by incorporating language structures into pre-training, and the new model is adapted to different levels of language understanding required by downstream tasks.
Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation
TLDR
It is observed that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
BAM! Born-Again Multi-Task Networks for Natural Language Understanding
TLDR
This work proposes using knowledge distillation where single-task models teach a multi-task model, and enhances this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers.
Adaptive Input Representations for Neural Language Modeling
TLDR
Adaptive input representations for neural language modeling, which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity, are introduced, and a systematic comparison of popular choices for a self-attentional architecture is performed.
…
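As a companion to the BERT reference above, here is a hedged sketch of the masked-language-model input corruption BERT pre-trains with (15% of tokens are selected; of those, 80% become [MASK], 10% become a random token, and 10% are kept unchanged). The token IDs, MASK_ID, and vocabulary size are placeholders, not values from any real tokenizer.

```python
# Sketch of BERT-style masked-language-model corruption (15% / 80-10-10 recipe).
# MASK_ID, VOCAB_SIZE, and the sample token IDs are made-up placeholders.
import random

MASK_ID = 103
VOCAB_SIZE = 30_000

def mask_tokens(token_ids, rng, mask_prob=0.15):
    """Return (corrupted_ids, labels); labels use -100 at positions the loss ignores."""
    corrupted, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            labels.append(tok)                               # predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_ID)                    # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tok)                        # 10%: keep original
        else:
            corrupted.append(tok)
            labels.append(-100)
    return corrupted, labels

rng = random.Random(0)
print(mask_tokens([2023, 2003, 1037, 7099, 6251], rng))
```

ALBERT keeps this masked-LM objective and replaces BERT's next-sentence prediction with sentence-order prediction.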