Corpus ID: 202888986

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

@article{Lan2020ALBERTAL,
  title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
  author={Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},
  journal={ArXiv},
  year={2020},
  volume={abs/1909.11942}
}
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. [...] Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on …
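The inter-sentence coherence objective mentioned above is trained on ordered versus swapped pairs of text segments (ALBERT's sentence-order prediction loss). Below is a minimal Python sketch of how such training pairs can be constructed; the function name and example sentences are illustrative assumptions, not the authors' code.

```python
import random

def make_sop_example(segment_a: str, segment_b: str) -> tuple[str, str, int]:
    """Return (first, second, label): 1 = original document order, 0 = swapped."""
    if random.random() < 0.5:
        return segment_a, segment_b, 1   # consecutive segments kept in document order
    return segment_b, segment_a, 0       # same two segments with their order swapped

# Illustrative segments only; real training would draw consecutive segments from a corpus.
print(make_sop_example(
    "The model shares parameters across layers.",
    "This keeps the parameter count low as depth grows.",
))
```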
MC-BERT: Efficient Language Pre-Training via a Meta Controller
TLDR
Results over the GLUE natural language understanding benchmark demonstrate that the proposed MC-BERT method is both efficient and effective: it outperforms baselines on GLUE semantic tasks given the same computational budget.
Poor Man's BERT: Smaller and Faster Transformer Models
TLDR
A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.
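As a rough illustration of the kind of post-hoc reduction strategy explored there, the sketch below drops the top blocks of an already-trained Transformer encoder held as a PyTorch nn.ModuleList; the stand-in encoder and the number of layers kept are assumptions for illustration, and the truncated model would still be fine-tuned afterwards.

```python
import copy
import torch.nn as nn

def drop_top_layers(encoder_layers: nn.ModuleList, keep: int) -> nn.ModuleList:
    """Keep only the bottom `keep` Transformer blocks of a pretrained encoder."""
    return nn.ModuleList(copy.deepcopy(layer) for layer in list(encoder_layers)[:keep])

# Stand-in 12-layer encoder; in practice the blocks would come from a pretrained model.
full = nn.ModuleList(nn.TransformerEncoderLayer(d_model=768, nhead=12) for _ in range(12))
small = drop_top_layers(full, keep=7)    # roughly a 40% reduction in depth
print(len(small))                         # -> 7
```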
RefBERT: Compressing BERT by Referencing to Pre-computed Representations
TLDR
RefBERT is proposed to leverage the knowledge learned from the teacher, i.e., utilizing the pre-computed BERT representation of the reference sample while compressing BERT into a smaller student model, which is 7.4x smaller and 9.5x faster at inference than BERT-base.
Compressing Pre-trained Language Models by Matrix Decomposition
TLDR
A two-stage model-compression method to reduce a model's inference time cost by first decomposing the matrices in the model into smaller matrices and then performing feature distillation on the internal representation to recover from the decomposition.
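A minimal sketch of the first stage (replacing a weight matrix by two smaller factors), assuming a plain truncated SVD; the rank and matrix shape are hypothetical, and the paper's feature-distillation stage is not shown.

```python
import torch

def low_rank_factor(W: torch.Tensor, rank: int):
    """Truncated SVD: W (out x in) is approximated by A (out x rank) @ B (rank x in)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]    # absorb the singular values into the left factor
    B = Vh[:rank, :]
    return A, B

W = torch.randn(768, 3072)         # e.g. one feed-forward projection matrix
A, B = low_rank_factor(W, rank=128)
print(A.shape, B.shape)            # (768, 128) and (128, 3072)
# 768*3072 ≈ 2.36M parameters vs. 768*128 + 128*3072 ≈ 0.49M after factorization
```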
Undivided Attention: Are Intermediate Layers Necessary for BERT?
TLDR
This work shows that reducing the number of intermediate layers and modifying the architecture of BERT-base results in minimal loss in fine-tuning accuracy for downstream tasks while decreasing the number of parameters and the training time of the model.
Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning
TLDR
A novel deep bidirectional language model called a Transformer-based Text Autoencoder (T-TA) is proposed, which computes contextual language representations without repetition and displays the benefits of a deep bidirectional architecture, such as that of BERT.
Towards Effective Utilization of Pre-trained Language Models
  • Linqing Liu
  • 2020
In the natural language processing (NLP) literature, neural networks are becoming increasingly deeper and more complex. Recent advancements in neural NLP are large pretrained language models (e.g. …
AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models
TLDR
This paper carefully designs one-shot learning techniques and the search space to provide an adaptive and efficient way of developing tiny PLMs for various latency constraints, and proposes a development method that is even faster than developing a single PLM.
AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing
TLDR
This comprehensive survey paper explains core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods; presents a new taxonomy of T-PTLMs; and gives a brief overview of various benchmarks, both intrinsic and extrinsic.
Hierarchical Multitask Learning Approach for BERT
TLDR
This work adopts hierarchical multitask learning approaches for BERT pre-training, and shows that imposing a task hierarchy in pre-training improves the performance of embeddings.

References

Showing 1-10 of 66 references
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
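The "one additional output layer" setup described in that summary amounts to a single linear classifier over the encoder's [CLS] representation; the sketch below shows this pattern with assumed dimensions and a binary task, not BERT's released code.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """A single linear layer over the encoder's [CLS] representation."""
    def __init__(self, hidden_size: int = 768, num_labels: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output: torch.Tensor) -> torch.Tensor:
        cls = sequence_output[:, 0]        # first position holds the [CLS] token
        return self.classifier(cls)        # task logits

head = ClassificationHead()
logits = head(torch.randn(4, 128, 768))    # (batch, seq_len, hidden) -> (4, 2)
print(logits.shape)
```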
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR
It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
TLDR
It is shown that pre-training remains important in the context of smaller architectures, and that fine-tuning pre-trained compact models can be competitive with more elaborate methods proposed in concurrent work.
Efficient Training of BERT by Progressively Stacking
TLDR
This paper proposes a stacking algorithm to transfer knowledge from a shallow model to a deep model, applies stacking progressively to accelerate BERT training, and shows that models trained with this strategy achieve performance similar to models trained from scratch while training much faster.
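A minimal sketch of the stacking idea, assuming the encoder blocks are held in a PyTorch nn.ModuleList: a deeper model is initialized by duplicating the trained shallow blocks, after which training continues. The stand-in layer construction and sizes are illustrative, not the paper's implementation.

```python
import copy
import torch.nn as nn

def stack(layers: nn.ModuleList) -> nn.ModuleList:
    """Double the depth; the new top half starts from copies of the trained bottom half."""
    return nn.ModuleList(copy.deepcopy(layer) for layer in list(layers) + list(layers))

shallow = nn.ModuleList(nn.TransformerEncoderLayer(d_model=768, nhead=12) for _ in range(6))
deep = stack(shallow)    # 12 independent blocks, warm-started from the shallow model
print(len(deep))          # -> 12
```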
StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
TLDR
Inspired by the linearization exploration work of Elman, BERT is extended to a new model, StructBERT, by incorporating language structures into pre-training, and the new model is adapted to different levels of language understanding required by downstream tasks.
Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation
TLDR
It is observed that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn a range of language tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
BAM! Born-Again Multi-Task Networks for Natural Language Understanding
TLDR
This work proposes using knowledge distillation where single-task models teach a multi-task model, and enhances this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers.
Adaptive Input Representations for Neural Language Modeling
TLDR
Adaptive input representations for neural language modeling, which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity, are introduced, and a systematic comparison of popular choices for a self-attentional architecture is performed.