Corpus ID: 235248419

Knowledge Inheritance for Pre-trained Language Models

Authors: Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou
Recent explorations of large-scale pre-trained language models (PLMs) such as GPT-3 have revealed the power of PLMs with huge numbers of parameters, setting off a wave of training ever-larger PLMs. However, training a large-scale PLM requires tremendous amounts of computational resources, which is time-consuming and expensive. In addition, existing large-scale PLMs are mainly trained from scratch individually, ignoring the availability of many existing well-trained PLMs. To this end, we explore…
CPM-2: Large-scale Cost-effective Pre-trained Language Models
A suite of cost-effective techniques for using PLMs is presented to deal with the efficiency issues of pre-training, fine-tuning, and inference, and knowledge inheritance is introduced to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing
This comprehensive survey paper explains core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods, presents a new taxonomy of T-PTLMs, and gives a brief overview of various benchmarks, both intrinsic and extrinsic.
bert2BERT: Towards Reusable Pretrained Language Models
  • Cheng Chen, Yichun Yin, +7 authors Qun Liu
  • Computer Science
  • 2021
In recent years, researchers have tended to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training costs intensive computational…


Patient Knowledge Distillation for BERT Model Compression
This work proposes a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student), which translates into improved results on multiple NLP tasks with significant gain in training efficiency, without sacrificing model accuracy.
Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
This work proposes a method based on progressive layer dropping that speeds up the training of Transformer-based language models, not by using excessive hardware resources but through efficiency gains from model architecture changes and training techniques.
ERNIE: Enhanced Language Representation with Informative Entities
This paper utilizes both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE) which can take full advantage of lexical, syntactic, and knowledge information simultaneously, and is comparable with the state-of-the-art model BERT on other common NLP tasks.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
Efficient Training of BERT by Progressively Stacking
This paper proposes a stacking algorithm to transfer knowledge from a shallow model to a deep model, then applies stacking progressively to accelerate BERT training, and shows that models trained with this strategy achieve performance similar to models trained from scratch while training much faster.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Improving Language Understanding by Generative Pre-Training
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation
The experimental results demonstrate the superior capabilities of PanGu-α in performing various tasks under few-shot or zero-shot settings and investigate the effect of model scales on the few-shot performances across a broad range of Chinese NLP tasks.
Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
It is consistently found that multi-phase adaptive pretraining offers large gains in task performance, and it is shown that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable.