Corpus ID: 235731579

ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, Haifeng Wang
Pre-trained models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. Recent works such as T5 [1] and GPT-3 [2] have shown that scaling up pre-trained language models can improve their generalization abilities. Particularly, the GPT-3 model with 175 billion parameters shows its strong task-agnostic zero-shot/few-shot learning capabilities. Despite their success, these large-scale models are trained on plain texts without introducing knowledge such as…
MvSR-NAT: Multi-view Subset Regularization for Non-Autoregressive Machine Translation
This work proposes Multi-view Subset Regularization (MvSR), a novel regularization method for improving the performance of the NAT model, which achieves gains of 0.36-1.14 BLEU over previous NAT models on three public benchmarks.
PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation
The experimental results indicate that PLATO-XL obtains state-of-the-art results across multiple conversational tasks, verifying its potential as a foundation model of conversational AI.
CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation
  • Yunfan Shao, Zhichao Geng, +5 authors Xipeng Qiu
  • Computer Science
  • ArXiv
  • 2021
The unbalanced Transformer reduces computational and storage cost, which makes CPT competitive and greatly accelerates inference for text generation.
M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining
  • Junyang Lin, An Yang, +9 authors Hongxia Yang
  • Computer Science
  • 2021
This paper demonstrates the pretraining of an unprecedented 10-trillion-parameter model, an order of magnitude larger than the state of the art, on only 512 GPUs within 10 days, and provides a technique, Granular CPU offloading, to manage CPU memory when training large models while maintaining high GPU utilization.
MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators
  • Zhixing Tan, Xiangwen Zhang, Shuo Wang, Yang Liu
  • Computer Science
  • 2021
Pre-trained language models have recently been shown to be able to perform translation without finetuning via prompting. Inspired by these findings, we study improving the performance of pre-trained…
Yuan 1.0: Large-S…
  • Xudong Zhao, Tong Yu, +7 authors Xuanwei Zhang
  • 2021
Recent work like GPT-3 has demonstrated excellent performance of Zero-Shot and Few-Shot learning on many natural language processing (NLP) tasks by scaling up model size, dataset size and the amount…


Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model
This work proposes a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities, and consistently outperforms BERT on four entity-related question answering datasets.
CPM: A Large-scale Generative Chinese Pre-trained Language Model
CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding.
ERNIE: Enhanced Language Representation with Informative Entities
This paper utilizes both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE) which can take full advantage of lexical, syntactic, and knowledge information simultaneously, and is comparable with the state-of-the-art model BERT on other common NLP tasks.
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
A continual pre-training framework named ERNIE 2.0 is proposed, which incrementally builds and learns pre-training tasks through constant multi-task learning and outperforms BERT and XLNet on 16 tasks.
ZEN 2.0: Continue Training and Adaption for N-gram Enhanced Text Encoders
This paper proposes to pretrain n-gram-enhanced encoders with a large volume of data and advanced techniques for training and tries to extend the encoder to different languages as well as different domains, where it is confirmed that the same architecture is applicable to these varying circumstances.
CPM-2: Large-scale Cost-effective Pre-trained Language Models
A suite of cost-effective techniques for the use of PLMs is presented to deal with the efficiency issues of pre-training, fine-tuning, and inference, and knowledge inheritance is introduced to accelerate the pretraining process by exploiting existing PLMs instead of training models from scratch.
CoLAKE: Contextualized Language and Knowledge Embedding
The Contextualized Language and Knowledge Embedding (CoLAKE) is proposed, which jointly learns contextualized representation for both language and knowledge with the extended MLM objective, and achieves surprisingly high performance on a synthetic task called word-knowledge graph completion, which shows the superiority of simultaneously contextualizing language and knowledge representation.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation
The experimental results demonstrate the superior capabilities of PanGu-α in performing various tasks under few-shot or zero-shot settings, and the effect of model scale on few-shot performance is investigated across a broad range of Chinese NLP tasks.
Improving Language Understanding by Generative Pre-Training
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.