GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, P. Zhang, Yuxiao Dong, and Jie Tang

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 and to unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we faced numerous unexpected technical and engineering challenges, particularly loss spikes and divergence. In this paper, we introduce the training process of GLM-130B, including its design choices…

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

BLOOM is a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers and achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning.

What Language Model to Train if You Have One Million GPU Hours?

An ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization is performed, and the performance of a multilingual model is studied in comparison with an English-only one.

Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations

Evaluation shows that Z-ICL outperforms previous zero-shot methods by a significant margin, and is on par with in-context learning with labeled training data in the few-shot setting.

On the Inconsistencies of Conditionals Learned by Masked Language Models

It is shown that the bidirectional conditionals learned by BERT-style MLMs often exhibit considerable inconsistencies, i.e., they cannot be derived from a coherent joint distribution when considered together. Consequently, T5-style MLMs capable of infilling will generate discrepant results depending on how many tokens are masked, which may represent a particular trust issue.

Large Language Models Are Human-Level Prompt Engineers

It is shown that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts.

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

A procedure for Int8 matrix multiplication for the feed-forward and attention projection layers in transformers is developed, which cuts the memory needed for inference by half while retaining full-precision performance and makes such models much more accessible.
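
The core idea can be illustrated with a minimal absmax int8 sketch, assuming per-tensor symmetric quantization (the actual LLM.int8() method additionally uses vector-wise scaling and a mixed-precision decomposition for outlier features, which this toy version omits; function names here are illustrative):

```python
def absmax_quantize(values):
    # Map floats into the signed 8-bit range [-127, 127] using one scale per tensor.
    scale = 127.0 / max(abs(v) for v in values)
    return [round(v * scale) for v in values], scale

def int8_dot(a, b):
    # Quantize both operands, accumulate in (wide) integer arithmetic,
    # then rescale the integer result back to floating point.
    qa, sa = absmax_quantize(a)
    qb, sb = absmax_quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))
    return acc / (sa * sb)

a = [0.12, -1.5, 0.33, 0.9]
b = [2.0, 0.25, -0.7, 1.1]
# The int8 result closely tracks the exact float dot product (0.624)
# while the stored operands take half the memory of fp16.
print(int8_dot(a, b))
```

Storing weights as int8 rather than fp16 is where the factor-of-two memory saving in the abstract comes from; the quantization error stays small as long as no single outlier dominates the absmax scale.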

The case for 4-bit precision: k-bit Inference Scaling Laws

The findings show that 4-bit precision is almost universally optimal for the trade-off between total model bits and zero-shot accuracy, and that it is challenging to improve the bit-level scaling trade-off.
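
The bit-level trade-off under study can be illustrated with a toy symmetric k-bit quantizer (a sketch only, not the paper's quantization scheme): halving the bit width halves storage but widens the rounding step, and the scaling laws ask where that exchange is optimal.

```python
def quantize_kbit(values, k):
    # Symmetric k-bit quantization: 2**(k-1) - 1 levels on each side of zero.
    levels = 2 ** (k - 1) - 1
    scale = levels / max(abs(v) for v in values)
    return [round(v * scale) / scale for v in values]

def max_error(values, k):
    # Worst-case rounding error introduced at k bits.
    return max(abs(v - q) for v, q in zip(values, quantize_kbit(values, k)))

weights = [0.8, -0.31, 0.07, -0.99, 0.45]
for k in (8, 4, 2):
    # Each bit removed roughly doubles the step size, hence the error.
    print(k, max_error(weights, k))
```

In this toy setting the error grows steadily as bits are removed; the paper's finding is that, across model scales, spending the saved bits on more parameters pays off down to about 4-bit precision but not below.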

In Defense of Cross-Encoders for Zero-Shot Retrieval


GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation

This work introduces GENIUS, a conditional text generation model which can be used as a strong, ready-to-use data augmentation tool for various natural language processing (NLP) tasks, and proposes GeniusAug, which extracts target-aware sketches from the original training set and then generates new samples based on the sketches.

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

The design decisions of publicly available instruction tuning methods are studied, and the development of Flan 2022 is broken down, showing that Flan-T5 requires less finetuning to converge on downstream tasks and motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks.

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

A model called ERNIE 3.0 Titan with up to 260 billion parameters is trained, which is the largest Chinese dense pre-trained model so far and outperforms the state-of-the-art models on 68 NLP datasets.

OPT: Open Pre-trained Transformer Language Models

This work presents Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which they aim to fully and responsibly share with interested researchers.

Unified Language Model Pre-training for Natural Language Understanding and Generation

A new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks is presented; it compares favorably with BERT on the GLUE benchmark and on the SQuAD 2.0 and CoQA question answering tasks.

CLUE: A Chinese Language Understanding Evaluation Benchmark

The first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark is introduced, an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.

Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning

This work proposes a method that incorporates large-scale distributed training performance into model architecture design, achieving excellent performance on thousands of GPUs during training and state-of-the-art results on NLP tasks.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning

ExMix (Extreme Mixture), a massive collection of 107 supervised NLP tasks across diverse domains and task families, is introduced, and a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix is proposed.

Finetuned Language Models Are Zero-Shot Learners

It is shown that instruction tuning, i.e., finetuning language models on a collection of datasets described via instructions, substantially improves zero-shot performance on unseen tasks and outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Efficient Large Scale Language Modeling with Mixtures of Experts

This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and fully supervised finetuning.