PaLM: Scaling Language Modeling with Pathways

@article{Chowdhery2022PaLMSL,
  title={PaLM: Scaling Language Modeling with Pathways},
  author={Aakanksha Chowdhery and Sharan Narang and Jacob Devlin and Maarten Bosma and Gaurav Mishra and Adam Roberts and Paul Barham and Hyung Won Chung and Charles Sutton and Sebastian Gehrmann and Parker Schuh and Kensen Shi and Sasha Tsvyashchenko and Joshua Maynez and Abhishek B Rao and Parker Barnes and Yi Tay and Noam M. Shazeer and Vinodkumar Prabhakaran and Emily Reif and Nan Du and Benton C. Hutchinson and Reiner Pope and James Bradbury and Jacob Austin and Michael Isard and Guy Gur-Ari and Pengcheng Yin and Toju Duke and Anselm Levskaya and Sanjay Ghemawat and Sunipa Dev and Henryk Michalewski and Xavier Garc{\'i}a and Vedant Misra and Kevin Robinson and Liam Fedus and Denny Zhou and Daphne Ippolito and David Luan and Hyeontaek Lim and Barret Zoph and Alexander Spiridonov and Ryan Sepassi and David Dohan and Shivani Agrawal and Mark Omernick and Andrew M. Dai and Thanumalayan Sankaranarayana Pillai and Marie Pellat and Aitor Lewkowycz and Erica Oliveira Moreira and Rewon Child and Oleksandr Polozov and Katherine Lee and Zongwei Zhou and Xuezhi Wang and Brennan Saeta and Mark D{\'i}az and Orhan Firat and Michele Catasta and Jason Wei and Kathleen S. Meier-Hellstern and Douglas Eck and Jeff Dean and Slav Petrov and Noah Fiedel},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.02311}
}
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU…
Impossible Triangle: What's Next for Pre-trained Language Models?
TLDR
It is argued that all existing PLMs lack one or more properties from the Impossible Triangle, and insights into future research directions for PLMs to achieve the Impossible Triangle are offered.
Large Language Models are Zero-Shot Reasoners
TLDR
Experimental results demonstrate that Zero-shot-CoT, using the same single prompt template, outperforms standard zero-shot LLM performance on diverse benchmark reasoning tasks including arithmetic, symbolic reasoning, and other logical reasoning tasks, without any hand-crafted few-shot examples.
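As a concrete illustration of the zero-shot chain-of-thought recipe summarized above, here is a minimal Python sketch of its two-stage prompting: a fixed trigger phrase elicits a reasoning chain, then a second prompt extracts the final answer. The generate callable and the exact answer-extraction wording are placeholders rather than the paper's verbatim templates.

# Minimal sketch of two-stage zero-shot chain-of-thought prompting.
# `generate` is a placeholder for any text-completion model call.
def zero_shot_cot(question: str, generate) -> str:
    # Stage 1: elicit a reasoning chain with the fixed trigger phrase.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)
    # Stage 2: append the generated reasoning and ask for the final answer.
    answer_prompt = f"{reasoning_prompt}{reasoning}\nTherefore, the answer is"
    return generate(answer_prompt)

if __name__ == "__main__":
    # Toy stand-in model so the sketch runs end to end.
    canned = iter([" 3 boxes with 4 apples each is 3 * 4 = 12.", " 12."])
    print(zero_shot_cot("How many apples are in 3 boxes of 4 apples?", lambda _: next(canned)))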
Prefix Language Models are Unified Modal Learners
TLDR
It is demonstrated that a unified modal model can be learned with a prefix language modeling objective over text and image sequences, and a well-defined benchmark is offered for future research by reporting performance at different pre-training dataset scales over a heterogeneous distribution with wide coverage.
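To make the prefix language modeling objective mentioned above concrete, the following is a small illustrative sketch (not code from the paper) of the attention mask such an objective implies: tokens in the prefix attend to each other bidirectionally, while the remaining tokens are predicted left-to-right under a causal mask.

import numpy as np

def prefix_lm_attention_mask(prefix_len: int, total_len: int) -> np.ndarray:
    # True where position i may attend to position j: full attention within
    # the prefix, causal attention over the generated continuation.
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))  # causal base
    mask[:, :prefix_len] = True  # every position may see the whole prefix
    return mask

# Example: a 3-token prefix followed by 2 generated tokens.
print(prefix_lm_attention_mask(3, 5).astype(int))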
Flamingo: a Visual Language Model for Few-Shot Learning
TLDR
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.
OPT: Open Pre-trained Transformer Language Models
TLDR
This work presents Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which they aim to fully and responsibly share with interested researchers.
On the Advance of Making Language Models Better Reasoners
TLDR
This paper conducts extensive experiments using the latest language model code-davinci-002 and demonstrates that DiVeRSe can achieve new state-of-the-art performance on six out of eight reasoning benchmarks, outperforming the PaLM model with 540B parameters.
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks
TLDR
The proposed Language-Interfaced Fine-Tuning (LIFT) makes no changes to the model architecture or loss function; it relies solely on the natural language interface, enabling “no-code machine learning with LMs,” and performs relatively well across a wide range of low-dimensional classification and regression tasks.
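The "natural language interface" idea can be pictured with a small hypothetical sketch: a tabular example is serialized into a plain-text prompt/completion pair, which an off-the-shelf language model can then be fine-tuned on. The field names and wording below are illustrative, not LIFT's exact templates.

# Hypothetical serialization in the spirit of a language-interfaced setup:
# a tabular example becomes a plain-English prompt/completion pair.
def to_text_example(features: dict, label) -> dict:
    prompt = (
        "Given "
        + ", ".join(f"{name} is {value}" for name, value in features.items())
        + ", what is the target?"
    )
    return {"prompt": prompt, "completion": f" {label}"}

print(to_text_example({"sepal length": 5.1, "sepal width": 3.5}, "setosa"))
# {'prompt': 'Given sepal length is 5.1, sepal width is 3.5, what is the target?', 'completion': ' setosa'}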
Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing
TLDR
Limits of current techniques for effectively leveraging model scale for compositional generalization are highlighted, while the analysis also suggests promising directions for future work.
Instruction Induction: From Few Examples to Natural Language Task Descriptions
TLDR
It is discovered that, to a large extent, the ability to generate instructions does indeed emerge when using a model that is both large enough and aligned to follow instructions; this surprising result suggests that instruction induction might be a viable learning paradigm in and of itself.
On the Role of Bidirectionality in Language Model Pre-Training
TLDR
The results suggest that, while recent scaling efforts have focused on left-to-right autoregressive models, it might be worthwhile to develop very large bidirectional models.
...

References

Language Models are Few-Shot Learners
TLDR
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
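As a reminder of what few-shot (in-context) learning looks like in practice, here is a minimal sketch of a prompt for the 3-digit arithmetic setting mentioned above: the demonstrations live entirely in the prompt and no weights are updated. The formatting is illustrative, not GPT-3's evaluation template.

# Build a few-shot prompt for 3-digit addition from in-context demonstrations.
def few_shot_prompt(demos, query):
    lines = [f"Q: What is {a} + {b}?\nA: {a + b}" for a, b in demos]
    lines.append(f"Q: What is {query[0]} + {query[1]}?\nA:")
    return "\n\n".join(lines)

print(few_shot_prompt([(123, 456), (731, 204)], (312, 489)))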
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
TLDR
This paper proposes and develops a family of language models named GLaM, which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
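The toy sketch below illustrates the sparsely activated mixture-of-experts idea behind GLaM: a learned gate scores all experts, but only the top-2 are evaluated per token, so per-token compute grows far more slowly than total parameter count. It is a simplified illustration (single tanh-layer "experts", no load-balancing loss), not GLaM's implementation.

import numpy as np

def top2_moe(x, gate_w, expert_ws):
    # Toy sparsely activated MoE layer: each token is routed to its top-2
    # experts and their outputs are mixed by the renormalized gate weights.
    logits = x @ gate_w                                  # [tokens, n_experts]
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top2 = np.argsort(-probs[t])[:2]                 # only 2 experts run per token
        weights = probs[t, top2] / probs[t, top2].sum()  # renormalize over the chosen pair
        for w, e in zip(weights, top2):
            out[t] += w * np.tanh(x[t] @ expert_ws[e])   # toy expert: one tanh layer
    return out

rng = np.random.default_rng(0)
tokens, d, n_experts = 4, 8, 16
y = top2_moe(rng.normal(size=(tokens, d)),
             rng.normal(size=(d, n_experts)),
             rng.normal(size=(n_experts, d, d)))
print(y.shape)  # (4, 8)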
Finetuned Language Models Are Zero-Shot Learners
TLDR
It is shown that instruction tuning, i.e., finetuning language models on a collection of datasets described via instructions, substantially improves zero-shot performance on unseen tasks and outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.
Multitask Prompted Training Enables Zero-Shot Task Generalization
TLDR
A system is developed for easily mapping any natural language task into a human-readable prompted form, and a pretrained encoder-decoder model is fine-tuned on this multitask mixture covering a wide variety of tasks.
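The two preceding entries rest on the same mechanical step: rewriting examples from existing supervised datasets into natural-language instruction or prompted form before fine-tuning. The sketch below applies one such template to an NLI example; the wording is illustrative rather than either paper's exact template.

# Illustrative instruction template (not the papers' exact wording) that
# turns an NLI example into an instruction-style training pair.
NLI_TEMPLATE = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no, or maybe."
)

def to_prompted_example(example: dict) -> dict:
    return {"input": NLI_TEMPLATE.format(**example), "target": example["label"]}

print(to_prompted_example({
    "premise": "A dog is running through the snow.",
    "hypothesis": "An animal is outside.",
    "label": "yes",
}))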
Designing Effective Sparse Expert Models
TLDR
This work scales a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B), and for the first time achieves state-of-the-art performance in transfer learning.
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
TLDR
The infrastructure and 3D parallelism methodology used to train Megatron-Turing NLG 530B (MT-NLG), the largest monolithic transformer-based language model at 530 billion parameters, are presented.
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
TLDR
This paper presents an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher.
Improving Language Understanding by Generative Pre-Training
TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
...