Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Wei Yu, Vincent Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed Huai-hsin Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc Le, Jason Wei
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups… 

Transcending Scaling Laws with 0.1% Extra Compute

U-PaLM outperforms PaLM on many few-shot setups, e.g., English NLP tasks, reasoning tasks with chain-of-thought, multilingual tasks, MMLU, and challenging BIG-Bench tasks, and substantially improves the scaling properties of large language models on downstream metrics.

Multitask Vision-Language Prompt Tuning

This paper demonstrates the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task, and shows that many target tasks can benefit from sharing prompt vectors and can thus be learned jointly via multitask prompt tuning.

Galactica: A Large Language Model for Science

Galactica is introduced: a large language model that can store, combine, and reason about scientific knowledge, and that sets a new state of the art on downstream tasks such as PubMedQA (77.6%) and MedMCQA dev (52.9%).

Emergent Abilities of Large Language Models

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models.

Help me write a Poem - Instruction Tuning as a Vehicle for Collaborative Poetry Writing

Recent work in training large language models (LLMs) to follow natural language instructions has opened up exciting opportunities for natural language interface design. Building on the prior success…

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

The design decisions of publicly available instruction tuning methods are studied, and the development of Flan 2022 is broken down, showing Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, suggesting instruction-tuned models as more computationally efficient starting checkpoints for new tasks.

REPLUG: Retrieval-Augmented Black-Box Language Models

REPLUG is introduced, a retrieval-augmented language modeling framework that treats the language model (LM) as a black box and augments it with a tunable retrieval model, and that can be easily applied to any existing retrieval and language models.
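The black-box recipe can be sketched in a few lines: retrieve the documents most similar to the input, prepend each to the prompt, and let the frozen LM score the augmented prompts for ensembling. A minimal sketch with a toy bag-of-words retriever; the function names and the retrieval scheme are illustrative, not REPLUG's actual API:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def replug_prompts(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Prepend each retrieved document to the query; the frozen LM would
    score each augmented prompt and the outputs would be ensembled."""
    return [f"{d}\n\n{query}" for d in retrieve(query, docs, k)]

docs = [
    "PaLM is a 540B parameter language model.",
    "Retrieval augments language models with external text.",
    "Bananas are rich in potassium.",
]
prompts = replug_prompts("How does retrieval help language models?", docs, k=1)
```

Because the LM is never updated, only the retriever needs tuning, which is what makes the framework applicable to models served purely through an API.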

Specializing Smaller Language Models towards Multi-Step Reasoning

This work shows two important aspects of model abilities: there exists a complex balance/tradeoff among language models' multi-dimensional abilities, and, by paying the price of decreased generic ability, one can clearly lift the scaling curve of models smaller than 10B towards a specialized multi-step math reasoning ability.

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization


Multimodal Chain-of-Thought Reasoning in Language Models

This work proposes Multimodal-CoT, a model under 1 billion parameters that incorporates vision features in a decoupled training framework, outperforms the previous state-of-the-art LLM by 16% on the ScienceQA benchmark, and even surpasses human performance.

PaLM: Scaling Language Modeling with Pathways

A 540-billion-parameter, densely activated Transformer language model called PaLM achieves breakthrough performance, outperforming the state of the art on a suite of multi-step reasoning tasks and exceeding average human performance on the recently released BIG-bench benchmark.

Finetuned Language Models Are Zero-Shot Learners

It is shown that instruction tuning (finetuning language models on a collection of datasets described via instructions) substantially improves zero-shot performance on unseen tasks and outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.
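The core data transformation behind instruction tuning is phrasing each labeled example under several natural-language templates, so the model learns the instruction format rather than one fixed input layout. A minimal sketch for an NLI example; the template wordings below are illustrative, not the exact FLAN templates:

```python
# Two illustrative instruction templates for an NLI example.
TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, "
    "can we conclude that \"{hypothesis}\"?",
]

def to_instruction_examples(premise: str, hypothesis: str,
                            label: str) -> list[tuple[str, str]]:
    """Render one labeled NLI pair under every template, producing
    (input, target) pairs for supervised finetuning."""
    return [(t.format(premise=premise, hypothesis=hypothesis), label)
            for t in TEMPLATES]

pairs = to_instruction_examples(
    "A dog is running in the park.", "An animal is outside.", "yes")
```

Scaling the number of tasks rendered this way is what drives the zero-shot gains on held-out tasks.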

Large Language Models Can Self-Improve

This work uses a pre-trained LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using chain-of-thought prompting and self-consistency, and conducts ablation studies showing that finetuning on reasoning is critical for self-improvement.

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

It is found that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass average human-rater performance on 10 of the 23 tasks, and Codex to surpass it on 17 of the 23 tasks.
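Mechanically, CoT prompting just means prepending worked exemplars whose answers spell out their intermediate reasoning before the final answer. A minimal sketch; the exemplar wording below is illustrative, not taken from the BBH prompts:

```python
# One worked exemplar with its reasoning chain written out.
EXEMPLAR = (
    "Q: Roger has 5 balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def cot_prompt(question: str, n_exemplars: int = 1) -> str:
    """Prepend worked step-by-step exemplars so the model imitates the
    reasoning format before answering the new question."""
    return EXEMPLAR * n_exemplars + f"Q: {question}\nA:"

prompt = cot_prompt("A juggler has 16 balls and drops half. How many remain?")
```

The model's continuation then tends to follow the same step-by-step pattern, which is what lifts performance on the hardest BBH tasks.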

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

This paper proposes and develops a family of language models named GLaM, which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
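The efficiency argument rests on sparse routing: a gating network scores all experts per token but activates only the top few, so per-token compute stays roughly constant as total capacity grows. A minimal top-2 gating sketch in plain Python; the renormalization detail is an illustrative simplification of GLaM's actual router:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def top2_route(gate_logits: list[float]) -> list[tuple[int, float]]:
    """Pick the two highest-scoring experts for a token and renormalize
    their gate weights; only these two experts run, so compute per token
    is constant however many experts exist."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    z = sum(probs[i] for i in top)
    return [(i, probs[i] / z) for i in top]

# Gate logits for one token over four experts:
routes = top2_route([0.1, 2.0, -1.0, 1.5])
```

The token's output is the gate-weighted sum of just those two experts' outputs, which is where the training-cost savings over a dense model of equal capacity come from.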

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Surprisingly, large pre-trained language models are able to perform complex multistep computations—even in the few-shot regime—when asked to perform the operation “step by step”, showing the results of intermediate computations.
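What the scratchpad format buys the model is a place to write intermediate state instead of computing everything in one step. A minimal sketch of the kind of working a scratchpad-prompted model emits for multi-digit addition; the step wording is illustrative, not the paper's exact format:

```python
def scratchpad_add(a: int, b: int) -> tuple[int, list[str]]:
    """Add two non-negative integers digit by digit, recording each
    intermediate carry the way a scratchpad would spell out the working."""
    steps, carry, result = [], 0, []
    da, db = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        carry_in = carry
        s = x + y + carry_in
        carry, digit = divmod(s, 10)
        result.append(str(digit))
        steps.append(f"digit {i}: {x} + {y} + {carry_in} = {s}, "
                     f"write {digit}, carry {carry}")
    if carry:
        result.append(str(carry))
    return int("".join(reversed(result))), steps

total, steps = scratchpad_add(57, 68)
```

Each line of the scratchpad is short and locally checkable, which is exactly the property that lets a model carry out long computations it fails at when forced to answer directly.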

Muppet: Massive Multi-task Representations with Pre-Finetuning

It is shown that pre-finetuning consistently improves performance for pretrained discriminators and generation models on a wide range of tasks while also significantly improving sample efficiency during fine-tuning, and that large-scale multi-tasking is crucial.