Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks

@article{Wang2022BenchmarkingGV,
  title={Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks},
  author={Yizhong Wang and Swaroop Mishra and Pegah Alipoormolabashi and Yeganeh Kordi and Amirreza Mirzaei and A. Arunkumar and Arjun Ashok and Arut Selvan Dhanasekaran and Atharva Naik and David Stap and Eshaan Pathak and Giannis Karamanolakis and Haizhi Gary Lai and Ishan Purohit and Ishani Mondal and Jacob Anderson and Kirby Kuznia and Krima Doshi and Maitreya Patel and Kuntal Kumar Pal and M. Moradshahi and Mihir Parmar and Mirali Purohit and Neeraj Varshney and Phani Rohitha Kaza and Pulkit Verma and Ravsehaj Singh Puri and Rushang Karia and Shailaja Keyur Sampat and Savan Doshi and Siddharth Deepak Mishra and Sujan C. Reddy and Sumanta Patro and Tanay Dixit and Xudong Shen and Chitta Baral and Yejin Choi and Hannaneh Hajishirzi and Noah A. Smith and Daniel Khashabi},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.07705}
}
How can we measure the generalization of models to a variety of unseen tasks when provided with their language instructions? To facilitate progress in this goal, we introduce NATURAL-INSTRUCTIONS v2, a collection of 1,600+ diverse language tasks and their expert-written instructions. More importantly, the benchmark covers 70+ distinct task types, such as tagging, in-filling, and rewriting. This benchmark is collected with contributions
Instruction Induction: From Few Examples to Natural Language Task Descriptions
TLDR
It is discovered that, to a large extent, the ability to generate instructions does indeed emerge when using a model that is both large enough and aligned to follow instructions; this surprising result suggests that instruction induction might be a viable learning paradigm in and of itself.
DIRECTOR: Generator-Classifiers For Supervised Language Modeling
TLDR
A new architecture, DIRECTOR, is introduced. It consists of a unified generator-classifier with both a language modeling and a classification head for each output token, and it outperforms existing model guiding approaches in terms of both accuracy and efficiency.
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
TLDR
A new parameter-efficient fine-tuning method called (IA)³ is introduced that scales activations by learned vectors, attaining stronger performance while introducing only a relatively tiny amount of new parameters.
Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
TLDR
This work introduces INSTRUCTDIAL, an instruction tuning framework for dialogue, which consists of a repository of 48 diverse dialogue tasks in a unified text-to-text format created from 59 openly available dialogue datasets, and reveals that it enables good zero-shot performance on unseen datasets and tasks such as dialogue evaluation and intent detection, and even better performance in a few-shot setting.
Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions
TLDR
This work hypothesizes that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write similar examples that are then over-represented in the collected data, and studies this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns.
RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning
TLDR
RLPROMPT formulates a parameter-efficient policy network that generates the desired discrete prompt after training with reward. To overcome the complexity and stochasticity of reward signals from the large LM environment, it incorporates effective reward stabilization that substantially enhances the training efficiency.
Chain of Thought Prompting Elicits Reasoning in Large Language Models
TLDR
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.

References

Learning from Task Descriptions
TLDR
This work introduces a framework for developing NLP systems that solve new tasks after reading their descriptions, synthesizing prior work in this area, and instantiates it with a new English language dataset, ZEST, structured for task-oriented evaluation on unseen tasks.
Cross-Task Generalization via Natural Language Crowdsourcing Instructions
TLDR
This work introduces NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances, and adopts generative pre-trained language models to encode task-specific instructions along with input and generate task output.
Finetuned Language Models Are Zero-Shot Learners
TLDR
It is shown that instruction tuning (finetuning language models on a collection of datasets described via instructions) substantially improves zero-shot performance on unseen tasks and outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.
Learning to Generate Task-Specific Adapters from Task Description
TLDR
Hypter is introduced, a framework that improves text-to-text transformer’s generalization ability to unseen tasks by training a hypernetwork to generate task-specific, light-weight adapters from task descriptions.
Can language models learn from explanations in context?
TLDR
This work investigates whether explanations of few-shot examples can allow language models to adapt more effectively, and shows that explanations can support the in-context learning abilities of large language models on challenging tasks.
Training language models to follow instructions with human feedback
TLDR
The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, with improvements in truthfulness and reductions in toxic output generation and only minimal performance regressions on public NLP datasets.
A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks
TLDR
A joint many-task model is introduced together with a strategy for successively growing its depth to solve increasingly complex tasks. It uses a simple regularization term to allow for optimizing all model weights to improve one task’s loss without exhibiting catastrophic interference of the other tasks.
MetaICL: Learning to Learn In Context
We introduce MetaICL (Meta-training for In-Context Learning), a new meta-training framework for few-shot learning where a pretrained language model is tuned to do in-context learning on a large set
One-Shot Learning from a Demonstration with Hierarchical Latent Language
TLDR
This work proposes a neural agent infused with hierarchical latent language, both at the level of task inference and subtask planning, that is able to generalize unseen task-performing procedures and generalize their execution to other contexts.
Transformers: State-of-the-Art Natural Language Processing
TLDR
Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.