Learning from Task Descriptions

Orion Weller, Nicholas Lourie, Matt Gardner, Matthew E. Peters
Typically, machine learning systems solve new tasks by training on thousands of examples. In contrast, humans can solve new tasks by reading some instructions, with perhaps an example or two. To take a step toward closing this gap, we introduce a framework for developing NLP systems that solve new tasks after reading their descriptions, synthesizing prior work in this area. We instantiate this framework with a new English language dataset, ZEST, structured for task-oriented evaluation on unseen… 
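The framework above evaluates a model on a task it never saw during training, given only that task's natural-language description and an input instance. A minimal sketch of the prompt-construction step is below; the `format_zest_example` helper, the field labels, and the example park-rules task are illustrative assumptions, not the dataset's actual serialization format.

```python
def format_zest_example(description: str, context: str) -> str:
    """Build a single text-to-text input: the unseen task's natural-language
    description followed by the instance the model must apply it to.
    (Hypothetical format, for illustration only.)"""
    return f"zero-shot task: {description}\ncontext: {context}"

# Hypothetical ZEST-style task: one description paired with many contexts,
# so the model is scored on whether it solved the *task*, not one instance.
description = "Are dogs allowed at this national park?"
contexts = [
    "Pets are permitted on paved roads but not on trails.",
    "No animals of any kind may enter the preserve.",
]
prompts = [format_zest_example(description, c) for c in contexts]
```

Each prompt would then be fed to a text-to-text model (e.g. a sequence-to-sequence transformer), and the model is credited with the task only if it answers correctly across the task's instances.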


Learning to Generate Task-Specific Adapters from Task Description

Hypter is introduced, a framework that improves text-to-text transformer’s generalization ability to unseen tasks by training a hypernetwork to generate task-specific, light-weight adapters from task descriptions.

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

This work introduces NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances, and adopts generative pre-trained language models to encode task-specific instructions along with the input and generate the task output.

One-Shot Learning from a Demonstration with Hierarchical Latent Language

This work proposes a neural agent infused with hierarchical latent language, at both the level of task inference and subtask planning, that can infer unseen task-performing procedures and generalize their execution to other contexts.

Natural Instructions: Benchmarking Generalization to New Tasks from Natural Language Instructions

This work uses existing NLP datasets and the instructions used to crowdsource them to create NATURAL INSTRUCTIONS, a dataset of instructions and task-specific input/output data; results indicate that existing models indeed benefit from instructions and hence show improved generalization to new tasks.

What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment

This work uses the task of deciding whether a given string matches a regular expression to identify properties of tasks, instructions, and instances that make instruction learning challenging, and proposes Hard RegSet as a challenging instruction learning task and a controlled environment for studying instruction learning.

Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections

Meta-tuning is proposed, which directly optimizes the zero-shot learning objective by fine-tuning pre-trained language models on a collection of datasets, built by aggregating 43 existing datasets and annotating 441 label descriptions in a question-answering (QA) format.

How Many Data Samples is an Additional Instruction Worth?

A subset of tasks in the expanded version of NATURAL INSTRUCTIONS is augmented with additional instructions and it is found that these significantly improve model performance, especially in the low-data regime.

Zero-shot Learning by Generating Task-specific Adapters

HYPTER is introduced, a framework that improves zero-shot transferability by training a hypernetwork to generate task-specific adapters from task descriptions, and greatly reduces the number of parameters by using light-weight adapters.

Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)

A flexible and unified text-to-text paradigm called “Pretrain, Personalized Prompt, and Predict Paradigm” (P5) for recommendation, which unifies various recommendation tasks in a shared framework and moves the technical form of recommender systems towards a universal recommendation engine.

Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks

This work introduces NATURAL INSTRUCTIONS v2, a collection of 1,600+ diverse language tasks and their expert-written instructions, covering 70+ distinct task types such as tagging, in-filling, and rewriting.



GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Neural Module Networks for Reasoning over Text

This work extends neural module networks by introducing modules that reason over a paragraph of text, performing symbolic reasoning over numbers and dates in a probabilistic and differentiable manner, and proposing an unsupervised auxiliary loss to help extract arguments associated with the events in the text.

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Evaluating NLP Models via Contrast Sets

A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.

Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences

The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that requires reasoning skills, and finds human solvers to achieve an F1-score of 88.1%.

Zero-Shot Relation Extraction via Reading Comprehension

It is shown that relation extraction can be reduced to answering simple reading comprehension questions, by associating one or more natural-language questions with each relation slot, and that zero-shot generalization to unseen relation types is possible, at lower accuracy levels.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

Annotation Artifacts in Natural Language Inference Data

It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.