Corpus ID: 238215290

RAFT: A Real-World Few-Shot Text Classification Benchmark

@article{Alex2021RAFTAR,
  title={RAFT: A Real-World Few-Shot Text Classification Benchmark},
  author={Neel Alex and Eli Lifland and Lewis Tunstall and Abhishek Thakur and Pegah Maham and C. Jess Riedel and Emmie Hine and Carolyn Ashurst and Paul Sedille and Alexis Carlier and Michael Noetel and Andreas Stuhlm{\"u}ller},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.14076}
}
Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that…
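As a concrete illustration of the few-shot setup the abstract describes, the sketch below builds a prompt from a handful of labeled examples and leaves a language model to complete the label for a new input. The task, labels, and examples are hypothetical, not drawn from RAFT itself:

# Minimal sketch of few-shot in-context classification: K labeled examples
# plus one unlabeled input form a prompt; the model's completion is the
# predicted label. Task and examples here are hypothetical.

labeled_examples = [
    ("The product arrived broken and support never replied.", "negative"),
    ("Setup took two minutes and it works perfectly.", "positive"),
]
test_input = "Battery life is far worse than advertised."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in labeled_examples:
    prompt += f"Review: {text}\nLabel: {label}\n\n"
prompt += f"Review: {test_input}\nLabel:"

print(prompt)  # feed this to any text-completion model; its next token is the prediction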

Citations

True Few-Shot Learning with Prompts—A Real-World Perspective
TLDR
An extensive study of PET, a method that combines textual instructions with example-based finetuning, shows that, if correctly configured, PET performs strongly in true few-shot settings without a dev set, underpinning the belief that learning from instructions will play an important role on the path towards human-like few-shot learning capabilities.
FLEX: Unifying Evaluation for Few-Shot NLP
TLDR
The FLEX Principles are formulated: a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation, including Sample Size Design, a novel approach to benchmark design that optimizes statistical accuracy and precision while keeping evaluation costs manageable.
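To make the sample-size trade-off concrete: the uncertainty of a measured accuracy shrinks only with the square root of the number of test examples, so halving the confidence-interval width quadruples evaluation cost. A minimal sketch of this standard binomial approximation (not FLEX's actual procedure):

# 95% confidence-interval half-width of an estimated accuracy shrinks with
# sqrt(n): doubling precision quadruples evaluation cost. Textbook binomial
# approximation, shown for accuracy 0.8.
import math

def ci_half_width(acc: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width for accuracy `acc` on n test examples."""
    return z * math.sqrt(acc * (1 - acc) / n)

for n in (100, 400, 1600):
    print(n, round(ci_half_width(0.8, n), 3))  # 0.078, 0.039, 0.020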
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
TLDR
A new parameter-efficient fine-tuning method called (IA)³ that scales activations by learned vectors, attaining stronger performance while introducing only a relatively tiny number of new parameters.
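The core of (IA)³ can be sketched in a few lines: a pretrained layer is frozen and its activations are rescaled element-wise by a small learned vector, so the only trainable parameters are that vector. A minimal PyTorch sketch of the idea, covering one layer only (where the vectors sit inside attention and feed-forward blocks follows the paper):

# Sketch of the core (IA)^3 idea: freeze a pretrained linear layer and
# rescale its activations element-wise with a small learned vector, so the
# only new parameters are `scale` (one value per hidden unit).
import torch
import torch.nn as nn

class IA3Scaled(nn.Module):
    def __init__(self, frozen_linear: nn.Linear):
        super().__init__()
        self.linear = frozen_linear
        for p in self.linear.parameters():
            p.requires_grad = False                     # pretrained weights stay fixed
        self.scale = nn.Parameter(torch.ones(frozen_linear.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) * self.scale              # learned element-wise rescaling

layer = IA3Scaled(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 768 trainable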
FewJoint: few-shot learning for joint dialogue understanding
TLDR
This paper introduces FewJoint, the first FSL benchmark for joint dialogue understanding; it guides slot filling with explicit intent information and proposes a novel trust gating mechanism that blocks low-confidence intent information to ensure high-quality sharing.
Predictability and Surprise in Large Generative Models
TLDR
This paper highlights a counterintuitive property of large-scale generative models: a paradoxical combination of predictable loss on a broad training distribution and unpredictable specific capabilities, inputs, and outputs. It analyzes how these conflicting properties combine to give model developers various motivations for deploying these models, as well as challenges that can hinder deployment.

References

SHOWING 1-10 OF 53 REFERENCES
Language Models are Few-Shot Learners
TLDR
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Making Pre-trained Language Models Better Few-shot Learners
TLDR
The LM-BFF approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
True Few-Shot Learning with Language Models
TLDR
This work evaluates the few-shot ability of LMs when such held-out examples are unavailable, a setting the authors call true few-shot learning, and suggests that prior work significantly overestimated the true few-shot ability of LMs given the difficulty of few-shot model selection.
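What "true few-shot" model selection amounts to in practice: with no dev set, any choice among candidate prompts or hyperparameters must be made from the K labeled examples alone, for instance by leave-one-out cross-validation. A hypothetical sketch (the placeholder predictor stands in for an actual language-model call and is not from the paper):

# With no held-out dev set, score candidates by leave-one-out
# cross-validation over the K labeled examples themselves.
from collections import Counter

def predict(train_examples, test_text):
    # placeholder model: majority training label (a real setup would prompt an LM)
    return Counter(lbl for _, lbl in train_examples).most_common(1)[0][0]

def loo_accuracy(examples):
    hits = 0
    for i, (text, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        hits += predict(rest, text) == label
    return hits / len(examples)

examples = [("great", "pos"), ("awful", "neg"), ("fine", "pos"), ("bad", "neg")]
print(loo_accuracy(examples))  # selection = argmax of this score across candidates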
Efficient Intent Detection with Dual Sentence Encoders
TLDR
The usefulness and wide applicability of the proposed intent detectors are demonstrated, showing that they outperform intent detectors based on fine-tuning the full BERT-Large model or using BERT as a fixed black-box encoder on three diverse intent detection data sets.
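The fixed-encoder recipe described here can be sketched as embedding utterances with a pretrained sentence encoder and training a light classifier on top. The sketch below assumes the sentence-transformers and scikit-learn libraries; the model name is illustrative, and the paper's dual encoders (e.g. USE, ConveRT) are not this exact model:

# Fixed-encoder intent detection: embed utterances with a frozen sentence
# encoder, then fit a lightweight classifier on the embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice
texts = ["book a table for two", "what's my balance", "reserve dinner at 7"]
labels = ["booking", "banking", "booking"]

clf = LogisticRegression(max_iter=1000).fit(encoder.encode(texts), labels)
print(clf.predict(encoder.encode(["transfer money to savings"])))  # expected: 'banking'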
FLEX: Unifying Evaluation for Few-Shot NLP
TLDR
The FLEX Principles are formulated: a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation, including Sample Size Design, a novel approach to benchmark design that optimizes statistical accuracy and precision while keeping evaluation costs manageable.
Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference
TLDR
This work introduces Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task.
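PET's core move can be sketched as rewriting the input as a cloze phrase and mapping label words back to classes through a verbalizer. The pattern and verbalizer below are illustrative rather than PET's actual ones, and the example assumes the transformers library:

# Cloze reformulation: score label words in a masked position and map them
# back to classes with a verbalizer.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")

text = "Best pizza I've had in years."
pattern = f"{text} It was <mask>."                       # cloze-style reformulation
verbalizer = {"great": "positive", "terrible": "negative"}

preds = fill(pattern, targets=list(verbalizer))          # score only the label words
best = max(preds, key=lambda p: p["score"])
print(verbalizer[best["token_str"].strip()])             # expected: "positive"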
CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP
TLDR
This paper presents the NLP Few-shot Gym, a repository of 160 diverse few-shot NLP tasks created from open-access NLP datasets and converted to a unified text-to-text format, and reveals that the few-shot learning ability on unseen tasks can be improved via an upstream learning stage using a set of seen tasks.
Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach
TLDR
This work benchmarks the 0Shot-TC problem by providing unified datasets, standardized evaluations, and state-of-the-art baselines, unifying 0Shot-TC across diverse aspects within a textual entailment formulation and studying it in this way.
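The entailment formulation is also what the transformers zero-shot-classification pipeline implements: each candidate label becomes a hypothesis scored against the input text as the premise. A short example (model choice illustrative):

# Entailment-based zero-shot classification: premise = input text,
# hypothesis = "This example is about <label>." for each candidate label.
from transformers import pipeline

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = clf(
    "The team announced a new quarterback ahead of Sunday's game.",
    candidate_labels=["sports", "politics", "technology"],
)
print(result["labels"][0])  # expected: "sports"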
Natural Instructions: Benchmarking Generalization to New Tasks from Natural Language Instructions
TLDR
This work uses existing NLP datasets and the instructions used to crowdsource them to create Natural Instructions, a dataset of instructions and task-specific input/output data; experiments indicate that existing models indeed benefit from instructions and hence show improved generalization to new tasks.
...