Corpus ID: 238215290

RAFT: A Real-World Few-Shot Text Classification Benchmark

@article{Alex2021RAFTAR,
  title={RAFT: A Real-World Few-Shot Text Classification Benchmark},
  author={Neel Alex and Eli Lifland and Lewis Tunstall and Abhishek Thakur and Pegah Maham and C. Jess Riedel and Emmie Hine and Carolyn Ashurst and Paul Sedille and Alexis Carlier and Michael Noetel and Andreas Stuhlm{\"u}ller},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.14076}
}
Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don’t directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that…


True Few-Shot Learning with Prompts—A Real-World Perspective

TLDR
An extensive study of Pet, a method that combines textual instructions with example-based finetuning, shows that, if correctly configured, Pet performs strongly in true few-shot settings without a dev set, underpinning the belief that learning from instructions will play an important role on the path towards human-like few-shot learning capabilities.

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

TLDR
A new parameter-efficient fine-tuning method called (IA)³ that scales activations by learned vectors, attaining stronger performance while introducing only a relatively tiny number of new parameters.
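The core mechanism described here, rescaling a frozen layer's activations elementwise by a small learned vector, can be sketched in a few lines. This is a minimal NumPy illustration; the function and variable names are assumptions for exposition, not the paper's code.

```python
import numpy as np

def ia3_scale(activations, learned_vector):
    """Elementwise rescaling of activations by a learned vector.

    activations:    (batch, seq_len, hidden) output of a frozen layer
    learned_vector: (hidden,) trainable parameters, initialized to ones
                    so the model starts as an identity modification
    """
    # Broadcasting multiplies every (batch, seq_len) position by the
    # same length-`hidden` vector -- the only trainable parameters here.
    return activations * learned_vector

# Toy usage: with the vector initialized to ones, activations pass through unchanged.
acts = np.ones((2, 3, 4))
vec = np.ones(4)
assert np.allclose(ia3_scale(acts, vec), acts)
```

The appeal is the parameter count: one vector per rescaled activation, orders of magnitude fewer trainable values than full fine-tuning.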

FewJoint: few-shot learning for joint dialogue understanding

TLDR
This paper introduces FewJoint, the first FSL benchmark for joint dialogue understanding; it guides slot filling with explicit intent information and proposes a novel trust gating mechanism that blocks low-confidence intent information to ensure high-quality sharing.

HFGNN-Proto: Hesitant Fuzzy Graph Neural Network-Based Prototypical Network for Few-Shot Text Classification

TLDR
This paper proposes a novel hesitant fuzzy graph neural network (HFGNN) model that explores the multi-attribute relations between samples and combines HFGNN with the Prototypical Network to achieve few-shot text classification.

Predictability and Surprise in Large Generative Models

TLDR
This paper highlights a counterintuitive property of large-scale generative models, which exhibit a paradoxical combination of predictable loss on a broad training distribution and unpredictable specific capabilities, inputs, and outputs, and analyzes how these conflicting properties give model developers various motivations for deploying these models, as well as challenges that can hinder deployment.

Efficient Few-Shot Learning Without Prompts

TLDR
This work proposes SetFit, an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST), which achieves high accuracy with orders of magnitude fewer parameters than existing techniques.
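SetFit's first stage fine-tunes the sentence encoder contrastively on pairs built from the handful of labeled examples. A minimal sketch of that pair generation, assuming a simple same-class/different-class labeling scheme (the function name and data layout are illustrative, not the library's API):

```python
from itertools import combinations

def make_contrastive_pairs(examples):
    """Build (sentence_a, sentence_b, label) training pairs from few-shot data.

    examples: list of (text, class_label) tuples.
    Pairs from the same class get label 1 (pull embeddings together);
    pairs from different classes get label 0 (push them apart).
    """
    return [
        (text_a, text_b, 1 if cls_a == cls_b else 0)
        for (text_a, cls_a), (text_b, cls_b) in combinations(examples, 2)
    ]

few_shot = [("great movie", "pos"), ("loved it", "pos"), ("terrible", "neg")]
pairs = make_contrastive_pairs(few_shot)  # C(3, 2) = 3 pairs
```

A lightweight classification head is then trained on the adapted embeddings, which is what makes the approach prompt-free.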

FLEX: Unifying Evaluation for Few-Shot NLP

TLDR
This work formulates the FLEX Principles, a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation, which include Sample Size Design, a novel approach to benchmark design that optimizes statistical accuracy and precision while keeping evaluation costs manageable.

References

Showing 1–10 of 46 references

Making Pre-trained Language Models Better Few-shot Learners

TLDR
The LM-BFF approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.

True Few-Shot Learning with Language Models

TLDR
This work evaluates the few-shot ability of LMs when such held-out examples are unavailable, a setting the authors call true few-shot learning, and suggests that prior work significantly overestimated the true few-shot ability of LMs given the difficulty of few-shot model selection.

Efficient Intent Detection with Dual Sentence Encoders

TLDR
The usefulness and wide applicability of the proposed intent detectors are demonstrated, showing that they outperform intent detectors based on fine-tuning the full BERT-Large model or using BERT as a fixed black-box encoder on three diverse intent detection data sets.

Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference

TLDR
This work introduces Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task.
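The cloze-style reformulation at the heart of PET can be illustrated with a toy pattern and verbalizer. The specific pattern, verbalizer mapping, and names below are assumptions chosen for illustration, not PET's actual implementation:

```python
def to_cloze(text, mask_token="[MASK]"):
    """Reformulate a classification input as a cloze-style phrase.

    A masked language model fills in the blank, and a verbalizer maps
    the predicted token back to a class label.
    """
    return f"{text} It was {mask_token}."

# Verbalizer: which filled-in token corresponds to which class label.
VERBALIZER = {"great": "positive", "terrible": "negative"}

prompt = to_cloze("Best pizza in town!")
# The model's choice between "great" and "terrible" at the mask
# position becomes the sentiment prediction.
```

The point of the reformulation is that the classification task now looks like the masked-language-modeling objective the model was pre-trained on, which is what makes it effective with few labeled examples.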

CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP

TLDR
This paper presents the NLP Few-shot Gym, a repository of 160 diverse few-shot NLP tasks created from open-access NLP datasets and converted to a unified text-to-text format, and reveals that the few-shot learning ability on unseen tasks can be improved via an upstream learning stage using a set of seen tasks.

Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach

TLDR
This work benchmarks the 0Shot-TC problem by providing unified datasets, standardized evaluations, and state-of-the-art baselines, and unifies 0Shot-TC across diverse aspects within a textual entailment formulation, studying it in that framing.

Natural Instructions: Benchmarking Generalization to New Tasks from Natural Language Instructions

TLDR
This work uses existing NLP datasets and the instructions used to crowdsource them to create Natural Instructions, a dataset of instructions and task-specific input/output data, and indicates that existing models indeed benefit from instructions and hence show improved generalization to new tasks.

What Makes Good In-Context Examples for GPT-3?

TLDR
This work proposes to retrieve examples that are semantically-similar to a test query sample to formulate its corresponding prompt, and evaluates the proposed approach on several natural language understanding and generation benchmarks, where the retrieval-based prompt selection approach consistently outperforms the random selection baseline.
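The retrieval-based prompt selection described here, choosing demonstration examples most similar to the test query, can be sketched with cosine similarity over precomputed embeddings. This is a minimal NumPy illustration; the embedding source and all names are assumptions (the paper retrieves with a trained sentence encoder):

```python
import numpy as np

def retrieve_in_context_examples(query_emb, pool_embs, pool_texts, k=2):
    """Return the k pool examples most cosine-similar to the query embedding.

    query_emb:  (d,) embedding of the test query
    pool_embs:  (n, d) embeddings of the candidate demonstration examples
    pool_texts: list of n example strings, aligned with pool_embs
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity to each candidate
    top = np.argsort(-sims)[:k]      # indices of the k nearest examples
    return [pool_texts[i] for i in top]
```

The retrieved examples are then concatenated ahead of the query to form the in-context prompt, replacing random demonstration selection.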

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

TLDR
This work shows that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller, and identifies key factors required for successful natural language understanding with small language models.