Corpus ID: 235899116

FLEX: Unifying Evaluation for Few-Shot NLP

@article{Bragg2021FLEXUE,
  title={FLEX: Unifying Evaluation for Few-Shot NLP},
  author={Jonathan Bragg and Arman Cohan and Kyle Lo and Iz Beltagy},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.07170}
}
Few-shot NLP research is highly active, yet conducted in disjoint research threads with evaluation suites that lack challenging-yet-realistic testing setups and fail to employ careful experimental design. Consequently, the community does not know which techniques perform best or even if they outperform simple baselines. In response, we formulate the FLEX Principles, a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation. These… 
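
The "careful experimental design" the abstract calls for largely comes down to scoring a method over many sampled few-shot episodes and reporting a confidence interval, rather than relying on a single train/test split. Below is a minimal Python sketch of that idea; `evaluate_episode` is a hypothetical callable standing in for any few-shot method, and nothing here reflects the actual FLEX toolkit API.

import random
import statistics
from typing import Callable, Sequence, Tuple

def episodic_eval(
    dataset: Sequence[Tuple[str, str]],               # (text, label) pairs
    evaluate_episode: Callable[[list, list], float],  # returns episode accuracy
    num_episodes: int = 50,
    shots: int = 5,
    test_size: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Sample few-shot episodes; return mean accuracy and a 95% CI half-width."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_episodes):
        # Each episode draws a fresh tiny training set and a disjoint test set.
        pool = list(dataset)
        rng.shuffle(pool)
        train, test = pool[:shots], pool[shots:shots + test_size]
        scores.append(evaluate_episode(train, test))
    mean = statistics.mean(scores)
    # Normal-approximation interval over episode scores; with ~50 episodes
    # this is close to the t-based interval.
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width

Reporting the interval alongside the mean makes small gaps between methods easier to judge, which speaks directly to the abstract's complaint that the community cannot tell whether techniques beat simple baselines.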

Citations

FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding
TLDR
An evaluation framework is introduced that improves previous evaluation procedures in three key aspects (test performance, dev-test correlation, and stability) and reveals new insights about few-shot natural language understanding methods.
True Few-Shot Learning with Prompts -- A Real-World Perspective
TLDR
An extensive study of PET, a method that combines textual instructions with example-based fine-tuning, shows that, if correctly configured, PET performs strongly in a true few-shot setting, i.e., without a dev set.
CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP
TLDR
This paper introduces CROSSFIT, a task setup for studying cross-task few-shot learning ability that standardizes seen/unseen task splits, data access during different learning stages, and evaluation protocols, and presents the NLP Few-shot Gym, a repository of 160 few-shot tasks covering diverse task categories and applications, all converted to a unified text-to-text format.
PPT: Pre-trained Prompt Tuning for Few-shot Learning
TLDR
It is found that prompt tuning performs comparably to conventional full-model fine-tuning when downstream data are sufficient, but performs much worse in few-shot settings, which may hinder its application in practice.
Few-Shot Self-Rationalization with Natural Language Prompts
TLDR
This work presents FEB, a standardized collection of four existing English-language datasets and associated metrics, identifies the right prompting approach by extensively exploring natural language prompts on FEB, and demonstrates that progress on few-shot self-rationalization is possible.
Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections
TLDR
Meta-tuning is proposed: it directly optimizes the zero-shot learning objective by fine-tuning pre-trained language models on a collection of datasets, built by aggregating 43 existing datasets and annotating 441 label descriptions in a question-answering (QA) format.
PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
TLDR
A pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of labeled fine-tuning data, and outperforms current state-of-the-art models in most of these settings by large margins.
AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing
TLDR
This comprehensive survey paper explains core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods; presents a new taxonomy of T-PTLMs; and gives a brief overview of benchmarks, both intrinsic and extrinsic.
Research Statement: Climbing the Generality Ladder in NLP
I am broadly interested in the computational foundations of intelligent behavior through the lens of natural language. The overarching theme of my research centers on developing algorithms…
RAFT: A Real-World Few-Shot Text Classification Benchmark
TLDR
The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment, revealing areas where current techniques struggle: reasoning over long texts and tasks with many classes.

References

Showing 1-10 of 89 references
Making Pre-trained Language Models Better Few-shot Learners
TLDR
The LM-BFF approach makes minimal assumptions about task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.
FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding
TLDR
An evaluation framework is introduced that improves previous evaluation procedures in three key aspects (test performance, dev-test correlation, and stability) and reveals new insights about few-shot natural language understanding methods.
CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP
TLDR
This paper introduces CROSSFIT, a task setup for studying cross-task few-shot learning ability that standardizes seen/unseen task splits, data access during different learning stages, and evaluation protocols, and presents the NLP Few-shot Gym, a repository of 160 few-shot tasks covering diverse task categories and applications, all converted to a unified text-to-text format.
Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models
TLDR
This work shows that fine-tuning LMs in the few-shot setting can considerably reduce the need for prompt engineering, and recommends fine-tuned LMs for few-shot learning, as they are more accurate, more robust to different prompts, and can be made nearly as efficient as frozen LMs.
Meta-learning for Few-shot Natural Language Processing: A Survey
TLDR
This paper provides clearer definitions, a progress summary, and common datasets for applying meta-learning to the few-shot NLP domain, especially few-shot applications.
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
TLDR
This work uses the generative nature of language models to construct an artificial development set; based on entropy statistics of candidate prompt permutations over this set, the authors identify performant prompts, improving GPT-family models by 13% relative on average across eleven established text classification tasks.
True Few-Shot Learning with Language Models
TLDR
This work evaluates the few-shot ability of LMs when such held-out examples are unavailable, a setting the authors call true few-shot learning, and suggests that prior work significantly overestimated the true few-shot ability of LMs given the difficulty of few-shot model selection.
Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections
TLDR
Meta-tuning is proposed: it directly optimizes the zero-shot learning objective by fine-tuning pre-trained language models on a collection of datasets, built by aggregating 43 existing datasets and annotating 441 label descriptions in a question-answering (QA) format.
Learning to Classify Intents and Slot Labels Given a Handful of Examples
TLDR
A new few-shot learning task is proposed to study and improve the performance of intent classification (IC) and slot filling (SF) models on classes not seen at training time in ultra-low-resource scenarios; joint training and the use of pre-trained language models are demonstrated to be complementary to these few-shot learning methods and to yield further gains.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
TLDR
It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that, surprisingly, such transfer remains very beneficial even when starting from massive pre-trained language models such as BERT.