How many data points is a prompt worth?

@inproceedings{Scao2021HowMD,
  title={How many data points is a prompt worth?},
  author={Teven Le Scao and Alexander M. Rush},
  booktitle={NAACL},
  year={2021}
}
When fine-tuning pretrained models for classification, researchers either use a generic model head or a task-specific prompt for prediction. Proponents of prompting have argued that prompts provide a method for injecting task-specific guidance, which is beneficial in low-data regimes. We aim to quantify this benefit through rigorous testing of prompts in a fair setting: comparing prompted and head-based fine-tuning in equal conditions across many tasks and data sizes. By controlling for many… 
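For concreteness, here is a minimal sketch, not the authors' released code, of the two setups the abstract contrasts: a generic classification head versus a prompt whose verbalizer tokens are scored at a mask position. The Hugging Face backbone, the pattern "It was <mask>.", and the great/terrible verbalizer mapping are illustrative assumptions.

# Minimal sketch (illustrative only): head-based vs. prompt-based classification.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          AutoModelForMaskedLM)

MODEL = "roberta-base"  # assumption: any masked-LM checkpoint would do
tok = AutoTokenizer.from_pretrained(MODEL)

# 1) Head-based fine-tuning: a randomly initialized classification head.
head_model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
enc = tok("The movie was great.", return_tensors="pt")
with torch.no_grad():
    head_logits = head_model(**enc).logits              # shape (1, 2)

# 2) Prompt-based fine-tuning: recast the task as filling a mask and score
#    label words ("verbalizers") at the mask position.
prompt_model = AutoModelForMaskedLM.from_pretrained(MODEL)
text = f"The movie was great. It was {tok.mask_token}." # hypothetical pattern
enc = tok(text, return_tensors="pt")
mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
verbalizers = [" great", " terrible"]                   # hypothetical label words
label_ids = [tok.encode(w, add_special_tokens=False)[0] for w in verbalizers]
with torch.no_grad():
    mlm_logits = prompt_model(**enc).logits[0, mask_pos]  # (1, vocab_size)
prompt_logits = mlm_logits[:, label_ids]                # (1, 2): one score per label

Both variants yield two-way logits that can be trained with the same cross-entropy loss on the same labeled data; the paper's question is how many labeled examples the prompted variant saves over the head-based one.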

Citations

Are Prompt-based Models Clueless?
TLDR
Analyzing few-shot prompt-based models on MNLI, SNLI, HANS, and COPA has revealed that prompt-based models also exploit superficial cues, and while the models perform well on instances with superficial cues, they often underperform or only marginally outperform random accuracy on instances without superficial cues.
Evaluating Prompts Across Multiple Choice Tasks In a Zero-Shot Setting
TLDR
Prompts from a diverse range of tasks are collected and standardized for use with tasks they were not designed for, and evaluated across multiple-choice datasets for a quantitative analysis of how certain attributes of a prompt affect performance.
A Few More Examples May Be Worth Billions of Parameters
TLDR
The dynamics of increasing the number of model parameters versus the number of labeled examples are studied across a wide variety of tasks, and it is hypothesized that, unlike open question answering, solving strategies for tasks with a more restricted output space transfer across examples and can therefore be learned with small amounts of labeled data.
How Many Data Samples is an Additional Instruction Worth?
TLDR
A subset of tasks in the expanded version of NATURAL INSTRUCTIONS is augmented with additional instructions and it is found that these significantly improve model performance, especially in the low-data regime.
Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning
TLDR
It is demonstrated that, despite their advantages in low-data regimes, finetuned prompt-based models for sentence-pair classification tasks still suffer from a common pitfall of adopting inference heuristics based on lexical overlap, and it is shown that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
Do Prompt-Based Models Really Understand the Meaning of their Prompts?
TLDR
It is found that models learn just as fast with many prompts that are intentionally irrelevant or even pathologically misleading as they do with instructively “good” prompts, and that instruction-tuned models often produce good predictions with irrelevant and misleading prompts even in the zero-shot setting.
Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models
TLDR
This work shows that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering, and recommends finetuned LMs for few-shot learning as they are more accurate, robust to different prompts, and can be made nearly as efficient as using frozen LMs.
Towards Unified Prompt Tuning for Few-shot Text Classification
TLDR
A novel paradigm Prompt-Options-Verbalizer is proposed for joint prompt learning across different NLP tasks, forcing PLMs to capture task-invariant prompting knowledge, and a self-supervised task named Knowledge-enhanced Selective Masked Language Modeling is designed to improve the PLM’s generalization abilities for accurate adaptation to previously unseen tasks.
Automatic Multi-Label Prompting: Simple and Interpretable Few-Shot Classification
TLDR
This paper proposes Automatic Multi-Label Prompting (AMuLaP), a simple yet effective method to automatically select label mappings for few-shot text classification with prompting that achieves competitive performance on the GLUE benchmark without human effort or external resources (a rough sketch of this label-mapping idea follows this list).
ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization
TLDR
The results show that task scaling can substantially improve training efficiency, by a factor of 30 in FLOPs, and a prompting method that incorporates a genetic algorithm to automatically search for the best prompt for unseen tasks is proposed, along with a few other improvements.
…
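As a rough illustration of the automatic label-mapping idea summarized in the AMuLaP entry above (a hypothetical sketch, not that paper's released implementation): for each class, score vocabulary tokens by their masked-LM probability at the prompt's mask position, averaged over that class's few labeled examples, and keep the top-k as its label words. The pattern, the scoring rule, and all names below are assumptions.

from collections import defaultdict
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "roberta-base"       # assumption: any masked-LM checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL)

def select_label_words(few_shot, k=3):
    """few_shot: list of (text, label) pairs; returns {label: top-k token ids}."""
    scores = defaultdict(lambda: torch.zeros(mlm.config.vocab_size))
    counts = defaultdict(int)
    for text, label in few_shot:
        prompt = f"{text} It was {tok.mask_token}."          # hypothetical pattern
        enc = tok(prompt, return_tensors="pt")
        mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
        with torch.no_grad():
            probs = mlm(**enc).logits[0, mask_pos[0]].softmax(-1)
        scores[label] += probs                               # accumulate per class
        counts[label] += 1
    return {label: torch.topk(scores[label] / counts[label], k).indices.tolist()
            for label in scores}

# Usage: inspect which tokens the LM itself associates with each class.
mapping = select_label_words([("The movie was great.", "pos"),
                              ("The plot was dull.", "neg")], k=3)
print({lab: tok.convert_ids_to_tokens(ids) for lab, ids in mapping.items()})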

References

Showing 1–10 of 32 references
On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
TLDR
This paper analyzes BERT, RoBERTa, and ALBERT, fine-tuned on three commonly used datasets from the GLUE benchmark and shows that the observed instability is caused by optimization difficulties that lead to vanishing gradients.
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
TLDR
This work investigates how the performance of the best-found model varies as a function of the number of fine-tuning trials, and examines two factors influenced by the choice of random seed: weight initialization and training data order.
It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
TLDR
This work shows that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller, and identifies key factors required for successful natural language understanding with small language models.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR
It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
Revisiting Few-sample BERT Fine-tuning
TLDR
It is found that parts of the BERT network provide a detrimental starting point for fine-tuning, and simply re-initializing these layers speeds up learning and improves performance.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
TLDR
A new benchmark styled after GLUE is presented, with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
TLDR
It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.
Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models
TLDR
It is demonstrated that the stability of finetuning and the average accuracy greatly increase when the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE is used.
Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference
TLDR
This work introduces Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task.
…