Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity

@inproceedings{Lu2022FantasticallyOP,
  title={Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity},
  author={Yao Lu and Max Bartolo and Alastair Moore and Sebastian Riedel and Pontus Stenetorp},
  booktitle={ACL},
  year={2022}
}
When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are “fantastic” and some not. We analyse this phenomenon in detail, establishing that: it is present… 
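The order sensitivity the abstract describes can be made concrete with a small sketch. Everything below is illustrative, not the paper's code: the sample texts, labels, and `build_prompt` helper are hypothetical, and a real study would score each ordering with a language model.

```python
from itertools import permutations

# Hypothetical few-shot training samples (texts and labels are illustrative).
samples = [
    ("The movie was wonderful.", "positive"),
    ("A total waste of time.", "negative"),
    ("I would watch it again.", "positive"),
    ("The plot made no sense.", "negative"),
]

def build_prompt(ordered_samples, query):
    """Concatenate in-context examples in the given order, then the query."""
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in ordered_samples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

# Every permutation of the same 4 samples yields a distinct prompt; the paper
# shows such orderings can swing accuracy from near state-of-the-art to chance.
prompts = [build_prompt(order, "An instant classic.") for order in permutations(samples)]
print(len(prompts))       # 4! = 24 candidate orderings
print(len(set(prompts)))  # all 24 prompts differ only in example order
```

The point of the sketch is only that the search space over orderings grows factorially while the underlying sample set stays fixed, which is why the paper treats ordering as a selection problem in its own right.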
True Few-Shot Learning with Prompts—A Real-World Perspective
TLDR
An extensive study of PET, a method that combines textual instructions with example-based finetuning, shows that, if correctly configured, PET performs strongly in a true few-shot setting, i.e., without a dev set.
Prompt-free and Efficient Few-shot Learning with Language Models
TLDR
Experiments demonstrate that Perfect, a simple and efficient method for few-shot fine-tuning of PLMs without relying on any such handcrafting, also outperforms existing state-of-the-art few-shot learning methods.
GPT-3 for Few-Shot Dialogue State Tracking
TLDR
It is found that natural language instructions in the prompt have little impact on performance, larger language models do not always induce higher downstream performance and that GPT-3 is highly sensitive to the order and number of the in-context examples.
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
TLDR
It is shown that ground truth demonstrations are in fact not required and other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of the label space, the distribution of the input text, and the overall format of the sequence.
Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models
TLDR
This work shows that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering, and recommends finetuned LMs for few-shot learning as it is more accurate, robust to different prompts, and can be made nearly as efficient as using frozen LMs.
Co-training Improves Prompt-based Learning for Large Language Models
TLDR
It is demonstrated that co-training (Blum & Mitchell, 1998) can improve the performance of prompt-based learning by using unlabeled data, and that co-training makes it possible to improve the original prompt model and at the same time learn a smaller, downstream task-specific model.
Self-Generated In-Context Learning: Leveraging Auto-regressive Language Models as a Demonstration Generator
TLDR
Self-generated in-context learning (SG-ICL) is proposed, which generates demonstrations for in-context learning from the PLM itself to minimize reliance on external demonstrations.
Black-box Prompt Learning for Pre-trained Language Models
TLDR
This work considers a new scenario in which one does not have access to a pre-trained model except for its outputs given inputs, calls this problem black-box fine-tuning, and proposes the solution black-box prompt, a new technique in the prompt-learning family, which can leverage the knowledge learned by pre-trained models from the pre-training corpus.
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
TLDR
The basics of this promising paradigm are described, along with a unified set of mathematical notations that can cover a wide variety of existing work, and existing work is organized along several dimensions, e.g. the choice of pre-trained models, prompts, and tuning strategies.
FLEX: Unifying Evaluation for Few-Shot NLP
TLDR
The FLEX Principles are formulated, a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation that include Sample Size Design, a novel approach to benchmark design that optimizes statistical accuracy and precision while keeping evaluation costs manageable.

References

Calibrate Before Use: Improving Few-Shot Performance of Language Models
TLDR
This work first estimates the model's bias towards each answer by asking for its prediction when given the training prompt and a content-free test input such as "N/A", and then fits calibration parameters that cause the prediction for this input to be uniform across answers.
Making Pre-trained Language Models Better Few-shot Learners
TLDR
The LM-BFF approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.
It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
TLDR
This work shows that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller, and identifies key factors required for successful natural language understanding with small language models.
Language Models are Few-Shot Learners
TLDR
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
True Few-Shot Learning with Language Models
TLDR
This work evaluates the few-shot ability of LMs when such held-out examples are unavailable, a setting the authors call true few-shot learning, and suggests that prior work significantly overestimated the true few-shot ability of LMs given the difficulty of few-shot model selection.
What Makes Good In-Context Examples for GPT-3?
TLDR
This work proposes to retrieve examples that are semantically-similar to a test query sample to formulate its corresponding prompt, and evaluates the proposed approach on several natural language understanding and generation benchmarks, where the retrieval-based prompt selection approach consistently outperforms the random selection baseline.
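The retrieval idea in this TLDR (pick in-context examples semantically similar to the test query) can be sketched without any model at all. The bag-of-words `embed` below is a deliberately toy stand-in; the cited approach uses a trained sentence encoder, and the example pool and query here are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(pool, query, k=2):
    """Return the k pool examples most similar to the query, best first."""
    q = embed(query)
    return sorted(pool, key=lambda ex: cosine(embed(ex), q), reverse=True)[:k]

pool = [
    "what is the capital of france",
    "how do i bake sourdough bread",
    "name the capital city of spain",
]
print(retrieve_examples(pool, "what is the capital of italy", k=2))
```

The retrieved examples would then be placed into the prompt in place of randomly chosen ones, which is the baseline the TLDR says this approach consistently beats.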
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
A Deep Reinforced Model for Abstractive Summarization
TLDR
A neural network model with a novel intra-attention that attends over the input and continuously generated output separately, and a new training method that combines standard supervised word prediction and reinforcement learning (RL) that produces higher quality summaries.
How Context Affects Language Models' Factual Predictions
TLDR
This paper reports that augmenting pre-trained language models in this way dramatically improves performance and that the resulting system, despite being unsupervised, is competitive with a supervised machine reading baseline.