Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right

@article{Holtzman2021SurfaceFC,
  title={Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right},
  author={Ari Holtzman and Peter West and Vered Shwartz and Yejin Choi and Luke Zettlemoyer},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.08315}
}
Large language models have shown promising results in zero-shot settings. For example, they can perform multiple choice tasks simply by conditioning on a question and selecting the answer with the highest probability. However, ranking by string probability can be problematic due to surface form competition—wherein different surface forms compete for probability mass, even if they represent the same underlying concept in a given context, e.g. “computer” and “PC.” Since probability mass is finite… 
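The setup the abstract describes can be made concrete with a short scoring script. Below is a minimal sketch, in Python with Hugging Face Transformers, of (a) ranking answer options by their conditional string probability, the zero-shot procedure the abstract refers to, and (b) a PMI-style rescoring in the spirit of the paper's domain-conditional PMI, which discounts each option's probability under a short domain premise so that generically frequent surface forms are not favored. The model choice ("gpt2"), the question, the answer options, and the domain premise string are all illustrative assumptions, not examples taken from the paper.

# Sketch: zero-shot multiple choice by string probability, plus a PMI-style
# rescoring. Model, prompts, and options below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_logprob(premise: str, answer: str) -> float:
    """Sum of log p(answer token | premise, preceding answer tokens)."""
    premise_ids = tokenizer(premise, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    full_ids = torch.cat([premise_ids, answer_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # The token at position t is predicted by the logits at position t - 1,
    # so only the answer positions contribute to the score.
    for t in range(premise_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, t - 1, full_ids[0, t]].item()
    return total

question = "Question: What is another word for a personal computer?\nAnswer:"
domain_premise = "Answer:"  # short, task-generic premise (an assumption here)
answers = [" a computer", " a PC", " a banana"]

for a in answers:
    cond = answer_logprob(question, a)         # plain string probability
    prior = answer_logprob(domain_premise, a)  # probability under the premise alone
    print(f"{a!r}: log p = {cond:.2f}, PMI-style score = {cond - prior:.2f}")

Because probability mass is split across paraphrases such as " a computer" and " a PC", the plain log-probability ranking can prefer whichever surface form happens to be most frequent in the training distribution; subtracting the premise-only log probability instead scores how much the question raises each option's likelihood rather than how common its surface form is.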

Citations

Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification
TLDR
This work focuses on incorporating external knowledge into the verbalizer, forming knowledgeable prompt-tuning (KPT), to improve and stabilize prompt-tuning.
GPT-3 for Few-Shot Dialogue State Tracking
TLDR
It is found that natural language instructions in the prompt have little impact on performance, that larger language models do not always induce higher downstream performance, and that GPT-3 is highly sensitive to the order and number of the in-context examples.
Nearest Neighbor Zero-Shot Inference
TLDR
The introduction of fuzzy verbalizers, which leverage the sparse kNN distribution for downstream tasks by automatically associating each classification label with a set of natural language tokens, shows that augmenting a language model with retrieval can bring significant gains for zero-shot inference.
Language models show human-like content effects on reasoning
TLDR
This work hypothesized that language models would show human-like content effects on abstract reasoning problems, and explored this hypothesis across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task.
Evaluating Prompts Across Multiple Choice Tasks In a Zero-Shot Setting
TLDR
Prompts from a diverse range of tasks are collected and standardized for use with tasks they were not designed for, and are evaluated across multiple-choice datasets for a quantitative analysis of how certain attributes of a prompt affect performance.
Coherence boosting: When your pretrained language model is not paying enough attention
TLDR
It is found that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training.
ZeroGen: Efficient Zero-shot Learning via Dataset Generation
TLDR
It is argued that ZEROGEN can also provide useful insights from the perspective of data-free, model-agnostic knowledge distillation and unreferenced text generation evaluation.
Do Language Models Learn Commonsense Knowledge?
Language models (LMs) trained on large amounts of data (e.g., Brown et al., 2020; Patwary et al., 2021) have shown impressive performance on many NLP tasks in zero-shot and few-shot setups.
Boosting coherence of language models
TLDR
It is found that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training.
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
TLDR
It is shown that ground truth demonstrations are in fact not required and other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of the label space, the distribution of the input text, and the overall format of the sequence.
...

References

Language Models are Few-Shot Learners
TLDR
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Sparse Sequence-to-Sequence Models
TLDR
Sparse sequence-to-sequence models are proposed, rooted in a new family of α-entmax transformations, which includes softmax and sparsemax as particular cases and is sparse for any α > 1.
Character-level Convolutional Networks for Text Classification
TLDR
This article constructs several large-scale datasets to show that character-level convolutional networks can achieve state-of-the-art or competitive results in text classification.
SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning
TLDR
The two systems that competed in this task as part of SemEval-2012 are described, and their results are compared to those achieved in previously published research.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
TLDR
A Sentiment Treebank is introduced that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality, along with the Recursive Neural Tensor Network.
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
  • Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
  • 2020
How can we know what language models know?
  • Transactions of the Association for Computational Linguistics, 8:423–438
  • 2020
Finetuned Language Models Are Zero-Shot Learners
TLDR
It is shown that instruction tuning, finetuning language models on a collection of datasets described via instructions, substantially improves zero-shot performance on unseen tasks and outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.
True Few-Shot Learning with Language Models
TLDR
This work evaluates the few-shot ability of LMs when such held-out examples are unavailable, a setting the authors call true few-shot learning, and suggests that prior work significantly overestimated the true few-shot ability of LMs given the difficulty of few-shot model selection.
...