The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester, Rami Al-Rfou, Noah Constant
In this work, we explore “prompt tuning,” a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3’s few-shot learning by a large margin. More remarkably, through ablations on model size using T5… 
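
As a rough illustration of the idea in the abstract above, the toy sketch below prepends a few trainable "soft prompt" vectors to the embeddings of a frozen model and updates only those vectors by gradient descent. It is not the authors' implementation: the mean-pooling "model", the squared-error loss, and every name and size (`DIM`, `PROMPT_LEN`, `forward`, `train_step`) are assumptions chosen to keep the example dependency-free.

```python
import random

random.seed(0)
DIM, PROMPT_LEN, VOCAB = 4, 3, 10  # hypothetical toy sizes

# Frozen "model": an embedding table and a linear readout (never updated).
embed = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
readout = [random.uniform(-1, 1) for _ in range(DIM)]

# The only trainable parameters: PROMPT_LEN soft-prompt vectors.
prompt = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(PROMPT_LEN)]


def forward(token_ids, prompt):
    """Prepend the soft prompt to the token embeddings, mean-pool, read out."""
    seq = prompt + [embed[t] for t in token_ids]
    pooled = [sum(vec[d] for vec in seq) / len(seq) for d in range(DIM)]
    return sum(w * p for w, p in zip(readout, pooled))


def train_step(token_ids, target, prompt, lr=0.5):
    """One gradient step on squared error; returns an updated prompt only."""
    n = len(prompt) + len(token_ids)
    err = forward(token_ids, prompt) - target
    # For this toy model, d(loss)/d(prompt[i][d]) = 2 * err * readout[d] / n.
    return [[p[d] - lr * 2 * err * readout[d] / n for d in range(DIM)] for p in prompt]


tokens, target = [1, 4, 7], 1.0
frozen_before = [row[:] for row in embed]
loss_before = (forward(tokens, prompt) - target) ** 2
for _ in range(200):
    prompt = train_step(tokens, target, prompt)
loss_after = (forward(tokens, prompt) - target) ** 2
```

The loss falls because the prompt vectors move, while `embed` and `readout` stay exactly as they were. That separation is the point of the technique: one frozen model can serve many tasks, each with its own small prompt.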

P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks

The method P-Tuning v2 is an implementation of Deep Prompt Tuning (CITATION) optimized and adapted for NLU and can serve as an alternative to finetuning and a strong baseline for future research.

P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks

This work presents the empirical finding that properly optimized prompt tuning can be universally effective across a wide range of model scales and NLU tasks, matching the performance of fine-tuning while tuning only 0.1%-3% of the parameters.

SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer

It is shown that SPoT significantly boosts the performance of Prompt Tuning across many tasks, and an efficient retrieval approach is proposed that interprets task prompts as task embeddings to identify similar tasks and predict the most transferable source tasks for a novel target task.
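
The retrieval idea in this summary, treating each learned task prompt as a task embedding and looking for similar tasks, can be sketched as follows. This is a minimal illustration under stated assumptions: averaging the prompt's token vectors and ranking by cosine similarity is one plausible similarity measure, and the task names and vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def task_embedding(prompt):
    """Average a prompt's token vectors into a single task embedding."""
    dim = len(prompt[0])
    return [sum(tok[d] for tok in prompt) / len(prompt) for d in range(dim)]


def rank_source_tasks(target_prompt, source_prompts):
    """Return source task names, most similar to the target first."""
    target = task_embedding(target_prompt)
    scores = {
        name: cosine(target, task_embedding(p)) for name, p in source_prompts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical 2-dimensional prompts, two tokens each.
sources = {
    "nli": [[1.0, 0.0], [0.8, 0.2]],
    "qa": [[0.0, 1.0], [0.1, 0.9]],
}
target = [[0.9, 0.1], [1.0, 0.0]]
ranking = rank_source_tasks(target, sources)  # -> ["nli", "qa"]
```

The top-ranked source prompt would then be the natural candidate for initializing the target task's prompt.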

Reducing Retraining by Recycling Parameter-Efficient Prompts

This work proposes and investigates several approaches to “Prompt Recycling”, where a prompt trained on a source model is transformed to work with a new target model, and shows that recycling between models is possible.

PANDA: Prompt Transfer Meets Knowledge Distillation for Efficient Model Adaptation

This work proposes a new metric that accurately predicts prompt transferability, along with a novel prompt transfer (PoT) approach, PANDA, that uses knowledge distillation to transfer the “knowledge” from the source prompt to the target prompt and effectively alleviate catastrophic forgetting.

PPT: Pre-trained Prompt Tuning for Few-shot Learning

This work proposes to pre-train prompts by adding soft prompts into the pre-training stage to obtain a better initialization, and names this framework Pre-trained Prompt Tuning (“PPT”).

Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models

Though initially proposed as an efficient method to steer large models, some of the fascinating evidence discovered along with delta tuning could help further reveal the mechanisms of PLMs and even deep neural networks.

Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models

This work shows that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering, and recommends finetuned LMs for few-shot learning because they are more accurate, robust to different prompts, and can be made nearly as efficient as frozen LMs.

No More Fine-Tuning? An Experimental Evaluation of Prompt Tuning in Code Intelligence

Pre-trained models have been shown to be effective in many code intelligence tasks. These models are pre-trained on large-scale unlabeled corpora and then fine-tuned on downstream tasks. However, as the

Prompt-free and Efficient Few-shot Learning with Language Models

Experiments demonstrate that Perfect, a simple and efficient method for few-shot fine-tuning of PLMs without relying on any such handcrafting, also outperforms existing state-of-the-art few-shot learning methods.



Prefix-Tuning: Optimizing Continuous Prompts for Generation

Prefix-tuning is proposed as a lightweight alternative to fine-tuning for natural language generation tasks; it keeps the language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, called the prefix.
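
One way to picture a prefix is as extra key/value vectors that every query can attend to, while the model's own keys and values stay frozen. The single-head attention sketch below illustrates that reading. It is an assumption-laden toy, not the paper's implementation: the function names and all numbers are hypothetical, and training of the prefix is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def attend_with_prefix(query, keys, values, prefix_keys, prefix_values):
    """Same attention, with trainable prefix keys/values placed in front."""
    return attend(query, prefix_keys + keys, prefix_values + values)


query = [1.0, 0.0]
keys, values = [[1.0, 0.0]], [[0.0, 0.0]]                # frozen model activations
prefix_keys, prefix_values = [[1.0, 0.0]], [[2.0, 2.0]]  # task-specific prefix

plain = attend(query, keys, values)
steered = attend_with_prefix(query, keys, values, prefix_keys, prefix_values)
```

Here the prefix key scores the same as the real key, so attention splits evenly and the output moves halfway toward the prefix value. The frozen weights are untouched, yet the computation is steered.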

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning

This paper empirically shows that common pre-trained models have a very low intrinsic dimension, and connects intrinsic dimensionality with low-dimensional task representations and compression-based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.

GPT Understands, Too

It is shown that GPTs can be better than or comparable to similar-sized BERTs on NLU tasks with a novel method, P-tuning, which employs trainable continuous prompt embeddings and outperforms the state-of-the-art approaches on the few-shot SuperGLUE benchmark.

Parameter-Efficient Transfer Learning for NLP

To demonstrate the effectiveness of adapters, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, and adapters attain near state-of-the-art performance while adding only a few parameters per task.
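
To make "only a few parameters per task" concrete, the sketch below shows a bottleneck adapter in its commonly described form: project down, apply a nonlinearity, project up, and add the result back to the input. The layout, the omission of bias terms, and the sizes used are assumptions for illustration, not details taken from this listing.

```python
def relu(x):
    """Elementwise ReLU."""
    return [max(0.0, v) for v in x]


def matvec(matrix, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def adapter(x, w_down, w_up):
    """Bottleneck adapter with a residual connection (biases omitted)."""
    hidden = relu(matvec(w_down, x))  # d_model -> bottleneck
    update = matvec(w_up, hidden)     # bottleneck -> d_model
    return [xi + ui for xi, ui in zip(x, update)]


def adapter_param_count(d_model, bottleneck):
    """Weights in the two projections of one adapter (no biases)."""
    return 2 * d_model * bottleneck


x = [1.0, -2.0, 3.0]
w_down = [[0.5, 0.1, -0.3], [0.2, 0.4, 0.6]]  # bottleneck of size 2
w_up_zero = [[0.0, 0.0] for _ in range(3)]    # zero-initialised up-projection
unchanged = adapter(x, w_down, w_up_zero)     # equals x: the adapter starts as identity

count = adapter_param_count(768, 64)          # -> 98304
```

With a zero up-projection the adapter passes its input through unchanged, so inserting it does not disturb the pre-trained network before training. At the hypothetical sizes above, one adapter adds roughly 98 thousand weights.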

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference

This work introduces Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task.
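
The cloze reformulation can be illustrated with a pattern that turns an input into a fill-in-the-blank phrase and a verbalizer that maps each label to a single word. The pattern wording, the label words, and the scores below are all hypothetical; a real system would take the scores from a masked language model.

```python
MASK = "[MASK]"

# Hypothetical verbalizer: one word per label.
VERBALIZER = {"positive": "great", "negative": "terrible"}


def pattern(review):
    """Rewrite a review as a cloze-style phrase with one masked slot."""
    return f"{review} It was {MASK}."


def classify(mask_word_scores):
    """Pick the label whose verbalizer word scores highest at the mask."""
    return max(
        VERBALIZER,
        key=lambda label: mask_word_scores.get(VERBALIZER[label], float("-inf")),
    )


cloze = pattern("Best pizza ever!")  # -> "Best pizza ever! It was [MASK]."
# Stand-in for a language model's scores for words at the masked position.
label = classify({"great": 0.7, "terrible": 0.1})  # -> "positive"
```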

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.