Corpus ID: 235458053

Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning

@article{Wei2021WhyDP,
  title={Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning},
  author={Colin Wei and Sang Michael Xie and Tengyu Ma},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.09226}
}
Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However, theoretical analysis of these models is scarce and challenging since the pretraining and downstream tasks can be very different. We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text — the downstream classifier must recover a function of the posterior distribution over the latent variables. We…
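To make the two adaptation strategies named in the title concrete, here is a minimal PyTorch sketch (not the authors' code) contrasting head tuning, which trains only a classifier on frozen pretrained features, with prompt tuning, which trains only a few continuous prompt vectors prepended to the input while the model stays frozen. The toy Transformer encoder, the readout layer, and all dimensions are assumptions for illustration (real prompt tuning reads predictions off the LM's output vocabulary); it assumes a recent PyTorch.

    import torch
    import torch.nn as nn

    d_model, vocab, n_classes, prompt_len = 32, 100, 2, 4

    # Stand-in for a frozen pretrained language model (toy, untrained).
    embed = nn.Embedding(vocab, d_model)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
        num_layers=2,
    )
    for p in list(embed.parameters()) + list(encoder.parameters()):
        p.requires_grad = False  # pretrained weights stay fixed in both settings

    x = torch.randint(0, vocab, (8, 16))   # a batch of 8 token sequences of length 16
    y = torch.randint(0, n_classes, (8,))  # downstream labels

    # Head tuning: only a linear classifier on top of frozen features is trained.
    head = nn.Linear(d_model, n_classes)
    feats = encoder(embed(x)).mean(dim=1)  # frozen sentence representation
    loss_head = nn.functional.cross_entropy(head(feats), y)

    # Prompt tuning: only `prompt_len` continuous prompt vectors are trained;
    # they are prepended to the token embeddings before the frozen encoder runs.
    prompt = nn.Parameter(0.02 * torch.randn(prompt_len, d_model))
    inp = torch.cat([prompt.expand(x.size(0), -1, -1), embed(x)], dim=1)
    readout = nn.Linear(d_model, n_classes)  # simplified readout for this sketch
    loss_prompt = nn.functional.cross_entropy(readout(encoder(inp)[:, 0]), y)

    print(loss_head.item(), loss_prompt.item())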

References

Showing 1-10 of 34 references
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little
The results show that purely distributional information largely explains the success of pretraining, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
Making Pre-trained Language Models Better Few-shot Learners
The LM-BFF approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.
The Power of Scale for Parameter-Efficient Prompt Tuning
This work explores “prompt tuning”, a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks, and shows that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.
What do you learn from context? Probing for sentence structure in contextualized word representations
A novel edge probing task design is introduced and a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline is constructed to investigate how sentence structure is encoded across a range of syntactic, semantic, local, and long-range phenomena.
Improving Language Understanding by Generative Pre-Training
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-tuning is proposed as a lightweight alternative to fine-tuning for natural language generation tasks: it keeps language model parameters frozen but optimizes a small continuous task-specific vector (called the prefix).
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
The remarkable success of pretrained language models has motivated the study of what kinds of knowledge these models learn during pretraining. Reformulating tasks as fill-in-the-blanks problems…
WARP: Word-level Adversarial ReProgramming
An alternative approach based on adversarial reprogramming is presented, which attempts to learn task-specific word embeddings that, when concatenated to the input text, instruct the language model to solve the specified task.
Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction
The proposed method provides an effective way of extracting constituency trees from the pre-trained LMs without training, and reports intriguing findings in the induced trees, including the fact that pre-trained LMs outperform other approaches in correctly demarcating adverb phrases in sentences.
Deep Contextualized Word Representations
A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.