UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning

Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Wen-tau Yih, Madian Khabsa
Recent parameter-efficient language model tuning (PELT) methods manage to match the performance of fine-tuning with far fewer trainable parameters and perform especially well when training data is limited. However, different PELT methods may perform rather differently on the same task, making it nontrivial to select the most appropriate method for a specific task, especially considering the fast-growing number of new PELT methods and tasks. In light of model diversity and the difficulty of…




AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models

This work introduces a new mechanism to improve adapter capacity without increasing parameters or computational cost by two key techniques and demonstrates these techniques to work well across multiple task settings including fully supervised and few-shot Natural Language Understanding tasks.

AutoPEFT: Automatic Configuration Search for Parameter-Efficient Fine-Tuning

AutoPEFT is proposed, a novel framework to traverse a rich configuration search space spanning multiple representative PEFT modules along with finer-grained configuration decisions over the modules (e.g., parameter budget, insertion layer), outperforming existing PEFT methods on average on the standard GLUE benchmark.

AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning

AdaMix is proposed as a general PEFT method that tunes a mixture of adaptation modules – given the underlying PEFT method of choice – introduced in each Transformer layer while keeping most of the PLM weights frozen.

Neural Prompt Search

This paper proposes Neural prOmpt seArcH (NOAH), a novel approach that learns, for large vision models, the optimal design of prompt modules through a neural architecture search algorithm, specifically for each downstream dataset.

Revisiting Parameter-Efficient Tuning: Are We Really There Yet?

It is observed that the number of trainable parameters and training iterations are two main factors: reducing trainable parameters and prolonging training iterations may lead to higher stability in PETuning methods.

Parameter-Efficient Tuning on Layer Normalization for Pre-trained Language Models

LN-tuning is proposed, which tunes only the gain and bias terms of the LayerNorm modules (about 0.03% of parameters); it is time-efficient, achieves SOTA performance, and performs much better than baselines that tune fewer than 0.1% of parameters.
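The recipe above is simple enough to sketch in a few lines of PyTorch. The tiny module stack below is a hypothetical stand-in for a Transformer block, not the pre-trained LMs used in the paper:

```python
import torch
from torch import nn

# Toy stack standing in for a Transformer encoder block (hypothetical sizes).
block = nn.Sequential(
    nn.Linear(16, 16),
    nn.LayerNorm(16),
    nn.Linear(16, 16),
    nn.LayerNorm(16),
)

# LN-tuning: freeze everything, then unfreeze only LayerNorm gain and bias.
for param in block.parameters():
    param.requires_grad = False
for module in block.modules():
    if isinstance(module, nn.LayerNorm):
        for param in module.parameters():  # weight (gain) and bias
            param.requires_grad = True

trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
# Two LayerNorms x (16 gain + 16 bias) = 64 trainable out of 608 total.
```

On a real PLM the LayerNorm parameters are a vanishingly small slice of the total, which is where the ~0.03% figure comes from.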

DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation

The results show that the authors can train dynamic search-free models with DyLoRA at least 7× faster than LoRA without significantly compromising performance, and their models can perform consistently well on a much larger range of ranks compared to LoRA.
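DyLoRA builds on LoRA's low-rank reparameterization of the weight update. A minimal NumPy sketch of the underlying LoRA forward pass (hypothetical shapes; this does not show DyLoRA's dynamic-rank training):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2                 # rank r much smaller than d

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero-init

def lora_forward(x, alpha=4.0):
    """y = W x + (alpha/r) * B A x; only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B zero-initialized, the adapted model matches the frozen one exactly.
assert np.allclose(lora_forward(x), W @ x)
```

DyLoRA's twist is to train the decomposition so it works well when truncated to many different ranks r, avoiding a separate search over rank.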

Searching for Effective Multilingual Fine-Tuning Methods: A Case Study in Summarization

An extensive empirical evaluation of various tuning strategies for multilingual learning, particularly in the context of text summarization, to establish a new state-of-the-art on the XL-Sum dataset.

Parameter-Efficient Fine-Tuning Design Spaces

This work introduces parameter-efficient fine-tuning design spaces that parameterize tuning structures and tuning strategies and discovers design patterns that are applicable to different experimental settings.



The Power of Scale for Parameter-Efficient Prompt Tuning

This work explores “prompt tuning”, a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks, and shows that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.

On the Effectiveness of Adapter-based Tuning for Pretrained Language Model Adaptation

It is demonstrated that 1) adapter-based tuning outperforms fine-tuning on low-resource and cross-lingual tasks; 2) it is more robust to overfitting and less sensitive to changes in learning rates.
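The bottleneck adapters these results study can be sketched in a few lines of NumPy (hypothetical sizes; real adapters sit inside every Transformer layer and are trained by backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4                              # hidden size d, bottleneck r << d

W_down = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
W_up = np.zeros((d, r))                   # trainable up-projection, zero-init

def adapter(h):
    """Bottleneck adapter with a residual connection:
    h + W_up · relu(W_down · h). Only W_down and W_up are trained;
    the surrounding pretrained weights stay frozen."""
    z = np.maximum(W_down @ h, 0.0)       # ReLU nonlinearity
    return h + W_up @ z

h = rng.normal(size=(d,))
# Zero-initializing W_up makes the adapter an identity map at the start
# of training, which helps explain its robustness to overfitting.
assert np.allclose(adapter(h), h)
```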

COMPACTER: Efficient Low-Rank Hypercomplex Adapter Layers

  Joe Davison · Computer Science · 2021
This work proposes COMPACTER, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work.

Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks

This paper shows that one can learn adapter parameters for all layers and tasks by generating them using shared hypernetworks, which condition on task, adapter position, and layer id in a transformer model.

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Prefix-tuning is proposed, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, called the prefix.
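As a rough illustration (plain NumPy, a single attention head, hypothetical shapes), prefix-tuning can be viewed as prepending trainable key/value vectors to the frozen model's attention inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, m = 8, 5, 3      # head dim, input length, prefix length

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Queries/keys/values the frozen pretrained model computes for the input.
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

# Trainable prefix key/value vectors — the only task-specific parameters.
P_k = rng.normal(size=(m, d))
P_v = rng.normal(size=(m, d))

def prefix_attention(Q, K, V, P_k, P_v):
    K_full = np.concatenate([P_k, K], axis=0)    # (m+T, d)
    V_full = np.concatenate([P_v, V], axis=0)
    attn = softmax(Q @ K_full.T / np.sqrt(d))    # every token can attend
    return attn @ V_full                         # to the learned prefix

out = prefix_attention(Q, K, V, P_k, P_v)
assert out.shape == (T, d)
```

In the actual method the prefix is inserted at every layer and optimized end-to-end while all LM weights stay frozen.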

What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning

This paper examines two recent pretrained language models, BERT and RoBERTa, across standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability, and shows that only a fourth of the final layers need to be fine-tuned to achieve 90% of the original quality.

BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models

We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset of them) are being modified. We show that with small-to-medium training data, applying BitFit on pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model.
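The bias-only recipe is simple enough to sketch in a few lines of PyTorch; the tiny model below is a hypothetical stand-in, not the BERT models used in the paper:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

# BitFit: freeze everything, then re-enable gradients for bias terms only.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
# Biases only: 36 of 676 parameters are trainable in this toy model.
```

Any optimizer then simply receives the parameters with `requires_grad=True`, so no architectural change to the model is needed.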

Parameter-Efficient Transfer Learning with Diff Pruning

Diff pruning can match the performance of finetuned baselines on the GLUE benchmark while only modifying 0.5% of the pretrained model’s parameters per task and scales favorably in comparison to popular pruning approaches.

AdapterHub: A Framework for Adapting Transformers

AdapterHub is proposed, a framework that allows dynamic “stitching-in” of pre-trained adapters for different tasks and languages and that enables scalable and easy sharing of task-specific models, particularly in low-resource scenarios.

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.