Parameter-Efficient Transfer Learning with Diff Pruning

@inproceedings{Guo2021ParameterEfficientTL,
  title={Parameter-Efficient Transfer Learning with Diff Pruning},
  author={Demi Guo and Alexander M. Rush and Yoon Kim},
  booktitle={ACL},
  year={2021}
}
The large size of pretrained networks makes them difficult to deploy for multiple tasks in storage-constrained settings. Diff pruning enables parameter-efficient transfer learning that scales well with new tasks. The approach learns a task-specific “diff” vector that extends the original pretrained parameters. This diff vector is adaptively pruned during training with a differentiable approximation to the L0-norm penalty to encourage sparsity. As the number of tasks increases, diff pruning remains parameter-efficient, since only a small diff vector needs to be stored for each task.
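
As a concrete illustration of the method described above, the following is a minimal PyTorch-style sketch of a diff-pruned parameter. It is not the authors' implementation: a frozen pretrained tensor theta is extended by a diff delta = z * w, where the gate z is relaxed with a hard-concrete distribution so that an expected L0 penalty is differentiable. The module name and the constants gamma, zeta, beta are illustrative choices.

# Minimal, illustrative sketch of diff pruning (not the authors' code).
# A frozen pretrained parameter theta is extended by a task-specific diff
# delta = z * w, where z is a relaxed binary mask sampled from a hard-concrete
# distribution so that an expected L0 penalty on the diff is differentiable.
import math
import torch
import torch.nn as nn

class DiffPrunedParameter(nn.Module):
    # gamma, zeta, beta are the usual hard-concrete stretching/temperature constants.
    def __init__(self, pretrained: torch.Tensor, gamma=-0.1, zeta=1.1, beta=2/3):
        super().__init__()
        self.register_buffer("theta", pretrained.detach().clone())   # frozen weights
        self.w = nn.Parameter(torch.zeros_like(pretrained))          # dense diff values
        self.log_alpha = nn.Parameter(torch.zeros_like(pretrained))  # gate logits
        self.gamma, self.zeta, self.beta = gamma, zeta, beta

    def gate(self):
        if self.training:  # sample a relaxed binary gate per parameter
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:              # deterministic gate at evaluation time
            s = torch.sigmoid(self.log_alpha)
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def forward(self):
        # task-adapted parameter: pretrained value plus sparse diff
        return self.theta + self.gate() * self.w

    def expected_l0(self):
        # differentiable expected number of nonzero entries in the diff
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()

During training one would add a penalty proportional to expected_l0() to the task loss; after training, gates near zero can be thresholded away so that only the few nonzero diff entries need to be stored per task.
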

Citations

Composable Sparse Fine-Tuning for Cross-Lingual Transfer
TLDR
This work introduces a new fine-tuning method that outperforms adapters in zero-shot cross-lingual transfer by a large margin in a series of multilingual benchmarks, including Universal Dependencies, MasakhaNER, and AmericasNLI.
Learn-to-Share: A Hardware-friendly Transfer Learning Framework Exploiting Computation and Parameter Sharing
TLDR
This work proposes LeTS, a framework that leverages both computation and parameter sharing across multiple tasks, and proposes a novel neural architecture that contains a fixed pre-trained transformer model, plus learnable additive components for sub-tasks.
AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks
TLDR
The proposed AdapterBias adds a token-dependent shift to the hidden output of transformer layers to adapt to downstream tasks, using only a vector and a linear layer, which dramatically reduces the number of trainable parameters.
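
Reading only from the summary above, the token-dependent shift can be sketched as a shared vector scaled per token by the output of a linear layer. This is a hedged reconstruction for illustration, not the authors' code; the class and attribute names are invented.

# Illustrative sketch of a token-dependent representation shift in the spirit
# of AdapterBias, reconstructed from the one-sentence summary above.
import torch
import torch.nn as nn

class TokenDependentShift(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(hidden_size))      # shared shift vector
        self.alpha = nn.Linear(hidden_size, 1, bias=False)   # per-token scaling

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_size); each token gets its own scaled shift
        return hidden + self.alpha(hidden) * self.v
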
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
We introduce BitFit, a sparse-finetuning method where only the bias terms of the model (or a subset of them) are modified. We show that with small-to-medium training data, applying BitFit to pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model.
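
Bias-only fine-tuning of this kind is easy to sketch with Hugging Face Transformers: freeze every parameter whose name does not contain "bias", keeping the task head trainable. The checkpoint name and the name-based filter below are illustrative assumptions, not taken from the BitFit paper.

# Minimal sketch of bias-only ("BitFit"-style) fine-tuning:
# freeze everything except bias terms (and, here, the classification head).
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # illustrative checkpoint and label count
)
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.2f}%)")
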
DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models
TLDR
This work proposes a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both the weight updates and the final model weights, exploiting both unstructured and structured sparse patterns in pre-trained language models via magnitude-based pruning and ℓ1 sparse regularization.
Towards a Unified View of Parameter-Efficient Transfer Learning
TLDR
This paper re-frames state-of-the-art parameter-efficient transfer learning methods as modifications to specific hidden states in pretrained models, and defines a set of design dimensions along which different methods vary, achieving comparable results to fine-tuning all parameters on all four tasks.
Revisiting Parameter-Efficient Tuning: Are We Really There Yet?
TLDR
This work conducts the first comprehensive investigation into the training and evaluation of PETuning methods and finds that PETuning cannot yet yield consistently competitive performance, while fine-tuning remains the best-performing method in mid- and high-resource settings.
UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning
TLDR
A unified framework, UniPELT, is proposed, which incorporates different PELT methods as submodules and learns to activate, via a gating mechanism, the ones that best suit the current data or task setup, indicating that a mixture of multiple PELT methods may be inherently more effective than any single method.
Controlling the Focus of Pretrained Language Generation Models
TLDR
This work develops a control mechanism by which a user can select spans of context as “highlights” for the model to focus on when generating output: a pretrained model is augmented with trainable “focus vectors” that are applied directly to the model’s embeddings, while the model itself is kept fixed.

References

SHOWING 1-10 OF 85 REFERENCES
Parameter-Efficient Transfer Learning for NLP
TLDR
To demonstrate the adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, and adapters attain near state-of-the-art performance whilst adding only a few parameters per task.
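
The adapter modules referred to here are bottleneck layers inserted after each transformer sublayer. A minimal sketch, assuming a bottleneck dimension of 64 and near-identity initialization (both illustrative choices), is:

# Sketch of a bottleneck adapter in the style of Houlsby et al. (2019):
# down-projection, nonlinearity, up-projection, and a residual connection.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        nn.init.zeros_(self.up.weight)   # start as a near-identity transformation
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(torch.relu(self.down(hidden)))
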
Structured Pruning of Large Language Models
TLDR
A novel, structured pruning approach based on low rank factorization and augmented Lagrangian L0 norm regularization is presented, which achieves significant inference speedups while matching or outperforming the authors' unstructured pruning baseline at various sparsity levels.
TinyBERT: Distilling BERT for Natural Language Understanding
TLDR
A novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models is proposed; by leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.
Efficient Parametrization of Multi-domain Deep Neural Networks
TLDR
This paper proposes universal parametric families of neural networks that still contain specialized problem-specific models differing only in a small number of parameters, and shows that these universal parametrizations are highly effective for transfer learning, where they outperform traditional fine-tuning techniques.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Multitask Learning
  • R. Caruana
  • Computer Science
    Encyclopedia of Machine Learning and Data Mining
  • 1998
TLDR
Suggestions for getting the most out of multitask learning in artificial neural nets are presented, an algorithm for multitask learning with case-based methods such as k-nearest neighbor and kernel regression is described, and algorithms for multitask learning in decision trees are sketched.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Linguistic Knowledge and Transferability of Contextual Representations
TLDR
It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.
Multi-Task Deep Neural Networks for Natural Language Understanding
TLDR
A Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks that allows domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • 2019