Corpus ID: 235458009

LoRA: Low-Rank Adaptation of Large Language Models

@article{Hu2021LoRALA,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Weizhu Chen},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.09685}
}
The dominant paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, conventional fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying many independent instances of fine-tuned models, each with 175B parameters, is extremely expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
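The low-rank update described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' released implementation: the class name `LoRALinear` and the hyperparameters `r` and `alpha` are placeholders, and the update is assumed to be factored as B·A with rank r much smaller than the layer dimensions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Hypothetical sketch of a LoRA-augmented linear layer.

    The pre-trained weight W is frozen; only the rank-r factors A and B
    are trained, so the effective weight is W + (alpha / r) * B @ A.
    """

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze pre-trained weights
        self.base.bias.requires_grad_(False)
        # Low-rank factors: A starts random, B starts at zero, so the adapted
        # model initially behaves exactly like the pre-trained one.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Only `lora_A` and `lora_B` receive gradients, and because the product B·A has the same shape as the frozen weight, it can be merged into W for deployment, so the adapted model adds no extra inference latency.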
UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning
TLDR
A unified framework, UniPELT, is proposed that incorporates different PELT methods as submodules and learns to activate the ones that best suit the current data or task setup; it often surpasses the upper bound obtained by taking the best individual performance of its submodules on each task.
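The gating idea in the TLDR can be pictured roughly as follows. The sigmoid-gate formulation, module names, and pooling choice below are illustrative assumptions rather than the paper's exact architecture; the submodules stand in for branches such as adapters, prefixes, or LoRA updates.

```python
import torch
import torch.nn as nn

class GatedPELTCombiner(nn.Module):
    """Illustrative gating over parameter-efficient submodules: each branch's
    output is scaled by a learned gate computed from the hidden states."""

    def __init__(self, hidden_size, submodules):
        super().__init__()
        self.submodules = nn.ModuleList(submodules)
        # One scalar gate per submodule, predicted from the mean hidden state.
        self.gates = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in submodules])

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        out = hidden
        pooled = hidden.mean(dim=1)                       # (batch, hidden)
        for sub, gate in zip(self.submodules, self.gates):
            g = torch.sigmoid(gate(pooled)).unsqueeze(1)  # (batch, 1, 1)
            out = out + g * sub(hidden)
        return out
```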
SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer
As pre-trained language models have gotten larger, there has been growing interest in parameter-efficient methods to apply these models to downstream tasks. Building on the PROMPTTUNING approach of…
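A minimal sketch of soft-prompt tuning and the transfer step implied by the title, assuming the backbone stays frozen and a prompt trained on a source task simply initializes the target-task prompt; the class name `SoftPrompt` and the dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt vectors prepended to the frozen model's input
    embeddings; only these vectors are updated during tuning."""

    def __init__(self, prompt_len: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Transfer sketch: a prompt tuned on a source task initializes the
# target-task prompt instead of starting from random vectors.
source_prompt = SoftPrompt(prompt_len=100, embed_dim=768)
# ... tune source_prompt on the source task with the backbone frozen ...
target_prompt = SoftPrompt(prompt_len=100, embed_dim=768)
target_prompt.prompt.data.copy_(source_prompt.prompt.data)
```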

References

SHOWING 1-10 OF 39 REFERENCES
Parameter-Efficient Transfer Learning for NLP
TLDR
To demonstrate adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, and adapters attain near state-of-the-art performance whilst adding only a few parameters per task.
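A minimal sketch of the bottleneck adapter described above, assuming the standard down-projection / nonlinearity / up-projection form with a residual connection; the class name, activation, and bottleneck size are placeholders.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a small down-projection, nonlinearity, and
    up-projection with a residual connection, inserted into each layer
    while the original layer weights stay frozen."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()
        # Near-identity initialization so training starts from the
        # pre-trained model's behavior.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))
```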
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
TLDR
This paper empirically shows that common pre-trained models have a very low intrinsic dimension, and connects intrinsic dimensionality with low-dimensional task representations and compression-based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.
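The intrinsic-dimension measurement rests on reparameterizing the trainable weights through a fixed random projection, so only a d-dimensional vector is optimized. A minimal single-layer sketch follows; the dense projection here is an assumption made for brevity (more memory-efficient structured projections are typically used in practice), and the names are illustrative.

```python
import torch
import torch.nn as nn

class IntrinsicDimLinear(nn.Module):
    """Subspace reparameterization sketch: the layer's weight is
    theta_0 + reshape(P @ z), where theta_0 and the random projection P
    are frozen and only the d-dimensional vector z is trained."""

    def __init__(self, in_features: int, out_features: int, intrinsic_dim: int):
        super().__init__()
        # Stand-in for pre-trained weights, kept frozen.
        self.theta0 = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)
        # Fixed random projection from the low-dimensional subspace.
        self.register_buffer("P", torch.randn(out_features * in_features, intrinsic_dim)
                             / intrinsic_dim ** 0.5)
        # The only trained parameters.
        self.z = nn.Parameter(torch.zeros(intrinsic_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.theta0 + (self.P @ self.z).view_as(self.theta0)
        return x @ weight.T
```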
WARP: Word-level Adversarial ReProgramming
TLDR
An alternative approach based on adversarial reprogramming is presented, which attempts to learn task-specific word embeddings that, when concatenated to the input text, instruct the language model to solve the specified task.
Prefix-Tuning: Optimizing Continuous Prompts for Generation
TLDR
Prefix-tuning is proposed, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen but optimizes a small continuous task-specific vector (called the prefix).
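A rough single-head sketch of the prefix mechanism, assuming the trainable prefix is realized as extra key/value vectors prepended at each attention layer while the pre-trained projections stay frozen; the class and parameter names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixedAttention(nn.Module):
    """Single-head attention with a trainable prefix: learned key/value
    vectors are prepended to the layer's keys and values."""

    def __init__(self, hidden_size: int, prefix_len: int = 10):
        super().__init__()
        self.q = nn.Linear(hidden_size, hidden_size)
        self.k = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, hidden_size)
        for proj in (self.q, self.k, self.v):          # frozen pre-trained projections
            for p in proj.parameters():
                p.requires_grad_(False)
        # The only trainable parameters: per-layer prefix keys and values.
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, hidden_size) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, hidden_size) * 0.02)
        self.scale = hidden_size ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        q = self.q(x)
        k = torch.cat([self.prefix_k.expand(b, -1, -1), self.k(x)], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), self.v(x)], dim=1)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v
```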
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
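The "one additional output layer" recipe can be sketched as follows, with a placeholder encoder standing in for the pre-trained model; this illustrates the fine-tuning setup only, not a reference implementation, and the name `ClassifierHead` is hypothetical.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """A single linear output layer on top of an encoder's first-token
    representation, in the style of BERT fine-tuning; `encoder` is any
    module mapping (batch, seq, hidden) -> (batch, seq, hidden)."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_embeds)   # (batch, seq, hidden)
        cls_state = hidden[:, 0]              # first ([CLS]-style) position
        return self.classifier(cls_state)     # (batch, num_labels)
```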
Improving Language Understanding by Generative Pre-Training
TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets
TLDR
A low-rank matrix factorization of the final weight layer is proposed and applied to DNNs for both acoustic modeling and language modeling, showing an equivalent reduction in training time with no significant loss in final recognition accuracy compared to a full-rank representation.
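The factorization itself is straightforward: the large final weight matrix is replaced by two thin factors. A minimal sketch, with hypothetical dimensions chosen only to show the parameter saving:

```python
import torch
import torch.nn as nn

class FactoredOutputLayer(nn.Module):
    """Replaces a large final weight matrix W (hidden x vocab) with two
    thin factors, W ~= U @ V, cutting the parameter count from
    hidden*vocab to roughly rank*(hidden + vocab)."""

    def __init__(self, hidden_size: int, vocab_size: int, rank: int = 256):
        super().__init__()
        self.U = nn.Linear(hidden_size, rank, bias=False)  # hidden -> rank
        self.V = nn.Linear(rank, vocab_size)               # rank -> vocab

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.V(self.U(hidden))
```

For example, with hidden size 1024, a 100k-unit output layer, and rank 256, the full matrix has about 102M weights while the factored version has roughly 26M.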
Attention is All you Need
TLDR
A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
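The core operation of the Transformer is scaled dot-product attention, softmax(QKᵀ/√d_k)V. A minimal batched, single-head functional sketch:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    q, k, v have shape (batch, seq, d_k); mask, if given, is 0 where
    attention should be blocked."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```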
Low-rank plus diagonal adaptation for deep neural networks
Yong Zhao, Jinyu Li, Y. Gong · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016
TLDR
A scalable adaptation technique is proposed that adapts the deep neural network (DNN) model through the low-rank plus diagonal (LRPD) decomposition, inspired by the observation that adaptation matrices are very close to an identity matrix or diagonally dominant.
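A minimal sketch of the LRPD parameterization, assuming the adaptation transform applied to a hidden layer is written as diag(d) + U·V and initialized at the identity; the class name and the rank are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LRPDAdapt(nn.Module):
    """Low-rank plus diagonal adaptation: the per-condition adaptation
    matrix is parameterized as diag(d) + U @ V, exploiting the observation
    that learned adaptation matrices sit close to the identity."""

    def __init__(self, hidden_size: int, rank: int = 8):
        super().__init__()
        self.d = nn.Parameter(torch.ones(hidden_size))         # diagonal part, starts at identity
        self.U = nn.Parameter(torch.zeros(hidden_size, rank))  # low-rank correction starts at zero
        self.V = nn.Parameter(torch.randn(rank, hidden_size) * 0.01)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h @ (diag(d) + U V)^T  ==  h * d + (h @ V^T) @ U^T
        return h * self.d + (h @ self.V.T) @ self.U.T
```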