A Stability Analysis of Fine-Tuning a Pre-Trained Model
@article{Fu2023ASA,
  title   = {A Stability Analysis of Fine-Tuning a Pre-Trained Model},
  author  = {Zihao Fu and Anthony Man-Cho So and Nigel Collier},
  journal = {ArXiv},
  year    = {2023},
  volume  = {abs/2301.09820}
}
Fine-tuning a pre-trained model (such as BERT, ALBERT, RoBERTa, T5, GPT, etc.) has proven to be one of the most promising paradigms in recent NLP research. However, numerous recent works indicate that fine-tuning suffers from the instability problem, i.e., tuning the same model under the same setting results in significantly different performance. Many recent works have proposed different methods to solve this problem, but there is no theoretical understanding of why and how these methods work…
68 References
On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines (ICLR 2021)
This paper analyzes BERT, RoBERTa, and ALBERT fine-tuned on three commonly used datasets from the GLUE benchmark and shows that the observed instability is caused by optimization difficulties that lead to vanishing gradients.
On the Effectiveness of Parameter-Efficient Fine-Tuning (ArXiv 2022)
Proposes a novel Second-order Approximation Method (SAM) that approximates the original fine-tuning problem with an analytically solvable optimization function; the tunable parameters are then determined by directly optimizing this approximation.
Improving Stability of Fine-Tuning Pretrained Language Models via Component-Wise Gradient Norm Clipping (EMNLP 2022)
This paper proposes a simple component-wise gradient norm clipping method that adjusts the convergence speed of different components, achieving consistent improvements in generalization performance, convergence speed, and training stability.
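A minimal sketch of the component-wise clipping idea, assuming PyTorch and a HuggingFace-style model whose parameter names are grouped by prefix; the component prefixes and per-component thresholds below are illustrative, not the paper's exact settings.

```python
import torch
from torch.nn.utils import clip_grad_norm_

def clip_componentwise(model, max_norms):
    """Clip the gradient norm separately for each named component.

    max_norms: dict mapping a parameter-name prefix (e.g. "bert.encoder",
    "classifier") to the maximum gradient norm allowed for that component.
    """
    for prefix, max_norm in max_norms.items():
        params = [p for name, p in model.named_parameters()
                  if name.startswith(prefix) and p.grad is not None]
        if params:
            clip_grad_norm_(params, max_norm)

# Illustrative use inside a training step:
#   loss.backward()
#   clip_componentwise(model, {"bert.encoder": 1.0, "classifier": 0.1})
#   optimizer.step(); optimizer.zero_grad()
```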
Noise Stability Regularization for Improving BERT Fine-tuning (NAACL 2021)
This work introduces a novel and effective regularization method for improving fine-tuning on NLP tasks, referred to as Layer-wise Noise Stability Regularization (LNSR), and proves that it yields a more stable regularization effect.
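A rough sketch of the noise-stability idea, assuming a HuggingFace-style model that accepts `inputs_embeds` and exposes per-layer `hidden_states`; the noise scale `sigma` and the per-layer mean-squared penalty are simplifying assumptions rather than the paper's exact formulation.

```python
import torch

def noise_stability_penalty(model, input_ids, attention_mask, sigma=0.01):
    """Perturb the input embeddings with Gaussian noise and penalize how far
    every layer's hidden states move, encouraging noise-stable representations."""
    embeds = model.get_input_embeddings()(input_ids)
    clean = model(inputs_embeds=embeds, attention_mask=attention_mask,
                  output_hidden_states=True).hidden_states
    noisy_embeds = embeds + sigma * torch.randn_like(embeds)
    noisy = model(inputs_embeds=noisy_embeds, attention_mask=attention_mask,
                  output_hidden_states=True).hidden_states
    # Sum the squared drift over all layers.
    return sum(torch.mean((c - n) ** 2) for c, n in zip(clean, noisy))

# Illustrative use: total_loss = task_loss + reg_weight * noise_stability_penalty(...)
```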
Revisiting Few-sample BERT Fine-tuning (ICLR 2021)
It is found that parts of the BERT network provide a detrimental starting point for fine-tuning, and that simply re-initializing these layers speeds up learning and improves performance.
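A minimal sketch of re-initializing the top encoder layers before fine-tuning, assuming HuggingFace Transformers with a BERT-style backbone; the choice of three layers is illustrative.

```python
from transformers import AutoModelForSequenceClassification

def reinit_top_layers(model, num_layers=3):
    """Re-initialize the top `num_layers` encoder layers (and the pooler)
    so they no longer carry pretraining-specific weights before fine-tuning."""
    encoder_layers = model.bert.encoder.layer          # assumes a BERT-style backbone
    for layer in encoder_layers[-num_layers:]:
        layer.apply(model._init_weights)               # reuse the model's own init scheme
    if model.bert.pooler is not None:
        model.bert.pooler.apply(model._init_weights)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
reinit_top_layers(model, num_layers=3)
```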
Better Fine-Tuning by Reducing Representational Collapse (ICLR 2021)
A simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance.
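A rough sketch of this parametric-noise idea, assuming a HuggingFace-style classification model; the uniform noise scale `eps`, the weight `lam`, and the symmetric-KL consistency term are a simplified reading of the objective, not a faithful reimplementation.

```python
import torch
import torch.nn.functional as F

def noise_consistency_loss(model, input_ids, attention_mask, labels, eps=1e-5, lam=1.0):
    """Task loss plus a symmetric-KL penalty between predictions on clean
    embeddings and on embeddings perturbed with small uniform noise."""
    embeds = model.get_input_embeddings()(input_ids)
    clean = model(inputs_embeds=embeds, attention_mask=attention_mask, labels=labels)
    noise = torch.empty_like(embeds).uniform_(-eps, eps)
    noisy = model(inputs_embeds=embeds + noise, attention_mask=attention_mask)
    p = F.log_softmax(clean.logits, dim=-1)
    q = F.log_softmax(noisy.logits, dim=-1)
    sym_kl = (F.kl_div(p, q, log_target=True, reduction="batchmean")
              + F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return clean.loss + lam * sym_kl
```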
Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution (ICLR 2022)
It is found that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large, and it is suggested that the simple two-step strategy of linear probing then full fine-tuning (LP-FT) combines the benefits of both fine-tuning and linear probing.
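A minimal sketch of the two-step LP-FT recipe, assuming a separate pretrained backbone `model` and linear classification `head`; `train_fn` is a hypothetical training-loop callback and the learning rates are illustrative.

```python
import torch

def lp_ft(model, head, train_fn, lp_lr=1e-3, ft_lr=1e-5):
    """Two-step LP-FT: first train only the linear head on frozen features,
    then unfreeze everything and fine-tune end-to-end at a smaller learning rate."""
    # Step 1: linear probing -- freeze the pretrained backbone.
    for p in model.parameters():
        p.requires_grad = False
    for p in head.parameters():
        p.requires_grad = True
    train_fn(torch.optim.AdamW(head.parameters(), lr=lp_lr))

    # Step 2: full fine-tuning, starting from the probed head.
    for p in model.parameters():
        p.requires_grad = True
    train_fn(torch.optim.AdamW(list(model.parameters()) + list(head.parameters()), lr=ft_lr))
```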
Parameter-Efficient Transfer Learning for NLP (ICML 2019)
To demonstrate adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, where adapters attain near state-of-the-art performance whilst adding only a few parameters per task.
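A minimal sketch of a bottleneck adapter module in PyTorch; the bottleneck size and near-identity initialization are illustrative assumptions, and in the paper's design two such modules are inserted into every Transformer layer while the backbone stays frozen.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
    Only these few parameters are trained per task."""
    def __init__(self, hidden_size, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()
        # Near-identity initialization so inserting the adapter barely changes
        # the pretrained network at the start of training.
        nn.init.normal_(self.down.weight, std=1e-3)
        nn.init.zeros_(self.down.bias)
        nn.init.normal_(self.up.weight, std=1e-3)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```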
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping (ArXiv 2020)
This work investigates how the performance of the best-found model varies as a function of the number of fine-tuning trials, and examines two factors influenced by the choice of random seed: weight initialization and training data order.
Muppet: Massive Multi-task Representations with Pre-Finetuning (EMNLP 2021)
It is shown that pre-finetuning consistently improves performance for pretrained discriminators and generation models on a wide range of tasks while also significantly improving sample efficiency during fine-tuning, and that large-scale multi-tasking is crucial.