A Stability Analysis of Fine-Tuning a Pre-Trained Model

@article{Fu2023ASA,
  title={A Stability Analysis of Fine-Tuning a Pre-Trained Model},
  author={Zihao Fu and Anthony Man-Cho So and Nigel Collier},
  journal={ArXiv},
  year={2023},
  volume={abs/2301.09820}
}
Fine-tuning a pre-trained model (such as BERT, ALBERT, RoBERTa, T5, GPT, etc.) has proven to be one of the most promising paradigms in recent NLP research. However, numerous recent works indicate that fine-tuning suffers from the instability problem, i.e., tuning the same model under the same setting results in significantly different performance. Many recent works have proposed different methods to solve this problem, but there is no theoretical understanding of why and how these methods work… 

On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines

This paper analyzes BERT, RoBERTa, and ALBERT, fine-tuned on three commonly used datasets from the GLUE benchmark and shows that the observed instability is caused by optimization difficulties that lead to vanishing gradients.

On the Effectiveness of Parameter-Efficient Fine-Tuning

A novel Second-order Approximation Method (SAM) is proposed that approximates the original problem with an analytically solvable optimization function; the tunable parameters are then determined by directly optimizing this approximation.
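
The summary does not spell out the approximation itself, so the following is only a generic, hypothetical illustration of choosing tunable parameters from a diagonal second-order (Fisher-style) importance score; the function names, scoring rule, and top-fraction threshold are assumptions, not the paper's SAM derivation.

```python
# Hypothetical sketch: rank parameters by a diagonal second-order (Fisher-style)
# importance proxy and keep only the top-scoring entries tunable.
# This is a generic illustration, NOT the paper's actual SAM derivation.
import torch

def select_tunable_mask(model, loss_fn, batches, top_fraction=0.05):
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.detach() ** 2  # squared gradients ~ diagonal Fisher
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(flat.numel() * top_fraction))
    threshold = torch.topk(flat, k).values.min()
    # Boolean masks marking which entries to fine-tune (multiply into p.grad each step).
    return {n: s >= threshold for n, s in scores.items()}
```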

Improving Stability of Fine-Tuning Pretrained Language Models via Component-Wise Gradient Norm Clipping

This paper proposes a simple component-wise gradient norm clipping method to adjust the convergence speed for different components, which achieves consistent improvements in terms of generalization performance, convergence speed, and training stability.
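
A minimal sketch of the component-wise idea in PyTorch, assuming the model's top-level children (embeddings, encoder layers, classifier head) are treated as the components; the grouping granularity and the threshold are illustrative choices, not the paper's settings.

```python
import torch
from torch.nn.utils import clip_grad_norm_

def clip_grads_per_component(model: torch.nn.Module, max_norm: float = 1.0):
    """Clip the gradient norm separately for each top-level component,
    instead of applying one global clip over all parameters."""
    for _, module in model.named_children():
        params = [p for p in module.parameters() if p.grad is not None]
        if params:
            clip_grad_norm_(params, max_norm)

# Illustrative training step:
#   loss.backward()
#   clip_grads_per_component(model, max_norm=1.0)
#   optimizer.step()
```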

Noise Stability Regularization for Improving BERT Fine-tuning

This work introduces a novel and effective regularization method to improve fine-tuning on NLP tasks, referred to as Layer-wise Noise Stability Regularization (LNSR), and proves that this method gives a more stable regularization effect.
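
Roughly, the layer-wise idea is to inject small noise into a layer's input and penalize how far its output moves; a minimal sketch under that assumption (the noise scale, the choice of layers, and the MSE penalty are illustrative placeholders, not the paper's exact LNSR objective):

```python
import torch
import torch.nn.functional as F

def noise_stability_penalty(layer_fn, hidden_states, sigma=1e-3):
    """Penalty on how much `layer_fn`'s output changes when Gaussian noise
    is injected into its input (layer-wise noise stability, illustrative)."""
    clean_out = layer_fn(hidden_states)
    noisy_out = layer_fn(hidden_states + sigma * torch.randn_like(hidden_states))
    return F.mse_loss(noisy_out, clean_out)

# total_loss = task_loss + reg_weight * noise_stability_penalty(layer_fn, h)
```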

Revisiting Few-sample BERT Fine-tuning

It is found that parts of the BERT network provide a detrimental starting point for fine-tuning, and simply re-initializing these layers speeds up learning and improves performance.
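
A minimal sketch of the re-initialization step, assuming a Hugging Face BertForSequenceClassification checkpoint; the number of re-initialized layers is a hyperparameter chosen here purely for illustration, and the sketch relies on the library's internal `_init_weights` hook.

```python
from transformers import BertForSequenceClassification

def reinit_top_layers(model, num_layers=2):
    """Re-initialize the top `num_layers` encoder layers (and the pooler)
    so fine-tuning starts those blocks from fresh random weights."""
    for layer in model.bert.encoder.layer[-num_layers:]:
        layer.apply(model._init_weights)   # HF's own weight-init routine (assumed available)
    model.bert.pooler.apply(model._init_weights)
    return model

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model = reinit_top_layers(model, num_layers=2)
```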

Better Fine-Tuning by Reducing Representational Collapse

A simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance.
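
In outline, the method samples noise, adds it to the input embeddings, and penalizes the divergence between clean and noisy predictions; the sketch below assumes a Hugging Face-style classification model and uses a symmetric KL penalty with an illustrative noise scale.

```python
import torch
import torch.nn.functional as F

def parametric_noise_penalty(model, inputs_embeds, attention_mask, eps=1e-5):
    """Symmetric KL between predictions on clean and noise-perturbed embeddings
    (parametric noise in place of an adversarial perturbation; illustrative)."""
    clean = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask).logits
    noise = torch.randn_like(inputs_embeds) * eps        # could also be uniform noise
    noisy = model(inputs_embeds=inputs_embeds + noise,
                  attention_mask=attention_mask).logits
    p = F.log_softmax(clean, dim=-1)
    q = F.log_softmax(noisy, dim=-1)
    return 0.5 * (F.kl_div(q, p, reduction="batchmean", log_target=True)
                  + F.kl_div(p, q, reduction="batchmean", log_target=True))

# total_loss = task_loss + reg_weight * parametric_noise_penalty(...)
```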

Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

It is found that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large, and that the simple two-step strategy of linear probing then full fine-tuning (LP-FT) combines the benefits of both fine-tuning and linear probing.
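
The two-step recipe itself is easy to sketch: freeze the backbone and train only the linear head, then unfreeze everything and fine-tune with a smaller learning rate. The `train_loop` helper, epoch counts, and learning rates below are hypothetical placeholders.

```python
def lp_then_ft(backbone, head, train_loop, lp_epochs=5, ft_epochs=3):
    """Linear probing then full fine-tuning (LP-FT), sketched generically.
    `train_loop(params, epochs, lr)` is assumed to run a standard training loop."""
    # Step 1: linear probing -- backbone frozen, only the head is trained.
    for p in backbone.parameters():
        p.requires_grad = False
    train_loop(list(head.parameters()), epochs=lp_epochs, lr=1e-3)

    # Step 2: full fine-tuning -- unfreeze everything, use a smaller learning rate.
    for p in backbone.parameters():
        p.requires_grad = True
    train_loop(list(backbone.parameters()) + list(head.parameters()),
               epochs=ft_epochs, lr=1e-5)
```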

Parameter-Efficient Transfer Learning for NLP

To demonstrate the adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, and adapters attain near state-of-the-art performance whilst adding only a few parameters per task.
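
The adapter module itself is compact: a bottleneck MLP with a residual connection inserted after each transformer sub-layer, with only the adapters (plus layer norms and the task head) trained. A minimal sketch, with illustrative dimensions:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    Only these weights are updated during fine-tuning."""
    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()
        # Near-zero up-projection so the adapter starts close to the identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```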

Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

This work investigates how the performance of the best-found model varies as a function of the number of fine-tuning trials, and examines two factors influenced by the choice of random seed: weight initialization and training data order.
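
The two random factors the study isolates can be controlled with separate seeds, one for weight initialization and one for data order; a sketch of how one might decouple them in PyTorch (function names, batch size, and seed handling are illustrative):

```python
import torch
from torch.utils.data import DataLoader

def make_trial(train_dataset, build_model, init_seed, data_seed, batch_size=32):
    """Decouple the two sources of run-to-run variance: weight initialization
    (init_seed) and training-data order (data_seed)."""
    torch.manual_seed(init_seed)                 # controls random weight init
    model = build_model()

    order_gen = torch.Generator().manual_seed(data_seed)   # controls shuffling
    loader = DataLoader(train_dataset, batch_size=batch_size,
                        shuffle=True, generator=order_gen)
    return model, loader
```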

Muppet: Massive Multi-task Representations with Pre-Finetuning

It is shown that pre-finetuning consistently improves performance for pretrained discriminators and generation models on a wide range of tasks while also significantly improving sample efficiency during fine-tuning, and that large-scale multi-tasking is crucial.
...