Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

@inproceedings{Wortsman2022ModelSA,
  title={Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time},
  author={Mitchell Wortsman and Gabriel Ilharco and Samir Yitzhak Gadre and Rebecca Roelofs and Raphael Gontijo-Lopes and Ari S. Morcos and Hongseok Namkoong and Ali Farhadi and Yair Carmon and Simon Kornblith and Ludwig Schmidt},
  booktitle={International Conference on Machine Learning},
  year={2022}
}
The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different… 
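
The recipe itself is mechanically simple: once several models have been fine-tuned from the same pre-trained initialization, their parameters are averaged element-wise. Below is a minimal sketch of such a uniform "soup" in PyTorch; the checkpoint paths are hypothetical, and the sketch assumes all checkpoints share one architecture so their state dicts have matching keys and shapes.

import torch

def uniform_soup(state_dicts):
    # Element-wise average of the parameters of several fine-tuned models.
    soup = {}
    for key in state_dicts[0]:
        # Stack the same tensor from every checkpoint and take the mean.
        # (Non-float buffers are handled naively here; this is only a sketch.)
        soup[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return soup

# Hypothetical checkpoints fine-tuned with different hyperparameters from one pre-trained model.
paths = ["finetune_run1.pt", "finetune_run2.pt", "finetune_run3.pt"]
soup_state = uniform_soup([torch.load(p, map_location="cpu") for p in paths])
# model.load_state_dict(soup_state)  # the soup costs no more at inference than a single model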

Parameter Averaging for Feature Ranking

This work introduces a novel method based on parameter averaging to estimate accurate and robust feature importance in the tabular data setting, referred to as XTab, and demonstrates that XTab can be used to obtain global feature importance that is not sensitive to sub-optimal model initialisation.

Meta-Ensemble Parameter Learning

WeightFormer is introduced, a Transformer-based model that predicts student network weights layer by layer in a forward pass from the teacher model parameters; unlike knowledge distillation, it can be straightforwardly extended to handle unseen teacher models, and with small-scale tuning it even exceeds the average ensemble.

Recycling diverse models for out-of-distribution generalization

Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: from a pre-trained foundation model, they fine-tune…

AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning

AdaMix is proposed as a general PEFT method that tunes a mixture of adaptation modules – given the underlying PEFT method of choice – introduced in each Transformer layer while keeping most of the PLM weights frozen, and it outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning on both NLU and NLG tasks.

Diverse Weight Averaging for Out-of-Distribution Generalization

Diverse Weight Averaging (DiWA) is proposed, which averages the weights obtained from several independent training runs rather than from a single run, and the need for diversity is highlighted by a new bias-variance-covariance-locality decomposition of the expected error.

AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models

This work introduces a new mechanism to improve adapter capacity without increasing parameters or computational cost via two key techniques, and demonstrates that these techniques work well across multiple task settings, including fully supervised and few-shot Natural Language Understanding tasks.

Dataless Knowledge Fusion by Merging Weights of Language Models

This paper proposes a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models and finds that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling.
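
For context, the Fisher-weighted averaging baseline mentioned in this entry can be written compactly: each parameter is averaged across models with per-coordinate weights given by an estimate of the diagonal Fisher information. The sketch below assumes those Fisher diagonals have already been estimated per model; it illustrates that baseline, not the paper's proposed dataless fusion method.

import torch

def fisher_weighted_average(state_dicts, fishers, eps=1e-8):
    # Merge models with per-parameter weights proportional to their Fisher diagonals.
    merged = {}
    for key in state_dicts[0]:
        weights = torch.stack([f[key].float() for f in fishers])       # (num_models, *param_shape)
        params = torch.stack([sd[key].float() for sd in state_dicts])  # (num_models, *param_shape)
        merged[key] = (weights * params).sum(dim=0) / (weights.sum(dim=0) + eps)
    return merged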

The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning

This work studies how properties of the data influence effective robustness, showing that it increases with the size, diversity, and example difficulty of the fine-tuning dataset.

Pre-train, fine-tune, interpolate: a three-stage strategy for domain generalization

The goal of domain generalization is to train models that generalize well to unseen domains; this work interpolates the featurizer with auxiliary featurizers trained on auxiliary datasets, which improves the performance of existing state-of-the-art models on the DomainBed benchmark.

Parameter-Efficient Tuning of Large Language Models

This work introduces a new mechanism to improve adapter capacity without increasing parameters or computational cost via two key techniques, and proposes a simple merging mechanism that averages the weights of multiple adapter components, collapsing them to a single adapter in each Transformer layer.
...

References


No One Representation to Rule Them All: Overlapping Features of Training Methods

A large-scale empirical study of models across hyper-parameters, architectures, frameworks, and datasets finds that model pairs that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors.

Hyperparameter Ensembles for Robustness and Uncertainty Quantification

This paper proposes hyper-deep ensembles, a simple procedure that involves a random search over different hyperparameters, themselves stratified across multiple random initializations, and proposes a parameter-efficient version, hyper-batch ensembles, which builds on the layer structure of batch ensembles and self-tuning networks.

Robust fine-tuning of zero-shot models

This work introduces a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT), providing large accuracy improvements under distribution shift while preserving high accuracy on the target distribution.
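
WiSE-FT amounts to a single linear interpolation in weight space between the zero-shot and fine-tuned checkpoints. A minimal sketch, with the mixing coefficient alpha as an illustrative hyperparameter: alpha=0 recovers the zero-shot model, alpha=1 the fine-tuned model, and intermediate values trade off robustness against target accuracy.

import torch

def wise_ft(zero_shot_sd, fine_tuned_sd, alpha=0.5):
    # Interpolate weights: (1 - alpha) * zero-shot + alpha * fine-tuned.
    return {k: (1 - alpha) * zero_shot_sd[k].float() + alpha * fine_tuned_sd[k].float()
            for k in zero_shot_sd}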

Deep Ensembles for Low-Data Transfer Learning

This work shows that the nature of pre-training itself is a performant source of diversity, and proposes a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset and achieves state-of-the-art performance at a much lower inference budget.

Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

This work investigates how the performance of the best-found model varies as a function of the number of fine-tuning trials, and examines two factors influenced by the choice of random seed: weight initialization and training data order.

Sharpness-Aware Minimization for Efficiently Improving Generalization

This work introduces a novel, effective procedure for simultaneously minimizing loss value and loss sharpness, Sharpness-Aware Minimization (SAM), which improves model generalization across a variety of benchmark datasets and models, yielding novel state-of-the-art performance for several.
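
Conceptually, each SAM step first perturbs the weights along the current gradient direction (scaled to a neighborhood radius rho) and then applies the base optimizer using the gradient computed at that perturbed point. The sketch below is a simplified version written against plain PyTorch; the rho value and the closure-style loss_fn are illustrative assumptions.

import torch

def sam_step(params, loss_fn, base_optimizer, rho=0.05):
    # One simplified SAM update; assumes gradients were zeroed before the call.
    loss_fn().backward()                                   # gradient at the current weights
    grads = [p.grad for p in params if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    eps = []
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)         # climb toward locally higher loss
            p.add_(e)
            eps.append(e)
    base_optimizer.zero_grad()
    loss_fn().backward()                                   # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in zip(params, eps):
            if e is not None:
                p.sub_(e)                                  # restore the original weights
    base_optimizer.step()                                  # sharpness-aware update
    base_optimizer.zero_grad()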

The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning

This work studies how properties of the data influence effective robustness, showing that it increases with the size, diversity, and example difficulty of the fine-tuning dataset.

Averaging Weights Leads to Wider Optima and Better Generalization

It is shown that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training, and Stochastic Weight Averaging (SWA) is extremely easy to implement, improves generalization, and has almost no computational overhead.
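
The SWA update is a running mean of the weights visited along the trajectory: after the k-th collected checkpoint, theta_swa <- (theta_swa * k + theta) / (k + 1). A minimal sketch of that bookkeeping follows (how often checkpoints are collected and the learning-rate schedule are left out); PyTorch also ships torch.optim.swa_utils.AveragedModel for the same purpose, and BatchNorm statistics typically need to be recomputed for the averaged weights.

import torch

class WeightAverager:
    # Running average of model weights collected along an SGD trajectory.
    def __init__(self):
        self.avg = None
        self.count = 0

    def update(self, state_dict):
        if self.avg is None:
            self.avg = {k: v.detach().clone().float() for k, v in state_dict.items()}
        else:
            for k, v in state_dict.items():
                # Incremental mean: avg <- (avg * count + v) / (count + 1)
                self.avg[k].mul_(self.count).add_(v.float()).div_(self.count + 1)
        self.count += 1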

Exploring the Limits of Weakly Supervised Pretraining

This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.

Explicit Inductive Bias for Transfer Learning with Convolutional Networks

This paper investigates several regularization schemes that explicitly promote the similarity of the final solution with the initial model, and ultimately recommends a simple $L^2$ penalty, with the pre-trained model as the reference, as the baseline penalty for transfer learning tasks.
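
The recommended penalty replaces standard weight decay, which shrinks weights toward zero, with decay toward the pre-trained weights. A minimal sketch of the added loss term; the coefficient name and value are illustrative assumptions.

import torch

def l2_sp_penalty(model, pretrained_state, coeff=1e-3):
    # L2 penalty pulling the current weights toward the pre-trained reference.
    penalty = 0.0
    for name, param in model.named_parameters():
        ref = pretrained_state[name].to(param.device)
        penalty = penalty + ((param - ref) ** 2).sum()
    return coeff * penalty

# total_loss = task_loss + l2_sp_penalty(model, pretrained_state)
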
...