Corpus ID: 247011290

Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

@article{Kumar2022FineTuningCD,
  title={Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution},
  author={Ananya Kumar and Aditi Raghunathan and Robbie Jones and Tengyu Ma and Percy Liang},
  journal={ArXiv},
  year={2022},
  volume={abs/2202.10054}
}
When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer—the “head”). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10… 
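
As a concrete illustration of the two transfer methods being compared, the sketch below sets up linear probing versus full fine-tuning in PyTorch. The torchvision ResNet-50 backbone, the head name "fc", and the hyperparameters are illustrative assumptions, not the paper's exact experimental setup.

import torch
import torch.nn as nn
from torchvision import models

def make_model(num_classes: int) -> nn.Module:
    # A pretrained backbone with a freshly initialized linear head.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def linear_probing_params(model: nn.Module):
    # Linear probing: freeze the backbone, update only the final linear head.
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("fc.")
    return [p for p in model.parameters() if p.requires_grad]

def full_fine_tuning_params(model: nn.Module):
    # Full fine-tuning: every parameter (backbone and head) is updated.
    for p in model.parameters():
        p.requires_grad = True
    return list(model.parameters())

model = make_model(num_classes=10)
optimizer = torch.optim.SGD(linear_probing_params(model), lr=1e-2, momentum=0.9)
# Swap in full_fine_tuning_params(model) (usually with a smaller lr) to fine-tune everything.

The paper's point is that when the pretrained features are already good and the distribution shift is large, updating the backbone can distort those features, so the frozen-backbone variant can generalize better out-of-distribution even though full fine-tuning wins in-distribution.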

Figures and Tables from this paper

Citations

Test-Time Robust Personalization for Federated Learning
TLDR
This work identifies the pitfalls of existing methods under test-time distribution shifts and proposes a novel test-time robust personalization method, Federated Test-time Head Ensemble plus tuning (FedTHE+), demonstrating its advantage over strong competitors.
Exploring the Design of Adaptation Protocols for Improved Generalization and Machine Learning Safety
TLDR
It is hypothesized and empirically observed that appropriate pairing of data augmentation and adaptation protocol can substantially mitigate this trade-off, and that using hardness-promoting augmentations during LP and then FT with augmentations may be particularly effective for trade-off mitigation.
Two-Stage Fine-Tuning: A Novel Strategy for Learning Class-Imbalanced Data
TLDR
A two-stage fine-tuning is proposed: first, the final layer of the pretrained model is fine-tuned with a class-balanced reweighting loss, which allows the model to learn an initial representation of the specific task, and then standard fine-tuning is performed.
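
A minimal sketch of this two-stage recipe, under assumptions: the class-balanced reweighting loss is instantiated here as cross-entropy weighted by inverse class frequency, the head is assumed to be named "fc", and the helper names, learning rates, and epoch counts are placeholders rather than the paper's settings.

import torch
import torch.nn as nn

def class_balanced_ce(class_counts: torch.Tensor) -> nn.Module:
    # Cross-entropy weighted by inverse class frequency (one common reweighting).
    weights = class_counts.sum() / (len(class_counts) * class_counts.float())
    return nn.CrossEntropyLoss(weight=weights)

def run_epochs(model, loader, criterion, params, lr, epochs):
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()

def two_stage_fine_tune(model, loader, class_counts, head_prefix="fc."):
    # Stage 1: update only the final layer with the reweighted loss.
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(head_prefix)
    head_params = [p for p in model.parameters() if p.requires_grad]
    run_epochs(model, loader, class_balanced_ce(class_counts), head_params, lr=1e-2, epochs=5)
    # Stage 2: unfreeze everything and perform standard fine-tuning.
    for p in model.parameters():
        p.requires_grad = True
    run_epochs(model, loader, nn.CrossEntropyLoss(), model.parameters(), lr=1e-3, epochs=10)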
Diverse Weight Averaging for Out-of-Distribution Generalization
TLDR
Diverse Weight Averaging (DiWA) is proposed, which averages the weights obtained from several independent training runs rather than from a single run, and the need for diversity is highlighted via a new bias-variance-covariance-locality decomposition of the expected error.
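
A minimal sketch of the averaging step DiWA performs, assuming the runs share an architecture so their state dicts line up key by key; the helper name and the floating-point filter (to skip integer buffers such as BatchNorm counters) are my choices.

import copy
import torch
import torch.nn as nn

def average_weights(models: list[nn.Module]) -> nn.Module:
    # Average the parameters (and floating-point buffers) of several
    # independently trained copies of the same architecture.
    averaged = copy.deepcopy(models[0])
    avg_state = averaged.state_dict()
    states = [m.state_dict() for m in models]
    for key in avg_state:
        if avg_state[key].is_floating_point():
            avg_state[key] = torch.stack([s[key].float() for s in states]).mean(dim=0)
    averaged.load_state_dict(avg_state)
    return averaged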
Contrastive Adapters for Foundation Model Group Robustness
TLDR
Contrastive adapting is proposed, which trains adapters with contrastive learning to bring sample embeddings close to both their ground-truth class embeddings and other sample embeddings in the same class, improving group robustness.
Revisiting the Updates of a Pre-trained Model for Few-shot Learning
TLDR
It is demonstrated that careful consideration of the details of updating pre-trained models is required for better few-shot performance, and that fine-tuning is better than linear probing as the number of samples increases, regardless of distribution shift.
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
TLDR
This work uses only 3-5 images of a user-provided concept to represent it through new “words” in the embedding space of a frozen text-to-image model; these words can be composed into natural language sentences, guiding personalized creation in an intuitive way.
Discrete Key-Value Bottleneck
TLDR
A model architecture is proposed, building upon a discrete bottleneck containing pairs of separate and learnable (key, value) codes, that reduces the complexity of the hypothesis class and reduces the common vulnerability to non-i.i.d. and non-stationary training distributions.
Test-Time Adaptation via Self-Training with Nearest Neighbor Information
TLDR
This work proposes a novel test-time adaptation method, Test-time Adaptation via Self-Training with nearest neighbor information (TAST), based on the idea that a test sample and its nearest neighbors in the embedding space of the trained classifier are more likely to have the same label.
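
A simplified sketch of the nearest-neighbor ingredient described above: each test embedding is pseudo-labeled by a majority vote over its nearest neighbors in a labeled support set of embeddings. The cosine-similarity choice, the value of k, and the function name are assumptions; the full method wraps this in a self-training loop.

import torch
import torch.nn.functional as F

def nn_pseudo_labels(test_emb: torch.Tensor, support_emb: torch.Tensor,
                     support_labels: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Cosine similarity between each test embedding and every support embedding.
    sims = F.normalize(test_emb, dim=-1) @ F.normalize(support_emb, dim=-1).t()
    knn_idx = sims.topk(k, dim=-1).indices        # k nearest neighbors per test point
    knn_labels = support_labels[knn_idx]          # shape (num_test, k)
    return knn_labels.mode(dim=-1).values         # majority vote -> pseudo-label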
Calibrated ensembles can mitigate accuracy tradeoffs under distribution shift
TLDR
It is shown that ID-calibrated ensembles, where the standard and robust models are simply ensembled after calibrating on only ID data, outperform the prior state-of-the-art (based on self-training) on both ID and OOD accuracy.
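
A minimal sketch of the idea, under assumptions: "calibrating on only ID data" is instantiated as per-model temperature scaling fit on held-out ID logits, and the ensemble simply averages the two calibrated probability vectors; the function names are mine.

import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> torch.Tensor:
    # Learn a single temperature T > 0 minimizing NLL on ID validation data.
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)
    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss
    opt.step(closure)
    return log_t.exp().detach()

def calibrated_ensemble(logits_std, logits_robust, temp_std, temp_robust):
    probs_std = F.softmax(logits_std / temp_std, dim=-1)
    probs_robust = F.softmax(logits_robust / temp_robust, dim=-1)
    return (probs_std + probs_robust) / 2   # average the calibrated predictions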

References

SHOWING 1-10 OF 87 REFERENCES
Smallest singular value of a random rectangular matrix
We prove an optimal estimate of the smallest singular value of a random sub-Gaussian matrix, valid for all dimensions. For an N × n matrix A with independent and identically distributed sub-Gaussian entries …
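
For context, an estimate of this kind is usually stated in roughly the following form, where $C, c > 0$ depend only on the sub-Gaussian moment of the entries; the exact constants and tail should be checked against the reference:

$$\mathbb{P}\Big(s_{\min}(A) \le \varepsilon\big(\sqrt{N} - \sqrt{n-1}\big)\Big) \le (C\varepsilon)^{N-n+1} + e^{-cN} \qquad \text{for all } \varepsilon \ge 0,\ N \ge n.$$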
BREEDS: Benchmarks for Subpopulation Shift. arXiv, 2020
A Theory of Label Propagation for Subpopulation Shift
TLDR
This work proposes a provably effective framework for domain adaptation via label propagation under a simple but realistic expansion assumption, and adapts consistency-based semi-supervised learning methods to domain adaptation settings, gaining significant improvements.
Learning Transferable Visual Models From Natural Language Supervision
TLDR
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
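
A minimal sketch of that pretraining task: a symmetric cross-entropy over the image-text similarity matrix, assuming precomputed, L2-normalized image and text embeddings where row i of each batch comes from the same (image, caption) pair. This is an illustrative rendering of the idea, not the paper's implementation.

import torch
import torch.nn.functional as F

def caption_matching_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    logits = image_emb @ text_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(image_emb.size(0))         # matching pairs lie on the diagonal
    loss_i = F.cross_entropy(logits, targets)         # image -> correct caption
    loss_t = F.cross_entropy(logits.t(), targets)     # caption -> correct image
    return (loss_i + loss_t) / 2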
Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks
TLDR
Side-tuning adapts a pre-trained network by training a lightweight "side" network that is fused with the (unchanged) pre-trained network via summation, which is less prone to overfitting, is asymptotically consistent, and does not suffer from catastrophic forgetting in incremental learning.
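
A minimal sketch of the additive fusion described above: the pretrained base network is frozen and a small side network's output is added to it, blended by a learnable gate. The sigmoid gate and the assumption that both networks produce same-shaped outputs are illustrative choices, not the paper's exact design.

import torch
import torch.nn as nn

class SideTuned(nn.Module):
    def __init__(self, backbone: nn.Module, side: nn.Module):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False              # base network stays unchanged
        self.side = side                         # lightweight, trainable side network
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable blending gate

    def forward(self, x):
        with torch.no_grad():
            base = self.backbone(x)
        a = torch.sigmoid(self.alpha)
        return a * base + (1 - a) * self.side(x)   # fuse via (gated) summation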
Improved Baselines with Momentum Contrastive Learning
TLDR
With simple modifications to MoCo, this note establishes stronger baselines that outperform SimCLR and do not require large training batches, and hopes this will make state-of-the-art unsupervised learning research more accessible.
A Simple Framework for Contrastive Learning of Visual Representations
TLDR
It is shown that the composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
Do ImageNet Classifiers Generalize to ImageNet?
TLDR
The results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
Moment Matching for Multi-Source Domain Adaptation
TLDR
A new deep learning approach, Moment Matching for Multi-Source Domain Adaptation (M3SDA), is proposed, which aims to transfer knowledge learned from multiple labeled source domains to an unlabeled target domain by dynamically aligning moments of their feature distributions.
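
A simplified sketch of the moment-alignment ingredient, reduced to two domains and the first two (uncentered) moments of the features; the full method matches moments pairwise across multiple labeled sources and the unlabeled target. The function name and the squared penalty are assumptions.

import torch

def moment_matching_loss(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    # feats_*: (batch, dim) feature batches drawn from two domains.
    mean_gap = (feats_a.mean(dim=0) - feats_b.mean(dim=0)).pow(2).sum()
    second_a = feats_a.t() @ feats_a / feats_a.size(0)
    second_b = feats_b.t() @ feats_b / feats_b.size(0)
    return mean_gap + (second_a - second_b).pow(2).sum()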
Do Better ImageNet Models Transfer Better?
TLDR
It is found that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy, and ImageNet features are less general than previously suggested.