Exploring the Limits of Large Scale Pre-training
@article{Abnar2021ExploringTL,
  title   = {Exploring the Limits of Large Scale Pre-training},
  author  = {Samira Abnar and Mostafa Dehghani and Behnam Neyshabur and Hanie Sedghi},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2110.02095}
}
Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with…
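The saturation described in the abstract can be pictured by fitting downstream accuracy as a saturating function of upstream accuracy. The sketch below is a minimal illustration under assumed data and an assumed functional form (a power law in upstream error plus an irreducible-error floor); it is not the paper's exact fit, and all numbers are made up.

```python
# Minimal illustration of downstream-vs-upstream saturation (not the paper's exact fit).
# Downstream error is modeled as  e_ds = k * e_us**alpha + e_irr,
# so downstream accuracy plateaus at 1 - e_irr even as upstream accuracy keeps rising.
import numpy as np
from scipy.optimize import curve_fit

def downstream_error(e_us, k, alpha, e_irr):
    """Saturating power law in upstream error with an irreducible-error floor."""
    return k * e_us**alpha + e_irr

# Hypothetical (upstream accuracy, downstream accuracy) pairs from many pre-training runs.
us_acc = np.array([0.60, 0.70, 0.78, 0.84, 0.88, 0.91, 0.93])
ds_acc = np.array([0.55, 0.63, 0.69, 0.72, 0.74, 0.75, 0.755])

params, _ = curve_fit(downstream_error, 1 - us_acc, 1 - ds_acc, p0=[1.0, 1.0, 0.1],
                      bounds=([0, 0, 0], [np.inf, np.inf, 1]))
k, alpha, e_irr = params
print(f"fitted k={k:.2f}, alpha={alpha:.2f}, irreducible error={e_irr:.3f}")
print("predicted downstream accuracy ceiling:", 1 - e_irr)
```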
Figures and Tables from this paper
Figures 1–42 and Tables 2–6 (not reproduced here).
47 Citations
Pre-train, fine-tune, interpolate: a three-stage strategy for domain generalization
- Computer Science
- 2022
Proposes a three-stage strategy for domain generalization that interpolates the featurizer with auxiliary featurizers trained on auxiliary datasets, improving the performance of existing state-of-the-art models on the DomainBed benchmark.
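The interpolation step amounts to a convex combination of model parameters. Below is a minimal PyTorch sketch under the assumption that the featurizers share an architecture; the function name and toy modules are illustrative, not this paper's code.

```python
# Minimal sketch of weight-space interpolation between a main and an auxiliary
# featurizer with identical architecture (illustrative only).
import torch
from torch import nn

def interpolate_state_dicts(main_sd, aux_sd, lam=0.5):
    """Return (1 - lam) * main + lam * aux for floating-point entries; keep others as-is."""
    return {
        name: (1.0 - lam) * t + lam * aux_sd[name] if t.is_floating_point() else t
        for name, t in main_sd.items()
    }

# Toy usage: two featurizers standing in for the main and auxiliary models.
main, aux, merged = nn.Linear(16, 8), nn.Linear(16, 8), nn.Linear(16, 8)
merged.load_state_dict(interpolate_state_dicts(main.state_dict(), aux.state_dict(), lam=0.3))
```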
Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging
- Computer Science, ArXiv
- 2022
This work adapts Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to pre-trained language models (PLMs) and demonstrates that this simple optimization technique outperforms state-of-the-art knowledge distillation (KD) methods for compact models.
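SWA itself is available in PyTorch's built-in utilities. The sketch below shows the generic averaging loop with a placeholder model and toy data; it is not this paper's recipe for PLMs.

```python
# Generic SWA loop using torch.optim.swa_utils (placeholder model and toy data;
# not the paper's exact recipe for pre-trained language models).
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
loader = [(torch.randn(8, 32), torch.randint(0, 2, (8,))) for _ in range(20)]  # toy batches
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

swa_model = AveragedModel(model)               # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=5e-3)  # learning rate used during averaging
swa_start = 5                                  # begin averaging after a few epochs

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # fold current weights into the average
        swa_scheduler.step()

update_bn(loader, swa_model)  # recompute BatchNorm statistics for the averaged weights
```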
How well do contrastively trained models transfer?
- Computer Science
- 2022
This work observes that different pre-training methods with the same training source transfer similarly given their ImageNet accuracy, and shows that 1-NN can be used to select the best pre-training method without actual fine-tuning.
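That selection criterion can be approximated with off-the-shelf tools: extract frozen features from each candidate pre-trained model on the target training set and rank models by 1-NN accuracy on a held-out split. A minimal scikit-learn sketch follows; feature extraction is stubbed with random arrays and the model names are invented for illustration.

```python
# Minimal sketch: rank candidate pre-trained models by 1-NN accuracy on frozen features
# (features are stubbed with random arrays; model names are illustrative).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)

# Stand-in for features extracted by each frozen candidate model on the target dataset.
candidate_features = {
    "supervised_vit": rng.normal(size=(1000, 256)),
    "contrastive_resnet": rng.normal(size=(1000, 256)),
}

scores = {}
for name, feats in candidate_features.items():
    x_tr, x_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.2, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=1).fit(x_tr, y_tr)
    scores[name] = knn.score(x_te, y_te)

best = max(scores, key=scores.get)
print(scores, "-> fine-tune only:", best)
```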
No One Representation to Rule Them All: Overlapping Features of Training Methods
- Computer Science, ArXiv
- 2021
A large-scale empirical study of models across hyper-parameters, architectures, frameworks, and datasets finds that model pairs that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors.
Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments
- Computer Science, ArXiv
- 2022
It is found that scaling laws emerge at fine-tuning time in some NLP tasks, and that they can also be exploited for debugging convergence when training large models.
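The fit-and-extrapolate workflow such studies rely on can be sketched in a few lines: fit a saturating power law on small-scale runs and evaluate it at a larger scale. The numbers and the functional form below are illustrative assumptions, not this paper's data or exact law.

```python
# Toy sketch of extrapolating loss from small-scale runs with a saturating power law
# L(N) = a * N**(-b) + c  (illustrative data and functional form).
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n**(-b) + c

# Hypothetical (parameter count, eval loss) pairs from small models.
n_params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([4.10, 3.75, 3.42, 3.18, 2.99])

(a, b, c), _ = curve_fit(power_law, n_params, losses, p0=[10.0, 0.1, 2.0], maxfev=10000)
print(f"fit: a={a:.2f}, b={b:.3f}, c={c:.2f}")
print("extrapolated loss at 1B params:", power_law(1e9, a, b, c))
```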
Revisiting Neural Scaling Laws in Language and Vision
- Computer Science, ArXiv
- 2022
A recipe for estimating scaling law parameters reliably from learning curves is presented and demonstrated to extrapolate more accurately than previous methods across a wide range of architecture families and several domains, including image classification, neural machine translation (NMT), and language modeling.
Why Do Better Loss Functions Lead to Less Transferable Features?
- Computer Science, NeurIPS
- 2021
It is shown that many objectives lead to statistically significant improvements in ImageNet accuracy over vanilla softmax cross-entropy, but the resulting fixed feature extractors transfer substantially worse to downstream tasks, and the choice of loss has little effect when networks are fully fine-tuned on the new tasks.
Broken Neural Scaling Laws
- Computer Science, ArXiv
- 2022
A smoothly broken power law functional form is proposed that accurately models and extrapolates the scaling behaviors of deep neural networks for each task within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings.
Revisiting Weakly Supervised Pre-Training of Visual Perception Models
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This paper revisits weakly supervised pre-training with hashtag supervision, using modern residual networks and the largest-ever dataset of images and corresponding hashtags, and provides a compelling argument for weakly supervised learning in the development of visual recognition systems.
Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing
- Computer Science, ArXiv
- 2022
Limits of current techniques for effectively leveraging model scale for compositional generalization in semantic parsing evaluations are highlighted, while the analysis also suggests promising directions for future work.
References
Showing 1–10 of 40 references
Exploring the Limits of Weakly Supervised Pretraining
- Computer Science, ECCV
- 2018
This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
- Computer Science, 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
It is found that performance on vision tasks increases logarithmically with the volume of training data, and it is shown that representation learning (or pre-training) still holds a lot of promise.
Scaling Laws for Transfer
- Computer Science, ArXiv
- 2021
The effective data “transferred” from pre-training is calculated by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch, and is found to be well described by a power law of parameter count and dataset size.
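The power law referred to in that summary can be written out schematically; the constants are fit empirically, and this is a sketch of the form described rather than a quoted equation.

```latex
% Sketch of the effective-data-transferred power law described above
% (k, \alpha, \beta fit empirically):
D_T \;\approx\; k \, D_F^{\alpha} \, N^{\beta}
% D_T: effective data transferred, D_F: fine-tuning dataset size, N: parameter count
```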
Accuracy on the Line: on the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization
- Computer Science, ICML
- 2021
This paper empirically shows that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts, and provides a candidate theory based on a Gaussian data model that shows how changes in the data covariance arising from distribution shift can affect the observed correlations.
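The reported linear relationship is typically measured after rescaling accuracies; the sketch below fits a line on probit-transformed accuracies with made-up numbers. The probit axis scaling is a common choice in this literature, shown here as an assumption rather than a quote of this paper's exact procedure.

```python
# Toy sketch: measure the ID-vs-OOD accuracy correlation on a probit scale
# (numbers are made up; probit scaling is an assumed, common choice).
import numpy as np
from scipy.stats import norm, linregress

in_dist_acc = np.array([0.72, 0.80, 0.86, 0.90, 0.93, 0.95])
out_dist_acc = np.array([0.48, 0.57, 0.65, 0.71, 0.76, 0.80])

probit = norm.ppf  # inverse of the standard normal CDF
fit = linregress(probit(in_dist_acc), probit(out_dist_acc))
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, r={fit.rvalue:.3f}")

# Predict OOD accuracy for a new model from its ID accuracy:
new_id = 0.97
pred_ood = norm.cdf(fit.slope * probit(new_id) + fit.intercept)
print("predicted OOD accuracy:", round(float(pred_ood), 3))
```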
Deep Learning Through the Lens of Example Difficulty
- Computer Science, NeurIPS
- 2021
A measure of the computational difficulty of making a prediction for a given input, the (effective) prediction depth, is introduced, and surprising yet simple relationships between the prediction depth of a given input and the model's uncertainty, confidence, accuracy and speed of learning for that data point are revealed.
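A simplified version of the prediction-depth idea can be sketched with k-NN probes on per-layer features: an example's depth is the earliest layer from which the probe's prediction agrees with the deepest layer's prediction at every subsequent layer. The sketch below uses synthetic features and is an illustrative reconstruction, not the paper's implementation.

```python
# Simplified "prediction depth" sketch: fit a k-NN probe on each layer's features and
# record, per example, the earliest layer after which the probe agrees with the deepest
# layer's prediction from then on. Synthetic features; illustrative reconstruction only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_support, n_query, n_layers = 500, 100, 6
support_labels = rng.integers(0, 5, size=n_support)

# Stand-ins for features of the support and query sets at each layer.
support_feats = [rng.normal(size=(n_support, 64)) for _ in range(n_layers)]
query_feats = [rng.normal(size=(n_query, 64)) for _ in range(n_layers)]

# k-NN probe prediction at every layer, shape (n_layers, n_query).
probe_preds = np.stack([
    KNeighborsClassifier(n_neighbors=30).fit(s, support_labels).predict(q)
    for s, q in zip(support_feats, query_feats)
])

final = probe_preds[-1]          # proxy for the model's final prediction
agree = probe_preds == final     # (n_layers, n_query) agreement mask
depth = np.full(n_query, n_layers)
for i in range(n_query):
    for l in range(n_layers):
        if agree[l:, i].all():   # agrees at layer l and all deeper layers
            depth[i] = l
            break
print("mean prediction depth:", depth.mean())
```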
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
- Computer Science, ICML
- 2019
A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient, and its effectiveness is demonstrated by scaling up MobileNets and ResNet.
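The compound coefficient scales depth, width, and resolution jointly: depth by α^φ, width by β^φ, resolution by γ^φ, under the constraint α·β²·γ² ≈ 2, with α≈1.2, β≈1.1, γ≈1.15 found by grid search in the EfficientNet paper. A short worked example applied to an assumed, illustrative baseline:

```python
# Worked example of compound scaling: depth, width, and resolution scaled jointly
# by alpha**phi, beta**phi, gamma**phi with alpha * beta**2 * gamma**2 ~= 2.
alpha, beta, gamma = 1.2, 1.1, 1.15              # coefficients reported for EfficientNet
base_depth, base_width, base_res = 16, 1.0, 224  # illustrative baseline dimensions

def compound_scale(phi):
    depth = round(base_depth * alpha**phi)
    width = base_width * beta**phi
    resolution = round(base_res * gamma**phi)
    return depth, width, resolution

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth={d} blocks, width x{w:.2f}, resolution {r}px")
print("constraint alpha*beta^2*gamma^2 =", round(alpha * beta**2 * gamma**2, 3))
```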
Language Models are Few-Shot Learners
- Computer Science, NeurIPS
- 2020
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Do CIFAR-10 Classifiers Generalize to CIFAR-10?
- Computer Science, ArXiv
- 2018
This work measures the accuracy of CIFAR-10 classifiers by creating a new test set of truly unseen images and finds a large drop in accuracy for a broad range of deep learning models.
Big Transfer (BiT): General Visual Representation Learning
- Computer Science, ECCV
- 2020
By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.
Predicting the Generalization Gap in Deep Networks with Margin Distributions
- Computer Science, ICLR
- 2019
This paper proposes a measure based on the margin distribution, i.e., the distances of training points to the decision boundary, and finds that it is necessary to use margin distributions at multiple layers of a deep network.
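The margin statistics themselves are cheap to compute. The sketch below only summarizes unnormalized output-layer margins on toy logits; the paper uses normalized margins at several layers and regresses the distribution statistics against the generalization gap across many trained models, so this is a deliberately simplified illustration.

```python
# Simplified sketch of margin-distribution statistics at the output layer
# (toy logits; the paper's measure uses normalized margins at multiple layers).
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))           # stand-in for a model's training-set logits
labels = rng.integers(0, 10, size=1000)

true_logit = logits[np.arange(len(labels)), labels]
masked = logits.copy()
masked[np.arange(len(labels)), labels] = -np.inf
max_other = masked.max(axis=1)

margins = true_logit - max_other               # proxy for distance to the decision boundary
# Distribution statistics (quantiles) that would be regressed against the
# generalization gap across a collection of trained models.
stats = np.quantile(margins, [0.25, 0.5, 0.75])
print("margin quantiles:", np.round(stats, 3))
```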