• Corpus ID: 238354065

Exploring the Limits of Large Scale Pre-training

Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi
Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with…

Pre-train, fine-tune, interpolate: a three-stage strategy for domain generalization

The goal of domain generalization is to train models that generalize well to unseen domains. This work interpolates the featurizer with auxiliary featurizers trained on auxiliary datasets, which improves the performance of existing state-of-the-art models on the DomainBed benchmark.

Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging

This work adapts Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to PLMs and demonstrates that this simple optimization technique is able to outperform the state-of-the-art KD methods for compact models.

How well do contrastively trained models transfer?

This work observes that different pre-training methods with the same training source transfer similarly given their ImageNet accuracy, and shows that 1-NN can be used to select the best pre-training method without actual fine-tuning.

No One Representation to Rule Them All: Overlapping Features of Training Methods

A large-scale empirical study of models across hyper-parameters, architectures, frameworks, and datasets finds that model pairs that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors.

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

It is found that scaling laws emerge at fine-tuning time in some NLP tasks, and that they can also be exploited for debugging convergence when training large models.

Revisiting Neural Scaling Laws in Language and Vision

A recipe for estimating scaling law parameters reliably from learning curves is presented, and it is demonstrated that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT), and language modeling.

Why Do Better Loss Functions Lead to Less Transferable Features?

It is shown that many objectives lead to statistically significant improvements in ImageNet accuracy over vanilla softmax cross-entropy, but the resulting fixed feature extractors transfer substantially worse to downstream tasks, and the choice of loss has little effect when networks are fully fine-tuned on the new tasks.

Broken Neural Scaling Laws

A smoothly broken power law functional form is presented that accurately models and extrapolates the scaling behaviors of deep neural networks for each task within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings.

Revisiting Weakly Supervised Pre-Training of Visual Perception Models

This paper revisits weakly-supervised pre-training of models using hashtag supervision with modern versions of residual networks and the largest-ever dataset of images and corresponding hashtags to provide a compelling argument for the use of weakly supervised learning in the development of visual recognition systems.

Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing

Limits of current techniques for effectively leveraging model scale for compositional generalization in semantic parsing evaluations are highlighted, while the analysis also suggests promising directions for future work.



Exploring the Limits of Weakly Supervised Pretraining

This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.

Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

It is found that performance on vision tasks increases logarithmically with the volume of training data, and it is shown that representation learning (or pre-training) still holds a lot of promise.

Scaling Laws for Transfer

The effective data “transferred” from pre-training is calculated by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch, and it is found to follow a power law in parameter count and dataset size.

Accuracy on the Line: on the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization

This paper empirically shows that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts, and provides a candidate theory based on a Gaussian data model that shows how changes in the data covariance arising from distribution shift can affect the observed correlations.

Deep Learning Through the Lens of Example Difficulty

A measure of the computational difficulty of making a prediction for a given input, the (effective) prediction depth, is introduced, and surprising yet simple relationships between the prediction depth of a given input and the model’s uncertainty, confidence, accuracy and speed of learning for that data point are revealed.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient, and the effectiveness of this method is demonstrated on scaling up MobileNets and ResNet.

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

Do CIFAR-10 Classifiers Generalize to CIFAR-10?

This work measures the accuracy of CIFAR-10 classifiers by creating a new test set of truly unseen images and finds a large drop in accuracy for a broad range of deep learning models.

Big Transfer (BiT): General Visual Representation Learning

By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.

Predicting the Generalization Gap in Deep Networks with Margin Distributions

This paper proposes a measure based on the concept of margin distribution, that is, the distances of training points to the decision boundary, and finds that it is necessary to use margin distributions at multiple layers of a deep network.