Corpus ID: 221971208

Scalable Transfer Learning with Expert Models

@article{Puigcerver2021ScalableTL,
  title={Scalable Transfer Learning with Expert Models},
  author={J. Puigcerver and Carlos Riquelme and Basil Mustafa and C{\'e}dric Renggli and Andr{\'e} Susano Pinto and S. Gelly and Daniel Keysers and N. Houlsby},
  journal={ArXiv},
  year={2020},
  volume={abs/2009.13239}
}
Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant…
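For intuition only, here is a minimal sketch of the kind of cheap performance proxy the abstract refers to: ranking candidate expert backbones by a k-nearest-neighbour accuracy computed on their frozen features for the target task. The kNN choice, the helper names, and the scikit-learn setup are illustrative assumptions, not the paper's exact selection procedure.

```python
# Hypothetical sketch: rank candidate expert backbones by a cheap kNN proxy
# computed on frozen features of the target task (an illustration of the
# general idea, not the paper's exact selection procedure).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def knn_proxy_score(features: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Mean cross-validated kNN accuracy on frozen features."""
    return cross_val_score(KNeighborsClassifier(n_neighbors=k),
                           features, labels, cv=3).mean()

def select_expert(experts: dict, images: np.ndarray, labels: np.ndarray) -> str:
    """Pick the expert whose frozen features score highest under the proxy;
    only the selected expert would then be fine-tuned downstream."""
    scores = {name: knn_proxy_score(extract(images), labels)
              for name, extract in experts.items()}
    return max(scores, key=scores.get)

# Toy usage with stand-in "feature extractors".
rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 16)), rng.integers(0, 3, size=60)
fake_experts = {"expert_a": lambda a: a, "expert_b": lambda a: a ** 2}
print(select_expert(fake_experts, X, y))
```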
Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types
TLDR: This paper carries out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection).
Deep Ensembles for Low-Data Transfer Learning
TLDR: This work shows that the nature of pre-training itself is a performant source of diversity, and proposes a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset and achieves state-of-the-art performance at a much lower inference budget.
Representation Consolidation for Training Expert Students
TLDR: It is shown that a multi-head, multi-task distillation method using an unlabeled proxy dataset and a generalist teacher is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance, outperforming the teacher and the strong baseline of ImageNet pretrained features.
Which Model to Transfer? Finding the Needle in the Growing Haystack
TLDR: This work conducts a large-scale empirical study and shows that both task-agnostic and task-aware methods can yield high regret, and proposes a simple and computationally efficient hybrid search strategy which outperforms the existing approaches.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
TLDR: This work simplifies the MoE routing algorithm, designs intuitive improved models with reduced communication and computational costs, and shows that large sparse models may be trained, for the first time, with lower precision formats.
Self-Supervised Pretraining Improves Self-Supervised Pretraining
TLDR: Hierarchical PreTraining (HPT) is explored, which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model, and provides a simple framework for obtaining better pretrained representations with less computational resources.
ImageNet-21K Pretraining for the Masses
  • 2021
SelfAugment: Automatic Augmentation Policies for Self-Supervised Learning.
TLDR: This work shows that evaluating the learned representations with a self-supervised image rotation task is highly correlated with a standard set of supervised evaluations, and provides an algorithm (SelfAugment) to automatically and efficiently select augmentation policies without using supervised evaluations.
Evaluating Self-Supervised Pretraining Without Using Labels
TLDR: This work explores the idea of using unsupervised evaluation criteria to help both researchers and practitioners make decisions when training without labeled data, and establishes this correlation across hundreds of augmentation policies and training schedules.
Sequential Random Network for Fine-grained Image Classification
TLDR: The proposed SRN, which is composed of a BiLSTM and several Tanh-Dropout blocks (called BiLSTM-TDN), is used to further process DCNN one-dimensional features to highlight the detailed information in images, and is far superior to the existing state-of-the-art methods.

References

SHOWING 1-10 OF 78 REFERENCES
Domain Adaptive Transfer Learning with Specialist Models
TLDR: It is found that more pre-training data does not always help and transfer performance depends on a judicious choice of pre-training data; domain adaptive transfer learning is proposed, a simple and effective pre-training method using importance weights computed based on the target dataset.
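As a rough illustration of the importance-weighting idea described in this entry, the sketch below weights each pre-training class by the ratio of its frequency under the target label distribution to its frequency under the source distribution. The estimator and the function name are assumptions; the paper's exact weighting scheme may differ.

```python
# Hypothetical sketch of class-level importance weights for pre-training data,
# assuming each source class is weighted by the ratio of target to source
# label frequencies (the entry's exact estimator may differ).
from collections import Counter

def importance_weights(source_labels, target_labels, eps=1e-8):
    """Up-weight source classes that are over-represented in the target task."""
    p_src = {c: n / len(source_labels) for c, n in Counter(source_labels).items()}
    p_tgt = {c: n / len(target_labels) for c, n in Counter(target_labels).items()}
    return {c: p_tgt.get(c, 0.0) / (p_src[c] + eps) for c in p_src}

print(importance_weights(["dog", "cat", "car", "car"], ["dog", "dog", "cat"]))
# e.g. {'dog': ~2.67, 'cat': ~1.33, 'car': 0.0}
```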
Big Transfer (BiT): General Visual Representation Learning
TLDR: By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.
Parameter-Efficient Transfer Learning for NLP
TLDR: To demonstrate the adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, and adapters attain near state-of-the-art performance whilst adding only a few parameters per task.
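A minimal PyTorch sketch of the bottleneck adapter recipe this entry describes (down-project, nonlinearity, up-project, residual connection); the dimensions and the GELU activation are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal bottleneck adapter in PyTorch; sizes and activation are assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the adapter close to identity at init,
        # so the frozen backbone's behaviour is preserved early in training.
        return x + self.up(self.act(self.down(x)))

# Usage: insert after a frozen transformer sub-layer and train only the adapter.
h = torch.randn(2, 16, 768)
print(Adapter(768)(h).shape)  # torch.Size([2, 16, 768])
```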
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual…
The Visual Task Adaptation Benchmark
Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified yardstick to evaluate general visual…
Efficient Parametrization of Multi-domain Deep Neural Networks
TLDR: This paper proposes to consider universal parametric families of neural networks, which still contain specialized problem-specific models differing only in a small number of parameters, and shows that these universal parametrizations are very effective for transfer learning, where they outperform traditional fine-tuning techniques.
Incremental Learning Through Deep Adaptation
TLDR: This work proposes a method called Deep Adaptation Modules (DAM) that constrains newly learned filters to be linear combinations of existing ones, and reduces the parameter cost to around 3 percent of the original with negligible or no loss in accuracy.
Learning Factored Representations in a Deep Mixture of Experts
TLDR: The Mixture of Experts model is extended to a stacked model, the Deep Mixture of Experts, with multiple sets of gating and experts, which exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
TLDR: This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
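In the spirit of this entry, a compact sketch of sparse top-k gating over a small pool of expert MLPs; the dense Python loop over experts and the omission of gating noise and load-balancing losses are simplifications for clarity, not the paper's implementation.

```python
# Sketch of sparse top-k gating over expert MLPs; simplified for clarity
# (no gating noise, no load-balancing loss, dense loop over experts).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x)                      # (tokens, n_experts)
        topv, topi = logits.topk(self.k, dim=-1)   # keep only k experts per token
        weights = F.softmax(topv, dim=-1)          # renormalise over the top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(SparseMoE(32)(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```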
Learning multiple visual domains with residual adapters
TLDR: This paper develops a tunable deep network architecture that, by means of adapter residual modules, can be steered on the fly to diverse visual domains, and introduces the Visual Decathlon Challenge, a benchmark that evaluates the ability of representations to capture ten very different visual domains simultaneously and measures their ability to recognize them uniformly well.
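A sketch of the residual-adapter idea for convolutional backbones: a trainable 1x1 convolution added on top of a frozen convolution, initialized so the block starts as the identity. Placement and normalisation details are assumptions, not the paper's exact module.

```python
# Sketch of a convolutional residual adapter: only the 1x1 conv is trained,
# the backbone convolution stays frozen (placement details are assumptions).
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, frozen_conv: nn.Conv2d):
        super().__init__()
        self.conv = frozen_conv
        for p in self.conv.parameters():
            p.requires_grad = False                     # backbone stays frozen
        c = frozen_conv.out_channels
        self.adapter = nn.Conv2d(c, c, kernel_size=1)   # the only trained part
        nn.init.zeros_(self.adapter.weight)             # start as identity mapping
        nn.init.zeros_(self.adapter.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        return y + self.adapter(y)

block = ResidualAdapter(nn.Conv2d(3, 16, 3, padding=1))
print(block(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```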