Corpus ID: 238252949

ResNet strikes back: An improved training procedure in timm

@article{Wightman2021ResNetSB,
  title={ResNet strikes back: An improved training procedure in timm},
  author={Ross Wightman and Hugo Touvron and Hervé Jégou},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.00476}
}
The influential Residual Networks designed by He et al. remain the gold-standard architecture in numerous scientific publications. They typically serve as the default architecture in studies, or as baselines when new architectures are proposed. Yet there has been significant progress on best practices for training neural networks since the inception of the ResNet architecture in 2015. Novel optimization and data-augmentation strategies have increased the effectiveness of training recipes. In this paper…
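The improved recipe itself lives in the timm training scripts. As a hedged illustration of two of its ingredients (a stock ResNet-50 built with timm, plus Mixup/CutMix regularization), a minimal sketch using timm's public API could look as follows; the hyperparameter values are illustrative assumptions, not the paper's tuned settings.

```python
# Minimal sketch (not the paper's exact recipe): build a ResNet-50 with timm and
# apply Mixup/CutMix to a batch. Alpha values and label smoothing are assumed here.
import torch
import timm
from timm.data import Mixup

model = timm.create_model("resnet50", pretrained=False, num_classes=1000)

mixup_fn = Mixup(mixup_alpha=0.2, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)

images = torch.randn(8, 3, 224, 224)               # dummy batch (even batch size required)
targets = torch.randint(0, 1000, (8,))
images, soft_targets = mixup_fn(images, targets)   # soft targets for the training loss
logits = model(images)
```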
Revisiting Batch Normalization
TLDR
This work revisits the BN formulation, presents a new initialization method and update approach for BN to address the issues identified, and introduces a new online BN-based input data normalization technique to alleviate the need for other offline or fixed methods.
MetaFormer is Actually What You Need for Vision
  • Weihao Yu, Mi Luo, +5 authors Shuicheng Yan
  • Computer Science
  • ArXiv
  • 2021
TLDR
The work argues that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks, and calls for more future research dedicated to improving MetaFormer instead of focusing on the token-mixer modules.
On Convergence of Training Loss Without Reaching Stationary Points
TLDR
This work provides numerical evidence that in large-scale neural network training, such as ImageNet, ResNet, and WT103 + Transformer-XL models, the neural network weight variables do not converge to stationary points where the gradient of the loss function vanishes, and proposes a new perspective based on the ergodic theory of dynamical systems.
GreedyNASv2: Greedier Search with a Greedy Path Filter
  • Tao Huang, Shan You, +4 authors Chang Xu
  • Computer Science
  • ArXiv
  • 2021
TLDR
This paper leverages an explicit path filter to capture the characteristics of paths and directly filter out weak ones, so that the search can be implemented on the shrunk space more greedily and efficiently.
The Efficiency Misnomer
TLDR
It is demonstrated how incomplete reporting of cost indicators can lead to partial conclusions and a blurred or incomplete picture of the practical considerations of different models, and suggestions to improve reporting of efficiency metrics are presented.
Extrapolating from a Single Image to a Thousand Classes using Distillation
What can neural networks learn about the visual world from a single image? While it obviously cannot contain the multitudes of possible objects, scenes and lighting conditions that exist – within the…
ML-Decoder: Scalable and Versatile Classification Head
In this paper, we introduce ML-Decoder, a new attention-based classification head. ML-Decoder predicts the existence of class labels via queries, and enables better utilization of spatial data…

References

Showing 1–10 of 57 references
Revisiting ResNets: Improved Training and Scaling Strategies
TLDR
It is found that training and scaling strategies may matter more than architectural changes, and further, that the resulting ResNets match recent state-of-the-art models.
AutoAugment: Learning Augmentation Policies from Data
TLDR
This paper describes a simple procedure called AutoAugment to automatically search for improved data augmentation policies, which achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data).
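For context, the learned ImageNet policy is available as a ready-made transform in torchvision; the sketch below shows one way to apply it in a training pipeline. This is torchvision's bundled implementation, not the paper's search code, and the crop size is an assumed value.

```python
# Hedged usage sketch: apply the learned ImageNet AutoAugment policy from torchvision.
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),              # assumed crop size
    AutoAugment(policy=AutoAugmentPolicy.IMAGENET), # policy found by the AutoAugment search
    transforms.ToTensor(),
])
```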
Aggregated Residual Transformations for Deep Neural Networks
TLDR
On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy and is more effective than going deeper or wider when the authors increase the capacity.
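Cardinality in this sense is usually realized with grouped convolutions; a minimal sketch of a ResNeXt-style bottleneck with cardinality 32 is shown below (channel sizes are illustrative and the residual addition is omitted).

```python
# Hedged sketch of an aggregated-transformation (ResNeXt-style) bottleneck:
# the 3x3 convolution is split into 32 parallel paths via groups=32.
import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=1, bias=False),
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False),  # cardinality = 32
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, kernel_size=1, bias=False),
    nn.BatchNorm2d(256),
)
```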
High-Performance Large-Scale Image Recognition Without Normalization
TLDR
An adaptive gradient clipping technique is developed which overcomes the instabilities of training without batch normalization, and a significantly improved class of Normalizer-Free ResNets is designed which attain significantly better performance when fine-tuning on ImageNet.
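The core idea of adaptive gradient clipping is to scale each gradient so its norm stays below a small fraction of the corresponding parameter norm. The paper applies this unit-wise; the hedged sketch below does it tensor-wise for brevity, with assumed clipping constants.

```python
# Hedged, tensor-wise sketch of adaptive gradient clipping (the paper clips unit-wise).
import torch

def adaptive_grad_clip(parameters, clip=0.01, eps=1e-3):
    for p in parameters:
        if p.grad is None:
            continue
        p_norm = p.detach().norm().clamp(min=eps)   # guard tiny parameters
        g_norm = p.grad.detach().norm()
        max_norm = clip * p_norm
        if g_norm > max_norm:                       # rescale only over-long gradients
            p.grad.mul_(max_norm / (g_norm + 1e-6))
```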
Rethinking the Inception Architecture for Computer Vision
TLDR
This work explores ways to scale up networks that aim to utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
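One of the factorizations described there replaces an n×n convolution with a 1×n convolution followed by an n×1 convolution, cutting the cost roughly from n² to 2n per filter. A hedged sketch with illustrative channel sizes:

```python
# Hedged sketch of a factorized 7x7 convolution (1x7 followed by 7x1); channel sizes are assumed.
import torch.nn as nn

factorized_7x7 = nn.Sequential(
    nn.Conv2d(192, 192, kernel_size=(1, 7), padding=(0, 3)),
    nn.Conv2d(192, 192, kernel_size=(7, 1), padding=(3, 0)),
)
```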
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
TLDR
The empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning, and the optimizer enables use of very large batch sizes of 32868 without any degradation of performance.
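A hedged sketch of a single LAMB update for one parameter tensor, following the layer-wise trust-ratio rule the paper describes on top of an Adam-style step; the hyperparameter defaults are assumptions.

```python
# Hedged sketch of one LAMB step for a single tensor `w` with gradient `g` and
# Adam moments `m`, `v` at step `t`. Defaults are illustrative, not tuned values.
import torch

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, wd=0.01):
    m.mul_(beta1).add_(g, alpha=1 - beta1)              # first moment
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)       # second moment
    m_hat = m / (1 - beta1 ** t)                        # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (v_hat.sqrt() + eps) + wd * w      # Adam direction + weight decay
    w_norm, u_norm = w.norm(), update.norm()
    trust = (w_norm / u_norm).item() if w_norm > 0 and u_norm > 0 else 1.0
    w.add_(update, alpha=-lr * trust)                   # layer-wise trust-ratio scaling
```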
Fixing the train-test resolution discrepancy
TLDR
It is experimentally validated that, for a target test resolution, using a lower train resolution offers better classification at test time, and a simple yet effective and efficient strategy to optimize the classifier performance when the train and test resolutions differ is proposed.
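A hedged sketch of the train/test resolution split using standard torchvision transforms; the 160-pixel training crop and 224-pixel test crop are illustrative choices, not the paper's reported settings.

```python
# Hedged sketch: train with RandomResizedCrop at a lower resolution, evaluate at a higher one
# (the paper additionally fine-tunes the classifier at the test resolution).
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(160),   # assumed lower train resolution
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),          # assumed higher test resolution
    transforms.ToTensor(),
])
```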
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
TLDR
A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient, and the effectiveness of this method is demonstrated by scaling up MobileNets and ResNet.
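The compound coefficient couples depth, width and resolution so they grow together; a hedged sketch of the rule, using the constants reported for the EfficientNet-B0 search (alpha * beta^2 * gamma^2 is kept close to 2 so FLOPs roughly double per unit of phi):

```python
# Hedged sketch of compound scaling; alpha/beta/gamma are the B0 constants from the paper.
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    depth_mult = alpha ** phi        # more layers
    width_mult = beta ** phi         # more channels
    resolution_mult = gamma ** phi   # larger input images
    return depth_mult, width_mult, resolution_mult

print(compound_scale(1))  # roughly (1.2, 1.1, 1.15)
```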
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
Deep Networks with Stochastic Depth
TLDR
Stochastic depth is proposed, a training procedure that enables the seemingly contradictory setup of training short networks while using deep networks at test time; it substantially reduces training time and significantly improves test error on almost all data sets used for evaluation.
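A hedged sketch of a residual block with stochastic depth: the residual branch is dropped at random during training and kept, scaled by its survival probability, at test time (the drop probability below is an assumed value).

```python
# Hedged sketch of stochastic depth around an arbitrary residual branch.
import torch
import torch.nn as nn

class StochasticDepthResidual(nn.Module):
    def __init__(self, branch: nn.Module, drop_prob: float = 0.2):
        super().__init__()
        self.branch = branch
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.drop_prob:
                return x                                     # skip the branch this step
            return x + self.branch(x)
        return x + (1.0 - self.drop_prob) * self.branch(x)  # expected value at test time
```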