Corpus ID: 8134165

Shake-Shake regularization of 3-branch residual networks

Xavier Gastaldi
The method introduced in this paper aims to help computer vision practitioners faced with an overfitting problem. The idea is to replace, in a 3-branch ResNet, the standard summation of residual branches with a stochastic affine combination. The largest tested model improves on the best single-shot published result on CIFAR-10 by reaching 2.86% test error. Code is available at https://github.com/xgastaldi/shake-shake
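The stochastic affine combination described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the author's implementation (that is in the linked repository). The function names `shake_combine` and `shake_block` are hypothetical. The sketch assumes that the mixing coefficient is drawn uniformly from [0, 1] during training and fixed at its expected value of 0.5 at inference. It uses plain Python lists in place of tensors and leaves out any backward-pass treatment.

```python
import random


def shake_combine(branch1, branch2, training=True, rng=random):
    """Mix two residual-branch outputs with weights alpha and 1 - alpha."""
    # Assumption: alpha ~ U(0, 1) while training, 0.5 (its mean) at inference.
    alpha = rng.random() if training else 0.5
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(branch1, branch2)]


def shake_block(x, f1, f2, training=True, rng=random):
    """Skip connection plus a stochastic affine combination of two branches."""
    mixed = shake_combine(f1(x), f2(x), training, rng)
    # A standard block would instead compute x + f1(x) + f2(x).
    return [xi + mi for xi, mi in zip(x, mixed)]


if __name__ == "__main__":
    # Toy branches standing in for the two convolutional residual branches.
    double = lambda v: [2.0 * vi for vi in v]
    zero = lambda v: [0.0 for _ in v]
    print(shake_block([1.0, 2.0], double, zero, training=False))
```

Because the two weights always sum to one, the expected output during training equals the deterministic output used at inference.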


ShakeDrop Regularization for Deep Residual Learning

This paper proposes a new regularization method called ShakeDrop regularization, inspired by Shake-Shake, and introduces a training stabilizer, which is an unusual use of an existing regularizer.

SMASH: One-Shot Model Architecture Search through HyperNetworks

A technique to accelerate architecture selection by learning an auxiliary HyperNet that generates the weights of a main model conditioned on that model's architecture is proposed, achieving competitive performance with similarly-sized hand-designed networks.

An overview of mixing augmentation methods and augmentation strategies

This review mainly covers the methods published in the materials of top-tier conferences and in leading journals in the years 2017–2021, and focuses on two DA research streams: image mixing and automated selection of augmentation strategies.

Removing the Feature Correlation Effect of Multiplicative Noise

It is shown that NCMN significantly improves the performance of standard multiplicative noise on image classification tasks, providing a better alternative to dropout for batch-normalized networks and a unified view of NCMN and shake-shake regularization, which explains the performance gain of the latter.

Epsilon Consistent Mixup: An Adaptive Consistency-Interpolation Tradeoff

In this paper we propose ε-Consistent Mixup (εmu). εmu is a data-based structural regularization technique that combines Mixup’s linear interpolation with consistency regularization in the Mixup…

InAugment: Improving Classifiers via Internal Augmentation

A novel augmentation operation, InAugment, that exploits image internal statistics is suggested; it improves not only the model’s accuracy and confidence but also its performance on out-of-distribution images.

Weight asynchronous update: Improving the diversity of filters in a deep convolutional network

This work proposes a new training strategy, weight asynchronous update, which helps to significantly increase the diversity of filters and enhance the representation ability of the network, and shows that updating a stochastic subset of filters in different iterations can significantly reduce filter overlap in convolutional networks.

Faster AutoAugment: Learning Augmentation Strategies using Backpropagation

This paper proposes a differentiable policy search pipeline for data augmentation, which achieves significantly faster searching than prior work without a performance drop and introduces approximate gradients for several transformation operations with discrete parameters.

Trainable Weight Averaging for Fast Convergence and Better Generalization

Trainable Weight Averaging (TWA) is proposed, a novel training method in a reduced subspace spanned by historical solutions that largely reduces the estimation error of SWA; it not only further improves the SWA solutions but also takes full advantage of the solutions generated at the head of training, where SWA fails.



Shakeout: A New Regularized Deep Neural Network Training Scheme

This paper presents a new training scheme: Shakeout, which leads to a combination of L1 regularization and L2 regularization imposed on the weights, which has been proved effective by the Elastic Net models in practice.

Aggregated Residual Transformations for Deep Neural Networks

On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintaining complexity, increasing cardinality improves classification accuracy and is more effective than going deeper or wider when capacity is increased.

SGDR: Stochastic Gradient Descent with Restarts

This paper proposes a simple restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks and empirically studies its performance on the CIFAR-10 and CIFAR-100 datasets.

Deep Networks with Stochastic Depth

Stochastic depth is proposed, a training procedure that enables the seemingly contradictory setup of training short networks and using deep networks at test time; it reduces training time substantially and improves the test error significantly on almost all data sets that were used for evaluation.

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly is given and several new streamlined architectures for both residual and non-residual Inception Networks are presented.

Wide Residual Networks

This paper conducts a detailed experimental study on the architecture of ResNet blocks and proposes a novel architecture in which the depth of residual networks is decreased and their width is increased; the resulting network structures are called wide residual networks (WRNs) and are far superior to their commonly used thin and very deep counterparts.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Adding Gradient Noise Improves Learning for Very Deep Networks

This paper explores the low-overhead and easy-to-implement optimization technique of adding annealed Gaussian noise to the gradient, which is found to be surprisingly effective when training these very deep architectures.

Learning Multiple Layers of Features from Tiny Images

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.