• Corpus ID: 8134165

Shake-Shake regularization of 3-branch residual networks

  • Xavier Gastaldi
  • Published in International Conference on Learning Representations, 17 February 2017
  • Computer Science
The method introduced in this paper aims at helping computer vision practitioners faced with an overfitting problem. The idea is to replace, in a 3-branch ResNet, the standard summation of residual branches by a stochastic affine combination. The largest tested model improves on the best single shot published result on CIFAR-10 by reaching 2.86% test error. Code is available at https://github.com/xgastaldi/shake-shake
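
As a rough illustration of the idea in the abstract, the sketch below combines two residual branches with coefficients alpha and (1 - alpha) instead of summing them. It is not the author's implementation: the function name `shake_combine`, the uniform draw of alpha during training, the fixed value 0.5 at evaluation time, and the toy branches `double` and `negate` are all assumptions made for the example, and only the forward combination is shown.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]
Branch = Callable[[Vector], Vector]


def shake_combine(
    x: Vector,
    branch_1: Branch,
    branch_2: Branch,
    training: bool,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Skip connection plus an affine combination of two residual branches."""
    if training:
        # Assumption: a fresh alpha is drawn uniformly from [0, 1) per call.
        if rng is None:
            rng = random.Random()
        alpha = rng.random()
    else:
        # Assumption: evaluation uses the expected value of alpha.
        alpha = 0.5
    out_1 = branch_1(x)
    out_2 = branch_2(x)
    # The two coefficients always sum to one, so the combination is affine.
    return [
        xi + alpha * a + (1.0 - alpha) * b
        for xi, a, b in zip(x, out_1, out_2)
    ]


# Hypothetical stand-ins for the two residual branches.
def double(v: Vector) -> Vector:
    return [2.0 * t for t in v]


def negate(v: Vector) -> Vector:
    return [-t for t in v]


x = [1.0, 2.0]
eval_out = shake_combine(x, double, negate, training=False)
train_out = shake_combine(x, double, negate, training=True, rng=random.Random(0))
```

With these toy branches the evaluation output is deterministic, while the training output changes from call to call unless the random generator is seeded, which is the source of the regularizing noise.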

HybridNet: Classification and Reconstruction Cooperation for Semi-Supervised Learning

A new model for leveraging unlabeled data to improve the generalization performance of image classifiers: a two-branch encoder-decoder architecture called HybridNet, able to outperform state-of-the-art results on CIFAR-10, SVHN and STL-10 in various semi-supervised settings.

SMASH: One-Shot Model Architecture Search through HyperNetworks

A technique to accelerate architecture selection by learning an auxiliary HyperNet that generates the weights of a main model conditioned on that model's architecture is proposed, achieving competitive performance with similarly-sized hand-designed networks.

An overview of mixing augmentation methods and augmentation strategies

This review mainly covers the methods published in the materials of top-tier conferences and in leading journals in the years 2017–2021, and focuses on two DA research streams: image mixing and automated selection of augmentation strategies.

A Two-Stage Shake-Shake Network for Long-Tailed Recognition of SAR Aerial View Objects

A two-stage shake-shake network is proposed to tackle the long-tailed learning problem; it decouples the learning procedure into a representation learning stage and a classification learning stage to improve accuracy.

InAugment: Improving Classifiers via Internal Augmentation

A novel augmentation operation, InAugment, that exploits image internal statistics is suggested; it improves not only the model’s accuracy and confidence but also its performance on out-of-distribution images.

Faster AutoAugment: Learning Augmentation Strategies using Backpropagation

This paper proposes a differentiable policy search pipeline for data augmentation, which achieves significantly faster searching than prior work without a performance drop and introduces approximate gradients for several transformation operations with discrete parameters.

Trainable Weight Averaging for Fast Convergence and Better Generalization

Trainable Weight Averaging (TWA) is proposed, essentially a novel training method in a reduced subspace spanned by historical solutions. It largely reduces the estimation error of SWA, allowing it not only to further improve the SWA solutions but also to take full advantage of the solutions generated in the head of training, where SWA fails.

Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning

This paper proposes an effective method to improve the model generalization by additionally penalizing the gradient norm of loss function during optimization, and shows that the recent sharpness-aware minimization method is a special, but not the best, case of this method.

FreezeOut: Accelerate Training by Progressively Freezing Layers

This extended abstract proposes to only train the hidden layers for a set portion of the training run, freezing them out one-by-one and excluding them from the backward pass, demonstrating savings of up to 20% wall-clock time during training.



Shakeout: A New Regularized Deep Neural Network Training Scheme

This paper presents a new training scheme: Shakeout, which leads to a combination of L1 regularization and L2 regularization imposed on the weights, which has been proved effective by the Elastic Net models in practice.

Aggregated Residual Transformations for Deep Neural Networks

On the ImageNet-1K dataset, it is empirically shown that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy and is more effective than going deeper or wider when the authors increase the capacity.

Identity Mappings in Deep Residual Networks

The propagation formulations behind the residual building blocks suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation.

Deep Networks with Stochastic Depth

Stochastic depth is proposed, a training procedure that enables the seemingly contradictory setup of training short networks and using deep networks at test time; it reduces training time substantially and improves the test error significantly on almost all data sets that were used for evaluation.

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly is given and several new streamlined architectures for both residual and non-residual Inception Networks are presented.

Wide Residual Networks

This paper conducts a detailed experimental study on the architecture of ResNet blocks and proposes a novel architecture where the depth of residual networks is decreased and their width is increased; the resulting network structures, called wide residual networks (WRNs), are far superior to their commonly used thin and very deep counterparts.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Adding Gradient Noise Improves Learning for Very Deep Networks

This paper explores the low-overhead and easy-to-implement optimization technique of adding annealed Gaussian noise to the gradient, which is found to be surprisingly effective when training these very deep architectures.

Learning Multiple Layers of Features from Tiny Images

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.