Corpus ID: 211678252

Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs

@article{Frankle2021TrainingBA,
  title={Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs},
  author={Jonathan Frankle and David J. Schwab and Ari S. Morcos},
  journal={ArXiv},
  year={2021},
  volume={abs/2003.00152}
}
Batch normalization (BatchNorm) has become an indispensable tool for training deep neural networks, yet it is still poorly understood. Although previous work has typically focused on its normalization component, BatchNorm also adds two per-feature trainable parameters - a coefficient and a bias - whose role and expressive power remain unclear. To study this question, we investigate the performance achieved when training only these parameters and freezing all others at their random initializations.
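For concreteness, the training regime described in the abstract (freeze all weights at their random initial values and train only the per-feature BatchNorm coefficient and bias) can be sketched roughly as follows. This is an illustrative PyTorch reconstruction, not the authors' released code; the architecture, optimizer, and hyperparameters are placeholders.

```python
# Illustrative sketch (not the authors' code): train only the BatchNorm
# affine parameters (gamma/weight and beta/bias) of a randomly initialized CNN.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(num_classes=10)  # random init, no pretraining

# Collect the per-feature BatchNorm coefficient (gamma) and bias (beta).
bn_params = [p for m in model.modules() if isinstance(m, nn.BatchNorm2d)
             for p in (m.weight, m.bias)]

# Freeze every parameter, then re-enable gradients only for the BN parameters.
for p in model.parameters():
    p.requires_grad = False
for p in bn_params:
    p.requires_grad = True

optimizer = torch.optim.SGD(bn_params, lr=0.1, momentum=0.9)  # placeholder hyperparameters
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()   # gradients reach only the BN coefficients and biases
    optimizer.step()  # all other weights stay at their random initialization
    return loss.item()
```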

Citations

Calibrated BatchNorm: Improving Robustness Against Noisy Weights in Neural Networks
TLDR
The statistics of the batch normalization layers are recalculated during the inference phase to calibrate the biased distributions, yielding noise-agnostic robust networks and advancing the development of analog computing devices for neural networks.
RSO: A Gradient Free Sampling Based Approach For Training Deep Neural Networks
TLDR
Surprisingly, it is found that repeating this process a few times for each weight is sufficient to train a deep neural network and the algorithm obtains a classification accuracy of 98% on MNIST.
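The summary does not spell out the sampling step. As a rough illustration only, a gradient-free, coordinate-wise random search of this general flavor could look like the sketch below; the candidate set, noise scale, and visiting order are assumptions and may differ from RSO's exact recipe.

```python
# Rough illustration of a gradient-free, per-weight random search step:
# try a few random candidate values for a single weight on one mini-batch
# and keep whichever value yields the lowest loss. Details are assumptions,
# not RSO's exact recipe.
import torch

@torch.no_grad()
def update_one_weight(model, loss_fn, x, y, param, idx, sigma=0.01, n_tries=2):
    flat = param.view(-1)
    best_val = flat[idx].item()
    best_loss = loss_fn(model(x), y).item()
    for _ in range(n_tries):
        flat[idx] = best_val + sigma * torch.randn(()).item()
        loss = loss_fn(model(x), y).item()
        if loss < best_loss:
            best_loss, best_val = loss, flat[idx].item()
    flat[idx] = best_val  # keep the best value seen so far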
Supermasks in Superposition
TLDR
The Supermasks in Superposition (SupSup) model, capable of sequentially learning thousands of tasks without catastrophic forgetting, is presented and it is found that a single gradient step is often sufficient to identify the correct mask, even among 2500 tasks.
Improving robustness against common corruptions by covariate shift adaptation
TLDR
It is argued that results with adapted statistics should be included whenever reporting scores in corruption benchmarks and other out-of-distribution generalization settings, and 32 samples are sufficient to improve the current state of the art for a ResNet-50 architecture.
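As a minimal sketch of this kind of test-time adaptation (assuming a PyTorch model with BatchNorm2d layers; the calls below are one way to do it, not necessarily the authors' implementation), the BN statistics can be re-estimated from a small batch of test-domain inputs before evaluation:

```python
# Hedged sketch: re-estimate BatchNorm running statistics from a small batch
# of test-domain inputs (e.g. 32 corrupted images) before evaluating.
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_bn_statistics(model, adaptation_batch):
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()  # discard the source-domain statistics
            m.momentum = None        # cumulative average: one pass fully replaces them
    model.train()                    # BN updates running stats only in training mode
    model(adaptation_batch)          # forward pass recomputes mean and variance
    model.eval()                     # back to inference mode with adapted statistics
    return model
```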
Wiring Up Vision: Minimizing Supervised Synaptic Updates Needed to Produce a Primate Ventral Stream
TLDR
The total number of supervised weight updates can be substantially reduced using three complementary strategies; for example, only ~2% of supervised updates (two orders of magnitude fewer supervised synaptic updates) are needed to achieve ~80% of the match to the adult ventral stream.
Kernel Modulation: A Parameter-Efficient Method for Training Convolutional Neural Networks
TLDR
This work proposes a novel parameter-efficient kernel modulation (KM) method that adapts all parameters of a base network instead of a subset of layers for each new task, and shows that KM delivers up to 9% higher accuracy on the Transfer Learning benchmark.
On Fragile Features and Batch Normalization in Adversarial Training
TLDR
It is found that fragile features can be used to learn models with moderate adversarial robustness, while random features cannot, and that adversarially training only the BN layers from scratch can result in non-trivial adversarial robustness.
Partial transfusion: on the expressive influence of trainable batch norm parameters for transfer learning
TLDR
It is found that fine-tuning only the trainable weights of the batch normalisation layers leads to similar performance as fine-tuning all of the weights, with the added benefit of faster convergence.
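A hedged sketch of that fine-tuning regime (the backbone, pretrained weights, and parameter handling below are placeholders assuming a recent torchvision, not the paper's exact setup) is to start from a pretrained network and leave only the BatchNorm affine parameters trainable:

```python
# Hedged sketch: BN-only fine-tuning of a pretrained backbone for transfer learning.
import torch.nn as nn
import torchvision

model = torchvision.models.densenet121(weights="IMAGENET1K_V1")  # placeholder backbone

for p in model.parameters():
    p.requires_grad = False                  # freeze everything...
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.weight.requires_grad = True        # ...except gamma
        m.bias.requires_grad = True          # ...and beta

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"fine-tuning {trainable:,} of {total:,} parameters")
```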
Training BatchNorm Only in Neural Architecture Search and Beyond
TLDR
This work proposes a novel composite performance indicator, derived from theoretical properties of BatchNorm, that evaluates networks from three perspectives: expressivity, trainability, and uncertainty. It also empirically shows that a train-BN-only supernet gives convolutions an advantage over other operators, causing unfair competition between architectures.
BN-NAS: Neural Architecture Search with Batch Normalization
TLDR
BN-NAS can significantly reduce the time required for model training and evaluation in NAS, and a BN-based indicator for predicting subnet performance at a very early training stage is proposed for fast evaluation.

References

Showing 1-10 of 33 references
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
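For reference, the transformation introduced in that paper combines batch statistics with the trainable coefficient and bias discussed in the abstract above (standard formulation):

```latex
% BatchNorm for a single feature x over a mini-batch B:
% normalize with the batch statistics, then apply the trainable affine transform.
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y = \gamma \, \hat{x} + \beta
```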
Understanding Batch Normalization
TLDR
It is shown that BN primarily enables training with larger learning rates, which is the cause of faster convergence and better generalization; the results are contrasted against recent findings in random matrix theory, shedding new light on classical initialization schemes and their consequences.
On the importance of single directions for generalization
TLDR
It is found that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.
Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization
TLDR
It is argued that this acceleration is due to the fact that Batch Normalization splits the optimization task into optimizing length and direction of the parameters separately, which allows gradient-based methods to leverage a favourable global structure in the loss landscape.
How Does Batch Normalization Help Optimization?
TLDR
It is demonstrated that such distributional stability of layer inputs has little to do with the success of BatchNorm; rather, BatchNorm makes the optimization landscape significantly smoother, and this smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.
The Shattered Gradients Problem: If resnets are the answer, then what is the question?
TLDR
It is shown that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise, whereas, in contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly.
Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing
TLDR
This paper proposes to fix almost all layers of a deep convolutional neural network, allowing only a small portion of the weights to be learned, and suggests practical ways to harness this property to create more robust and compact representations.
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
TLDR
A reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction is presented, improving the conditioning of the optimization problem and speeding up convergence of stochastic gradient descent.
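For reference, the reparameterization described here writes each weight vector in terms of a scalar length and a direction (standard formulation from that paper):

```latex
% Weight normalization: decouple the length g from the direction v of a weight vector w.
\mathbf{w} = \frac{g}{\lVert \mathbf{v} \rVert} \, \mathbf{v},
\qquad g \in \mathbb{R}, \quad \mathbf{v} \in \mathbb{R}^{n}
```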
Are All Layers Created Equal?
TLDR
This study provides further evidence that mere parameter counting or norm accounting is too coarse in studying generalization of deep models, and flatness or robustness analysis of the models needs to respect the network architectures.
Wide Residual Networks
TLDR
This paper conducts a detailed experimental study on the architecture of ResNet blocks and proposes a novel architecture where the depth of residual networks is decreased and their width increased; the resulting network structures, called wide residual networks (WRNs), are far superior to their commonly used thin and very deep counterparts.