Corpus ID: 199064659

An Empirical Study of Batch Normalization and Group Normalization in Conditional Computation

@article{Michalski2019AnES,
  title={An Empirical Study of Batch Normalization and Group Normalization in Conditional Computation},
  author={Vincent Michalski and Vikram S. Voleti and Samira Ebrahimi Kahou and Anthony Ortiz and Pascal Vincent and Chris Pal and Doina Precup},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.00061}
}
Batch normalization has been widely used to improve optimization in deep neural networks. [...] Key Method: All these methods utilize a learned affine transformation after the normalization operation to increase representational power. Methods used in conditional computation define the parameters of these transformations as learnable functions of conditioning information. In this work, we study whether and where the conditional formulation of group normalization can improve generalization compared to conditional…
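As a concrete illustration of the mechanism the abstract describes, below is a minimal sketch (in PyTorch, which the paper does not prescribe) of a conditional normalization layer: the activations are normalized without a built-in affine transform, and the per-channel scale and shift are predicted from a conditioning vector instead of being free parameters. The class name, the `cond_dim` argument, and the single-linear-layer conditioning network are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConditionalNorm2d(nn.Module):
    """Sketch of conditional batch/group normalization: normalize without an
    affine transform, then apply gamma/beta predicted from a conditioning
    vector (e.g. a class embedding)."""

    def __init__(self, num_channels, cond_dim, mode="batch", num_groups=32):
        super().__init__()
        if mode == "batch":
            self.norm = nn.BatchNorm2d(num_channels, affine=False)
        else:
            self.norm = nn.GroupNorm(num_groups, num_channels, affine=False)
        # Hypothetical conditioning network: one linear layer mapping the
        # conditioning vector to per-channel gamma and beta.
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x, cond):
        h = self.norm(x)                                   # (N, C, H, W), normalized
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)          # broadcast over H, W
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1.0 + gamma) * h + beta                    # conditional affine transform

# Example usage: 64 feature maps conditioned on a 128-dimensional embedding.
layer = ConditionalNorm2d(64, cond_dim=128, mode="group")
out = layer(torch.randn(8, 64, 32, 32), torch.randn(8, 128))
```

In the conditional-computation settings the paper studies, `cond` would carry, for example, a class embedding in a conditional GAN or task information in few-shot learning.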
Citations

Representative Batch Normalization with Feature Calibration
TLDR
This work proposes to add a simple yet effective feature calibration scheme into the centering and scaling operations of BatchNorm, enhancing the instance-specific representations at negligible computational cost.
Normalization Techniques in Training DNNs: Methodology, Analysis and Application
TLDR
A unified picture of the main motivation behind different approaches from the perspective of optimization is provided, and a taxonomy for understanding the similarities and differences between them is presented.

References

SHOWING 1-10 OF 47 REFERENCES
Group Normalization
TLDR
Group Normalization can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks.
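For reference, a minimal functional sketch of the group normalization operation: statistics are computed per sample over groups of channels, so they are independent of batch size. The function name and tensor layout are illustrative, and a learned per-channel affine transform would normally follow.

```python
import torch

def group_norm(x, num_groups, eps=1e-5):
    # x: (N, C, H, W). Mean and variance are computed per sample over each
    # group of channels, so the result does not depend on the batch size.
    n, c, h, w = x.shape
    x = x.view(n, num_groups, c // num_groups, h, w)
    mean = x.mean(dim=(2, 3, 4), keepdim=True)
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mean) / torch.sqrt(var + eps)
    return x.view(n, c, h, w)
```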
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Fixup Initialization: Residual Learning Without Normalization
TLDR
This work proposes fixed-update initialization (Fixup), which addresses the exploding and vanishing gradient problem at the beginning of training by properly rescaling a standard initialization, enabling residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.
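A rough sketch of the rescaling rule this summary alludes to, as described in the Fixup paper: weight layers inside a residual branch are initialized normally and then scaled by L^(-1/(2m-2)) (for L residual branches with m layers each, m >= 2), while the final layer of each branch is zero-initialized. The helper name and the representation of branches as lists of modules are assumptions for illustration, not the authors' code.

```python
import torch.nn as nn

def fixup_rescale(residual_branches, num_branches, layers_per_branch):
    """Apply a Fixup-style rescaling to a list of residual branches, where
    each branch is a list of nn.Conv2d / nn.Linear modules (>= 2 layers)."""
    scale = num_branches ** (-1.0 / (2 * layers_per_branch - 2))
    for branch in residual_branches:
        for layer in branch[:-1]:
            nn.init.kaiming_normal_(layer.weight)   # standard initialization
            layer.weight.data.mul_(scale)           # rescale by L^(-1/(2m-2))
        nn.init.zeros_(branch[-1].weight)           # zero-init each branch's final layer
```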
Optimization as a Model for Few-Shot Learning
Classification Accuracy Score for Conditional Generative Models
TLDR
This work uses class-conditional generative models from a number of model classes, namely variational autoencoders, autoregressive models, and generative adversarial networks (GANs), to infer the class labels of real data, revealing some surprising results not identified by traditional metrics.
Self-Attention Generative Adversarial Networks
TLDR
The proposed SAGAN achieves state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing the Fréchet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset.
TADAM: Task dependent adaptive metric for improved few-shot learning
TLDR
This work identifies that metric scaling and metric task conditioning are important for improving the performance of few-shot algorithms, and proposes and empirically tests a practical end-to-end optimization procedure based on auxiliary task co-training to learn a task-dependent metric space.
Batch Normalization is a Cause of Adversarial Vulnerability
TLDR
Substituting weight decay for batch norm is sufficient to nullify the relationship between adversarial vulnerability and the input dimension, and is consistent with a mean-field analysis that found that batch norm causes exploding gradients.
Learning to Learn with Conditional Class Dependencies
TLDR
A meta-learning framework is proposed that conditionally transforms feature representations based on a metric space trained to capture inter-class dependencies, enabling conditional modulation of the base-learner's feature representations to impose regularities informed by the label space.
Attention is All you Need
TLDR
A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by its successful application to English constituency parsing with both large and limited training data.
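The core operation of the Transformer, scaled dot-product attention, can be sketched as follows; the tensor shapes and the optional mask handling are illustrative assumptions rather than the paper's reference code.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k). Attention weights are computed from
    # the scaled dot products of queries and keys, then used to mix the values.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```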