Corpus ID: 155092148

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks

Guangyong Chen, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, B. Liao, Shengyu Zhang
In this work, we propose a novel technique to boost the training efficiency of a neural network. Our work builds on the idea that whitening the inputs of neural networks leads to faster convergence. Given the well-known fact that independent components must be whitened, we introduce a novel Independent-Component (IC) layer before each weight layer, so that the inputs to that layer are made more independent. However, determining independent components is a computationally intensive task. To…
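
The truncated abstract does not spell out how the IC layer is built. The sketch below assumes, as the paper's title suggests, that it combines batch normalization with dropout and is placed before a weight layer. The function names (`batch_norm`, `dropout`, `ic_layer`) and the default `keep_prob` are illustrative choices, not the authors' code, and the evaluation path is simplified to reuse batch statistics rather than running averages.

```python
import math
import random


def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) to zero mean and unit variance over the batch."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


def dropout(batch, keep_prob, rng):
    """Inverted dropout: keep each activation with probability keep_prob, then rescale."""
    return [
        [x / keep_prob if rng.random() < keep_prob else 0.0 for x in row]
        for row in batch
    ]


def ic_layer(batch, keep_prob=0.9, rng=None, training=True):
    """Hypothetical IC layer: batch normalization followed by dropout.

    Assumption: in evaluation mode this sketch only normalizes with the
    statistics of the given batch; a real implementation would typically
    use running statistics collected during training.
    """
    normalized = batch_norm(batch)
    if not training:
        return normalized
    if rng is None:
        rng = random.Random(0)
    return dropout(normalized, keep_prob, rng)
```

In a network, the output of `ic_layer` would be fed to the next weight layer, so each weight layer sees standardized, randomly masked inputs during training.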

Generalizing MLPs With Dropouts, Batch Normalization, and Skip Connections
It is empirically shown that whitening inputs before every linear layer and adding skip connections gives the proposed MLP architecture better performance.
Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization
An analytic framework based on convex duality is introduced to obtain exact convex representations of weight-decay regularized ReLU networks with BN, which can be trained in polynomial time; it also shows that optimal layer weights can be obtained as simple closed-form formulas in the high-dimensional and/or overparameterized regimes.
Multirate Training of Neural Networks
This paper proposes multirate training of neural networks: partitioning the network parameters into “fast” and “slow” parts that are trained simultaneously using different learning rates. It also proposes an additional multirate technique that can learn different features present in the data by training the full network on different time scales simultaneously.
Empirical Evaluation of the Effect of Optimization and Regularization Techniques on the Generalization Performance of Deep Convolutional Neural Network
This work explores the effect of the optimization algorithm and regularization techniques on the final generalization performance of convolutional neural network (CNN) architectures widely used in computer vision.
Maximum Relevance Minimum Redundancy Dropout with Informative Kernel Determinantal Point Process
This work proposes an efficient end-to-end dropout algorithm that selects the most informative neurons with the highest correlation with the target output considering the sparsity in its selection procedure and introduces the novel DPPMI dropout that adaptively adjusts the retention rate of neurons based on their contribution to the neural network task.
In Defense of Dropout
The results are promising: dynamic IC layers give better accuracy than baseline models or static IC layers, and ways of extending the concept to action recognition are suggested.
Learning a Multi-scale Deep Residual Network of Dilated-Convolution for Image Denoising
A multi-scale trainable deep residual convolutional neural network (DCMSNet) based on dilated convolution is proposed; it can remove different degrees of noise and achieves relatively competitive results.
Normalization Techniques in Training DNNs: Methodology, Analysis and Application
A unified picture of the main motivation behind different approaches from the perspective of optimization is provided, and a taxonomy for understanding the similarities and differences between them is presented.
Uses and Abuses of the Cross-Entropy Loss: Case Studies in Modern Deep Learning
This work identifies the potential for outperformance in categorical cross-entropy loss models, thereby highlighting the importance of a proper probabilistic treatment, as well as illustrating some of the failure modes thereof.
HCR-Net: A deep learning based script independent handwritten character recognition network
HCR-Net is a novel deep learning architecture for script-independent handwritten character recognition (HCR), based on a novel transfer learning approach, that exploits transfer learning and image augmentation for end-to-end learning.


Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Understanding Batch Normalization
It is shown that BN primarily enables training with larger learning rates, which is the cause for faster convergence and better generalization, and contrasts the results against recent findings in random matrix theory, shedding new light on classical initialization schemes and their consequences.
Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift
The uncovered mechanism causes unstable numerical behavior in inference that ultimately leads to erroneous predictions, and the large feature dimension in WRN further reduces the “variance shift”, benefiting overall performance.
Layer Normalization
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called
Decorrelated Batch Normalization
This work proposes Decorrelated Batch Normalization (DBN), which not just centers and scales activations but whitens them, and shows that DBN can improve the performance of BN on multilayer perceptrons and convolutional neural networks.
Group Normalization
Group Normalization (GN) is presented as a simple alternative to BN that can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks.
How Does Batch Normalization Help Optimization?
It is demonstrated that the distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, BatchNorm makes the optimization landscape significantly smoother, and this smoothness induces more predictive and stable gradient behavior, allowing for faster training.
Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
S. Ioffe (2017), Computer Science, Biology
This work proposes Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs that depend on individual examples rather than the entire minibatch.
Dropout: a simple way to prevent neural networks from overfitting
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
This work proposes a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit and derives a robust initialization method that particularly considers the rectifier nonlinearities.
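
As a rough illustration of the last entry, the sketch below shows a PReLU activation and the standard deviation of a rectifier-aware Gaussian weight initialization. The function names `prelu` and `he_std` are illustrative, and the slope `a` is a fixed argument here, whereas in PReLU it is a learned parameter (0.25 is a common starting value).

```python
import math


def prelu(x, a=0.25):
    """PReLU: identity for positive inputs, slope `a` for non-positive inputs."""
    return x if x > 0 else a * x


def he_std(fan_in, a=0.0):
    """Std of a zero-mean Gaussian init for a layer with `fan_in` inputs.

    `a` is the negative slope of the rectifier that follows the layer;
    a = 0 corresponds to a plain ReLU.
    """
    return math.sqrt(2.0 / ((1.0 + a * a) * fan_in))
```

With `a = 0` the function reduces to ReLU and the initialization scale to `sqrt(2 / fan_in)`.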