Corpus ID: 235358576

Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training

Dominic Masters, Antoine Labatie, Zach Eaton-Rosen, Carlo Luschi
Much recent research has been dedicated to improving the efficiency of training and inference for image classification. This effort has commonly focused on explicitly improving theoretical efficiency, often measured as ImageNet validation accuracy per FLOP. These theoretical savings have, however, proven challenging to achieve in practice, particularly on high-performance training accelerators. In this work, we focus on improving the practical efficiency of the state-of-the-art EfficientNet… 
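One of the levers named in the title is group convolutions, which cut parameters and FLOPs by splitting the channels into independent groups. As a minimal sketch (the function name and shapes below are illustrative, not from the paper), the weight count of a grouped convolution follows directly from the per-group channel counts:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2-D convolution with `groups` groups.

    Each group maps c_in/groups input channels to c_out/groups output
    channels with its own k x k kernels, so parameters shrink by `groups`.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k

# A dense 3x3 convolution vs. the same shape split into 16 groups:
dense = conv_params(256, 256, 3)               # 589,824 weights
grouped = conv_params(256, 256, 3, groups=16)  # 36,864 weights (16x fewer)
```

The 16x parameter saving is the theoretical kind of efficiency the abstract contrasts with practical, on-accelerator efficiency.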


GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures

This work demonstrates a set of modifications to the structure of a Transformer layer, producing a more efficient architecture, and applies the resulting architecture to language representation learning, showing superior performance compared to BERT models of different scales.

8-bit Numerical Formats for Deep Neural Networks

An in-depth study on the use of 8-bit floating-point number formats for activations, weights, and gradients for both training and inference and addresses the trade-offs between these formats and the effect of low-precision arithmetic on non-convex optimization and generalization.

NanoBatch Privacy: Enabling fast Differentially Private learning on the IPU

NanoBatch Privacy is proposed, a lightweight add-on to TFDP to be used on Graphcore IPUs, leveraging a batch size of 1 (without microbatching) and gradient accumulation to achieve large total batch sizes with minimal impact on throughput.
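The gradient-accumulation idea in this summary can be sketched in a few lines: run many micro-batches of size 1 and fold their gradients into a single averaged update, emulating one large batch. This is a generic illustration (the function name and scalar-weight setup are hypothetical, not the NanoBatch Privacy API):

```python
def sgd_with_accumulation(grads, accum_steps, lr=0.1, w=0.0):
    """One optimizer step per `accum_steps` gradients, each computed on a
    micro-batch of size 1, emulating a total batch of `accum_steps`."""
    buf = 0.0
    for i, g in enumerate(grads, start=1):
        buf += g                          # accumulate instead of updating
        if i % accum_steps == 0:
            w -= lr * buf / accum_steps   # average over the effective batch
            buf = 0.0
    return w

# Four unit gradients with accum_steps=2 -> two SGD steps of -lr each.
w = sgd_with_accumulation([1.0, 1.0, 1.0, 1.0], accum_steps=2)
```

For differential privacy the per-example handling matters: each micro-batch of 1 exposes an individual gradient that can be clipped and noised before accumulation.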

Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence

This work introduces the technique “Proxy Normalization” that normalizes post-activations using a proxy distribution that emulates batch normalization’s behavior and consistently matches or exceeds its performance.

NanoBatch DPSGD: Exploring Differentially Private learning on ImageNet with low batch sizes on the IPU

It is argued that low batch sizes with group normalization on ResNet-50 can yield high accuracy and privacy on Graphcore IPUs, enabling DPSGD training of ResNet-50 on ImageNet in just 6 hours (100 epochs) on an IPU-POD16 system.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

A new scaling method is proposed that uniformly scales all dimensions of depth, width, and resolution using a simple yet highly effective compound coefficient; its effectiveness is demonstrated by scaling up MobileNets and ResNet.
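The compound coefficient can be sketched as follows; the default alpha, beta, gamma values are the base multipliers the EfficientNet paper reports from its grid search, while the function name is illustrative:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Depth, width and resolution multipliers for compound coefficient phi.

    FLOPs grow roughly as (alpha * beta**2 * gamma**2) ** phi, and the
    paper's constraint alpha * beta^2 * gamma^2 ~ 2 keeps that near 2**phi.
    """
    return alpha ** phi, beta ** phi, gamma ** phi

depth, width, res = compound_scale(2)  # one network, two doublings of FLOPs
```

A single scalar phi thus trades off all three dimensions at once, instead of tuning depth, width, and resolution independently.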

EfficientNetV2: Smaller Models and Faster Training

An improved method of progressive learning is proposed that adaptively adjusts regularization (e.g., dropout and data augmentation) along with image size; the resulting models significantly outperform previous models on ImageNet and the CIFAR/Cars/Flowers datasets.

Characterizing signal propagation to close the performance gap in unnormalized ResNets

A simple set of analysis tools to characterize signal propagation on the forward pass is proposed, and it is shown that the signal in networks with ReLU or Swish activation functions is preserved by ensuring that the per-channel activation means do not grow with depth.

High-Performance Large-Scale Image Recognition Without Normalization

An adaptive gradient clipping technique is developed which overcomes the training instabilities of networks without batch normalization, and a significantly improved class of Normalizer-Free ResNets is designed that attains substantially better performance when finetuning on ImageNet.
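The clipping rule can be sketched as follows: rescale a gradient whenever its norm exceeds a fixed fraction of the corresponding weight norm. This is a simplified per-tensor version (the paper applies the rule unit-wise, i.e. per output channel), with illustrative names:

```python
import math

def adaptive_grad_clip(grad, weight, clip=0.01, eps=1e-3):
    """Rescale `grad` whenever ||grad|| exceeds clip * ||weight||,
    so the relative update size stays bounded regardless of scale."""
    g_norm = math.sqrt(sum(g * g for g in grad))
    w_norm = max(math.sqrt(sum(w * w for w in weight)), eps)
    if g_norm > clip * w_norm:
        scale = clip * w_norm / g_norm
        grad = [g * scale for g in grad]
    return grad

# A gradient of norm 5 against weights of norm 1, clip=1.0 -> norm 1.
clipped = adaptive_grad_clip([3.0, 4.0], [1.0, 0.0], clip=1.0)
```

Bounding the gradient-to-weight ratio, rather than the raw gradient norm, is what makes the clipping "adaptive" to each layer's scale.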

Training Deep Nets with Sublinear Memory Cost

This work designs an algorithm that costs O(√n) memory to train an n-layer network, at the computational cost of only one extra forward pass per mini-batch, showing that computation can be traded for memory to obtain a more memory-efficient training algorithm with little extra computation.
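The O(√n) figure follows from a simple count: store only one activation per checkpoint boundary, and re-materialize one segment at a time during the backward pass. A toy accounting (illustrative names, not the paper's implementation) shows the optimum segment length landing near √n:

```python
import math

def peak_stored_activations(n_layers, segment):
    """Activations resident in memory with checkpointing: one per segment
    boundary, plus one re-materialized segment during the backward pass."""
    return math.ceil(n_layers / segment) + segment

n = 100
best = min(range(1, n + 1), key=lambda s: peak_stored_activations(n, s))
# without checkpointing all n activations are stored; the best segment
# length is near sqrt(n), for roughly 2*sqrt(n) stored activations
```

For n = 100 the optimum is a segment of 10 layers, storing 20 activations instead of 100, at the cost of recomputing each segment's forward pass once.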

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

GPipe is introduced, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size

This work proposes a small DNN architecture called SqueezeNet, which achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters and is able to compress to less than 0.5MB (510x smaller than AlexNet).

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
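The normalization step itself is short: standardize each feature with its batch statistics, then apply a learned scale and shift. A minimal single-feature sketch (illustrative name; real layers track running statistics for inference):

```python
def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-norm forward pass for one feature over a batch `x`:
    normalize with the batch mean and variance, then apply the
    learned scale (gamma) and shift (beta)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in x]

y = batch_norm([1.0, 3.0])  # standardized to roughly [-1, 1]
```

The dependence of `mean` and `var` on the whole batch is exactly the batch dependence that the proxy-norm and normalizer-free papers above seek to remove.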

Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network

Detailed experiments validate that carefully assembling these techniques and applying them to basic CNN models improves the accuracy and robustness of the models while minimizing the loss of throughput; the resulting improvements to backbone network performance also significantly boost transfer learning performance.