Rethinking the Inception Architecture for Computer Vision

@inproceedings{Szegedy2016RethinkingTI,
  title={Rethinking the Inception Architecture for Computer Vision},
  author={Christian Szegedy and Vincent Vanhoucke and Sergey Ioffe and Jonathon Shlens and Zbigniew Wojna},
  booktitle={2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2016},
  pages={2818--2826}
}
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. […] Key Result: With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error on the validation set and 3.6% top-5 error on the official test set.
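A central technique of the paper is factorizing large convolutions into stacks of smaller ones. A minimal sketch of the parameter-count argument (the channel count `c = 64` is an illustrative assumption, not a figure from the paper):

```python
def conv_params(kernel, in_ch, out_ch):
    """Weight count of a single 2-D convolution (bias ignored)."""
    return kernel * kernel * in_ch * out_ch

c = 64  # example channel count, chosen for illustration
one_5x5 = conv_params(5, c, c)      # a single 5x5 layer
two_3x3 = 2 * conv_params(3, c, c)  # two stacked 3x3 layers, same 5x5 receptive field
savings = 1 - two_3x3 / one_5x5
print(f"{savings:.0%} fewer parameters")  # prints "28% fewer parameters"
```

The 9/25 vs. 2·(9/25) ratio is independent of the channel count, so the 28% saving holds for any `c`.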
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
TLDR
A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient; the method's effectiveness is demonstrated by scaling up MobileNets and ResNet.
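The compound-scaling rule in that summary can be sketched as follows. The base coefficients `alpha`, `beta`, `gamma` are the grid-searched values reported in the EfficientNet paper; the network itself is omitted and `phi` is the user-chosen compound coefficient:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Scale depth, width, and resolution jointly from one coefficient phi."""
    depth_mult = alpha ** phi       # multiplier on the number of layers
    width_mult = beta ** phi        # multiplier on channels per layer
    resolution_mult = gamma ** phi  # multiplier on input image resolution
    return depth_mult, width_mult, resolution_mult

d, w, r = compound_scale(phi=2)  # all three dimensions grow together
```

Raising one `phi` rather than tuning three knobs independently is the whole point: the dimensions stay in a fixed ratio as the model grows.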
Transfer learning in computer vision tasks: Remember where you come from
Attention Based Pruning for Shift Networks
TLDR
Shift Attention Layers are introduced, which extend SLs with an attention mechanism that learns which shifts are best while the network function is trained; they outperform vanilla SLs on various object recognition benchmarks while significantly reducing the number of floating-point operations and parameters at inference.
Learning to Resize Images for Computer Vision Tasks
TLDR
This work shows that the typical linear resizer can be replaced with learned resizers that substantially improve performance, and proposes a CNN-based image resizer that creates machine-friendly visual manipulations leading to a consistent improvement in the end-task metric over the baseline model.
Analysis and Optimization of Convolutional Neural Network Architectures
TLDR
A model is developed that has only one million learned parameters for an input size of 32x32x3 and 100 classes, and that beats the state of the art on the benchmark datasets Asirra, GTSRB, HASYv2, and STL-10.
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
TLDR
Clear empirical evidence is given that training with residual connections significantly accelerates the training of Inception networks, and several new streamlined architectures for both residual and non-residual Inception networks are presented.
R-MnasNet: Reduced MnasNet for Computer Vision
TLDR
A new architecture, R-MnasNet (Reduced MnasNet), is introduced, which has a model size of 3 MB; trained on CIFAR-10, it achieves a validation accuracy of 91.13%.
RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?
TLDR
It is indicated that MLP-based models have the potential to replace CNNs by adopting inductive bias, and the proposed model, named RaftMLP, achieves a good balance of computational complexity, parameter count, and actual memory usage.
Convolutional Recurrent Neural Networks for Better Image Understanding
TLDR
This work introduces a recurrent unit able to keep and process spatial information throughout the network, and shows that its approach based on higher resolution input is better able to detect details of the images such as the precise number of objects, and the presence of smaller objects, while being less sensitive to biases in the label distribution of the training set.
Improving Efficiency in Deep Learning for Large Scale Visual Recognition
TLDR
To reduce the complexity of large-scale classification to sub-linear in the number of classes, a probabilistic label tree framework is proposed; the average complexity of the framework is significantly reduced while overall accuracy remains the same as with a single complex model.
...
...

References

SHOWING 1-10 OF 26 REFERENCES
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
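The "very small convolution filters" argument rests on a simple receptive-field identity: stacking stride-1 3x3 layers grows the effective receptive field linearly, at lower parameter cost than one large filter. A minimal sketch (stride 1, no pooling assumed):

```python
def receptive_field(num_layers, kernel=3):
    """Effective receptive field of a stack of stride-1 conv layers."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1  # each layer widens the field by (kernel - 1)
    return rf

receptive_field(2)  # two 3x3 layers see a 5x5 region
receptive_field(3)  # three 3x3 layers see a 7x7 region
```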
Scalable Object Detection Using Deep Neural Networks
TLDR
This work proposes a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest.
Large-Scale Video Classification with Convolutional Neural Networks
TLDR
This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
TLDR
This work proposes a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit and derives a robust initialization method that particularly considers the rectifier nonlinearities.
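The PReLU activation described above is compact enough to sketch directly. In the paper the negative slope `a` is a learned per-channel parameter; here it is simply passed as an argument:

```python
import numpy as np

def prelu(x, a):
    """PReLU: identity for x >= 0, slope a for x < 0.
    a = 0 recovers ReLU; a small fixed a recovers leaky ReLU."""
    return np.where(x >= 0, x, a * x)

prelu(np.array([-2.0, 0.0, 1.5]), a=0.25)  # -> [-0.5, 0.0, 1.5]
```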
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
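The dropout regularizer mentioned in that summary can be sketched in a few lines. This is the "inverted" formulation commonly used today (rescaling at training time), not necessarily the exact 2012 implementation:

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each activation with probability p during
    training and rescale survivors by 1/(1-p) so the expected value is
    unchanged; at test time the layer is the identity."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = dropout(np.ones(8), p=0.5, rng=rng)  # entries are either 0.0 or 2.0
```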
Deeply-Supervised Nets
TLDR
The proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent, and extends techniques from stochastic gradient methods to analyze the algorithm.
Fully convolutional networks for semantic segmentation
TLDR
The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
TLDR
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
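The normalization step at the heart of that result is small; a NumPy sketch of training-mode batch normalization over a 2-D batch, with learned scale `gamma` and shift `beta` (running averages for inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then apply the
    learned scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(size=(32, 4))
y = batch_norm(x, gamma=1.0, beta=0.0)
# each column of y now has mean ~0 and standard deviation ~1
```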
...
...