• Corpus ID: 1180618

Exploring the Design Space of Deep Convolutional Neural Networks at Large Scale

@article{Iandola2016ExploringTD,
  title={Exploring the Design Space of Deep Convolutional Neural Networks at Large Scale},
  author={Forrest N. Iandola},
  journal={ArXiv},
  year={2016},
  volume={abs/1612.06519}
}
In recent years, the research community has discovered that deep neural networks (DNNs) and convolutional neural networks (CNNs) can yield higher accuracy than all previous solutions to a broad array of machine learning problems. To our knowledge, there is no single CNN/DNN architecture that solves all problems optimally. Instead, the “right” CNN/DNN architecture varies depending on the application at hand. CNN/DNNs comprise an enormous design space. Quantitatively, we find that a small region… 
PositNN: Tapered Precision Deep Learning Inference for the Edge
TLDR
This work proposes an ultra-low-precision deep neural network, PositNN, that uses posits during inference and outperforms other low-precision neural networks as well as a 32-bit floating-point baseline network.
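For context, a posit encodes a real value with a sign bit, a variable-length regime, an optional exponent field, and a fraction. The sketch below only illustrates that format and is not the PositNN implementation: a minimal Python decoder for an 8-bit posit with es = 0 (so useed = 2), assuming the standard two's-complement handling of negative bit patterns.

```python
def decode_posit8_es0(bits: int) -> float:
    """Decode an 8-bit, es=0 posit bit pattern (illustrative, not PositNN)."""
    n = 8
    bits &= 0xFF
    if bits == 0x00:
        return 0.0
    if bits == 0x80:                       # the single "not-a-real" pattern
        return float("inf")
    sign = -1.0 if (bits >> (n - 1)) & 1 else 1.0
    if sign < 0:                           # negative posits: decode the 2's complement
        bits = (-bits) & 0xFF
    # Regime: run of identical bits following the sign bit, MSB first.
    rest = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]
    run = 1
    while run < len(rest) and rest[run] == rest[0]:
        run += 1
    k = run - 1 if rest[0] == 1 else -run
    # Skip the regime terminator; whatever remains is the fraction.
    frac = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(rest[run + 1:]))
    # With es = 0, useed = 2, so the value is 2^k * (1 + fraction).
    return sign * (2.0 ** k) * (1.0 + frac)

assert decode_posit8_es0(0x40) == 1.0      # 0b01000000
assert decode_posit8_es0(0x48) == 1.25     # 0b01001000
assert decode_posit8_es0(0xC0) == -1.0     # 2's complement of 0x40
```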
Fine-Grained Energy and Performance Profiling framework for Deep Convolutional Neural Networks
TLDR
Surprisingly, it is possible to refine the model to predict the number of SIMD instructions and main memory accesses solely from the application's Multiply-Accumulate (MAC) counts, eliminating the need for actual measurements.
PositNN Framework: Tapered Precision Deep Learning Inference for the Edge
TLDR
This work proposes a deep neural network framework, PositNN, that uses the posit numerical format and exact dot-product operations during inference and outperforms other low-precision neural networks across all tasks.
Layer-Centric Memory Reuse and Data Migration for Extreme-Scale Deep Learning on Many-Core Architectures
TLDR
Layrub, a runtime data placement strategy that orchestrates the execution of the training process, achieves layer-centric memory reuse and reduces memory consumption enough to run extreme-scale deep learning workloads that previously could not fit on a single GPU.
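As a rough illustration of the layer-centric idea (not Layrub's actual implementation), the sketch below offloads each layer's activation to host memory after its forward pass and migrates it back only when that layer's backward pass needs it; the two dictionaries are stand-ins for real device memory and pinned host buffers.

```python
import numpy as np

gpu_memory, host_memory = {}, {}       # stand-ins for device memory and host DRAM

def forward_layer(layer_id, x):
    y = np.maximum(x, 0.0)             # toy layer: ReLU
    host_memory[layer_id] = y          # offload the activation to the host...
    gpu_memory.pop(layer_id, None)     # ...and release the device-side copy
    return y

def backward_layer(layer_id, grad_out):
    y = host_memory.pop(layer_id)      # migrate the activation back on demand
    gpu_memory[layer_id] = y
    return grad_out * (y > 0)          # ReLU backward uses the restored activation

x = np.random.randn(4, 8).astype(np.float32)
h = forward_layer(0, x)
dx = backward_layer(0, np.ones_like(h))
```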
Keynote: small neural nets are beautiful: enabling embedded systems with small deep-neural-network architectures
  • Forrest N. Iandola, K. Keutzer
  • Computer Science
    2017 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)
  • 2017
TLDR
This work has found the basic principles of design space exploration used to develop embedded microprocessor architectures to be highly applicable to the design of deep neural network architectures, and has used these design principles to create a novel deep neural network called SqueezeNet that requires only 480KB of storage for its model parameters.
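SqueezeNet's small parameter footprint comes largely from its Fire module, which squeezes the channel count with 1x1 convolutions before expanding through parallel 1x1 and 3x3 branches. The PyTorch sketch below shows that structure; the channel sizes match one early Fire module but are otherwise illustrative.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # Concatenate the two expand branches along the channel dimension.
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Example: a Fire module on a 55x55 feature map with 96 input channels.
y = Fire(96, 16, 64, 64)(torch.randn(1, 96, 55, 55))
print(y.shape)  # torch.Size([1, 128, 55, 55])
```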
Fine-grained energy profiling for deep convolutional neural networks on the Jetson TX1
TLDR
This work presents a novel evaluation framework for measuring energy and performance of deep neural networks, using ARM's Streamline Performance Analyser integrated with standard deep learning frameworks such as Caffe and cuDNN v5.
SyNERGY: An energy measurement and prediction framework for Convolutional Neural Networks on Jetson TX1
TLDR
This work proposes "SyNERGY" a fine-grained energy measurement (that is, at specific layers) and prediction framework for deep neural networks on embedded platforms and finds that it is possible to refine the model to predict the number of SIMD instructions and main memory accesses solely from the application’s Multiply-Accumulate (MAC) counts.
A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks
  • Youjie Li, Jongse Park, N. Kim
  • Computer Science
    2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
  • 2018
TLDR
This paper sets out to reduce the significant communication cost of distributed training by embedding data compression accelerators in the Network Interface Cards (NICs), and proposes an aggregator-free training algorithm that exchanges gradients in both legs of communication within the group, while the workers collectively perform the aggregation in a distributed manner.
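As a very rough sketch of the aggregator-free idea (the paper's compression runs in NIC hardware; here a plain fp16 cast stands in for it), every worker broadcasts its compressed gradient and performs the reduction locally, so no central parameter server is required.

```python
import numpy as np

n_workers, dim = 4, 1_000
rng = np.random.default_rng(0)
local_grads = [rng.standard_normal(dim).astype(np.float32) for _ in range(n_workers)]

# "Send" leg: each worker compresses its gradient before it goes on the wire.
on_the_wire = [g.astype(np.float16) for g in local_grads]

# Every worker receives all compressed gradients and reduces them locally,
# so no central aggregator (parameter server) is involved.
per_worker_result = [
    np.mean([g.astype(np.float32) for g in on_the_wire], axis=0)
    for _ in range(n_workers)
]
```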
Implementing Efficient, Portable Computations for Machine Learning
TLDR
This work proposes Boda, a framework for implementing artificial neural network computations based on metaprogramming, specialization, and autotuning, and explores the development of efficient convolution operations across various types of hardware within Boda.
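The autotuning component can be pictured as follows (illustrative only, not Boda's code): benchmark several candidate implementations of the same computation on the target machine and keep the fastest.

```python
import time
import numpy as np

x = np.random.rand(200_000).astype(np.float32)

# Three interchangeable ways to compute a sum of squares.
candidates = {
    "numpy_mul_sum": lambda v: float(np.sum(v * v)),
    "numpy_dot":     lambda v: float(np.dot(v, v)),
    "python_loop":   lambda v: sum(float(e) * float(e) for e in v),
}

def autotune(cands, arg, repeats=3):
    """Return the name of the fastest candidate on this machine and input."""
    best_name, best_time = None, float("inf")
    for name, fn in cands.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(arg)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

print(autotune(candidates, x))   # typically "numpy_dot"
```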
An On-Chip Learning Accelerator for Spiking Neural Networks using STT-RAM Crossbar Arrays
TLDR
This work presents a scheme for implementing on-chip learning on a digital non-volatile memory (NVM) based hardware accelerator for Spiking Neural Networks (SNNs), and shows ~20× higher performance per unit watt per unit area compared to a conventional SRAM-based design.
...

References

SHOWING 1-10 OF 223 REFERENCES
Convolutional neural networks at constrained time cost
  • Kaiming He, Jian Sun
  • Computer Science
    2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2015
TLDR
This paper investigates the accuracy of CNNs under constrained time cost, and presents an architecture that achieves very competitive accuracy in the ImageNet dataset, yet is 20% faster than “AlexNet” [14] (16.0% top-5 error, 10-view test).
Beyond short snippets: Deep networks for video classification
TLDR
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.
Striving for Simplicity: The All Convolutional Net
TLDR
It is found that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks.
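Concretely, the substitution looks like this in PyTorch (channel counts illustrative): the 2x2 max-pool in the first stack is replaced by a learned stride-2 convolution in the second, and both halve the spatial resolution.

```python
import torch
import torch.nn as nn

with_pool = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
all_conv = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),   # pooling replaced by strided conv
)

x = torch.randn(1, 64, 32, 32)
print(with_pool(x).shape, all_conv(x).shape)   # both: torch.Size([1, 64, 16, 16])
```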
Accelerating Deep Convolutional Neural Networks Using Specialized Hardware
TLDR
Hardware specialization in the form of GPGPUs, FPGAs, and ASICs offers a promising path towards major leaps in processing capability while achieving high energy efficiency, and combining multiple FPGAs over a low-latency communication fabric offers further opportunities to train and evaluate models of unprecedented size and quality.
Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
TLDR
This paper presents an in-depth analysis of state-of-the-art CNN models, shows that convolutional layers are computation-centric while fully-connected layers are memory-centric, and proposes a CNN accelerator design on an embedded FPGA for ImageNet large-scale image classification.
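A back-of-the-envelope calculation makes the compute-centric vs. memory-centric split concrete; the layer sizes below are illustrative (VGG-like) rather than taken from the paper, but the ratio of MACs to weight bytes differs by roughly three orders of magnitude.

```python
def conv_stats(h, w, cin, cout, k):
    macs = h * w * cin * cout * k * k          # each output pixel needs cin*k*k MACs per filter
    weight_bytes = cin * cout * k * k * 4      # fp32 weights
    return macs, weight_bytes

def fc_stats(nin, nout):
    macs = nin * nout
    weight_bytes = nin * nout * 4
    return macs, weight_bytes

layers = {
    "conv 56x56, 128->128, 3x3": conv_stats(56, 56, 128, 128, 3),
    "fc 4096->4096":             fc_stats(4096, 4096),
}
for name, (macs, wbytes) in layers.items():
    print(f"{name}: {macs/1e6:.0f} MMACs, {wbytes/1e6:.1f} MB of weights, "
          f"{macs/wbytes:.1f} MACs per weight byte")
```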
SpeeDO : Parallelizing Stochastic Gradient Descent for Deep Convolutional Neural Network
TLDR
SpeeDO uses off-the-shelf hardware to speed up CNN training, aiming to achieve two goals: improve deployability and cost-effectiveness, and serve as a benchmark on which software algorithmic approaches can be evaluated and improved.
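The underlying data-parallel pattern (illustrative, not SpeeDO's implementation): each worker computes a gradient on its shard of the mini-batch, the gradients are averaged, and every worker applies the same synchronous update.

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.standard_normal((256, 10)), rng.standard_normal(256)
w = np.zeros(10)
n_workers, lr = 4, 0.1

for step in range(100):
    shards = np.array_split(np.arange(len(X)), n_workers)
    grads = []
    for idx in shards:                    # each worker handles its shard (in parallel in practice)
        err = X[idx] @ w - y[idx]         # least-squares toy model
        grads.append(X[idx].T @ err / len(idx))
    w -= lr * np.mean(grads, axis=0)      # synchronous averaged update on every worker
```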
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
TLDR
This work implements a CNN accelerator on a VC707 FPGA board and compares it to previous approaches, achieving a peak performance of 61.62 GFLOPS at a 100 MHz working frequency, which outperforms previous approaches significantly.
Recent advances in convolutional neural networks
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
TLDR
This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
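The key point is that the proposal head and the detector read the same backbone feature map, so proposals come nearly for free. A minimal PyTorch sketch (the backbone, channel counts, and the 9-anchors-per-location assumption are illustrative):

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 256, 3, padding=1), nn.ReLU())  # shared features
rpn_conv = nn.Conv2d(256, 256, 3, padding=1)
rpn_cls  = nn.Conv2d(256, 9 * 2, 1)   # objectness scores, 9 anchors per location
rpn_reg  = nn.Conv2d(256, 9 * 4, 1)   # box regression deltas, 9 anchors per location

feat = backbone(torch.randn(1, 3, 224, 224))      # computed once, used by both heads
h = torch.relu(rpn_conv(feat))
scores, deltas = rpn_cls(h), rpn_reg(h)
print(scores.shape, deltas.shape)                 # [1, 18, 224, 224], [1, 36, 224, 224]
```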
...