Origami: A Convolutional Network Accelerator

@inproceedings{Cavigelli2015OrigamiAC,
  title={Origami: A Convolutional Network Accelerator},
  author={Lukas Cavigelli and David Gschwend and Christoph Mayer and Samuel Willi and Beat Muheim and Luca Benini},
  booktitle={Proceedings of the 25th edition on Great Lakes Symposium on VLSI},
  year={2015}
}
  year={2015}
}
Today, advanced computer vision (CV) systems of ever-increasing complexity are being deployed in a growing number of application scenarios with strong real-time and power constraints. Current trends in CV clearly show a rise of neural network-based algorithms, which have recently broken many records in object detection and localization. These approaches are very flexible and can be used to tackle many different challenges by only changing their parameters. In this paper, we present the first…
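The core computation being accelerated is the sliding-window multiply-accumulate (MAC) of a convolutional layer. For reference, a minimal Python sketch of that operation follows (illustrative only; Origami's fixed-point datapath, tiling, and I/O scheduling are not modeled):

```python
import numpy as np

def conv_layer(x, w):
    """Naive convolutional layer: the nested multiply-accumulate (MAC)
    loops that ConvNet accelerators implement in hardware.
    x: input feature maps, shape (C_in, H, W)
    w: filter bank, shape (C_out, C_in, K, K)
    """
    c_out, _, k, _ = w.shape
    _, h, width = x.shape
    y = np.zeros((c_out, h - k + 1, width - k + 1))
    for o in range(c_out):                       # one output map per filter
        for i in range(h - k + 1):
            for j in range(width - k + 1):
                # C_in * K * K MACs per output pixel: the dominant workload
                y[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return y
```

Retargeting the network to a new task then amounts to loading a different filter bank w, which is the flexibility the abstract refers to.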
Origami: A 803-GOp/s/W Convolutional Network Accelerator
  • L. Cavigelli, L. Benini
  • Computer Science
  • IEEE Transactions on Circuits and Systems for Video Technology
  • 2017
TLDR
A new architecture, design, and implementation are presented, along with the first reported silicon measurements of such an accelerator, which outperforms previous work in terms of power, area, and I/O efficiency.
Towards energy-efficient convolutional neural network inference
TLDR
This thesis first evaluates the capabilities of off-the-shelf, software-programmable hardware before diving into specialized hardware accelerators and exploring the potential of extremely quantized CNNs, giving special consideration to external memory bandwidth.
YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights
TLDR
A hardware accelerator optimized for BinaryConnect CNNs is presented that achieves 1510 GOp/s on a core area of only 1.33 MGE, with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V.
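The arithmetic trick behind binary-weight accelerators such as YodaNN: with weights constrained to {-1, +1}, every multiplication degenerates into an add or a subtract. A minimal sketch, assuming nothing about the chip's actual fixed-point widths or memory organization:

```python
import numpy as np

def binary_weight_dot(x, w_sign):
    """Dot product with weights constrained to {-1, +1} (BinaryConnect-style).
    Every multiplication degenerates into an add or a subtract, which is what
    lets a binary-weight accelerator drop its hardware multipliers entirely.
    x: activation vector; w_sign: array of +1/-1 weights.
    """
    acc = 0.0
    for xi, wi in zip(x, w_sign):
        acc = acc + xi if wi > 0 else acc - xi   # multiplier-free MAC
    return acc

# sanity check against the ordinary multiply-accumulate formulation
x = np.array([0.5, -1.25, 2.0])
w = np.array([1, -1, -1])
assert binary_weight_dot(x, w) == np.dot(x, w)
```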
Extended Bit-Plane Compression for Convolutional Neural Network Accelerators
  • L. Cavigelli, L. Benini
  • Computer Science
  • 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)
  • 2019
TLDR
This work introduces and evaluates a novel, hardware-friendly compression scheme for the feature maps present within convolutional neural networks, showing that an average compression ratio of 4.4× relative to uncompressed data and a gain of 60% over existing methods can be achieved for ResNet-34 with a compression block requiring <300 bit of sequential cells and minimal combinational logic.
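For intuition on why bit-plane coding suits feature maps: ReLU activations are sparse and small-valued, so once split into bit planes the upper planes are nearly all zero and compress cheaply. The sketch below is generic bit-plane slicing only, not the paper's extended coding scheme:

```python
import numpy as np

def to_bit_planes(fmap, bits=8):
    """Split an 8-bit feature map into bit planes: plane b collects bit b of
    every activation. ReLU outputs are sparse and small-valued, so the upper
    planes are almost entirely zero and compress well with simple coding.
    """
    flat = fmap.astype(np.uint8).ravel()
    return [(flat >> b) & 1 for b in range(bits)]   # LSB plane first

fmap = np.maximum(np.random.randn(8, 8) * 16.0, 0).astype(np.uint8)  # ReLU-like
planes = to_bit_planes(fmap)
print([int(p.sum()) for p in planes])   # ones per plane; top planes near zero
```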
YodaNN: An Architecture for Ultralow Power Binary-Weight CNN Acceleration
TLDR
This paper presents an accelerator optimized for binary-weight CNNs that significantly outperforms the state of the art in terms of energy and area efficiency, removes the need for expensive multiplications, and reduces I/O bandwidth and storage.
A high utilization FPGA-based accelerator for variable-scale convolutional neural network
TLDR
An optimization framework is proposed that solves the boundary problem and connects the accelerator to ARM processors and DDR4 memory through a dual Advanced eXtensible Interface (AXI) bus.
Snowflake: An efficient hardware accelerator for convolutional neural networks
TLDR
Snowflake is presented: a scalable, efficient, low-power accelerator that is agnostic to CNN architecture and achieves an average computational efficiency of 91%, significantly higher than comparable architectures.
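Computational efficiency here is conventionally utilization: achieved throughput divided by the datapath's peak. A sketch of the accounting, with illustrative symbols (the paper's exact bookkeeping may differ):

```latex
% computational efficiency as utilization of the peak datapath throughput,
% counting 2 operations (one multiply, one add) per MAC unit per cycle
\eta \;=\; \frac{\text{achieved Op/s}}{\text{peak Op/s}}
     \;=\; \frac{\text{achieved Op/s}}{2 \, N_{\mathrm{MAC}} \, f_{\mathrm{clk}}}
```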
Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine
TLDR
Hyperdrive is presented: a binary-weight network (BWN) accelerator that dramatically reduces I/O bandwidth through a novel binary-weight streaming approach and that supports arbitrarily sized CNN architectures and input resolutions by exploiting the natural scalability of its compute units at both chip and system level.
A high-efficiency runtime reconfigurable IP for CNN acceleration on a mid-range all-programmable SoC
TLDR
This work presents an accelerator architecture, suitable for mid- to high-range FPGA devices, that can be reconfigured at runtime to adapt to the different filter sizes of different convolution layers, achieving up to 120 GMAC/s (16-bit precision) when executing 5×5 filters.
A 20 TOp/s/W Binary Neural Network Accelerator
  • X. Huang, Yuteng Zhou
  • Computer Science
  • 2019 IEEE International Symposium on Circuits and Systems (ISCAS)
  • 2019
TLDR
The proposed low-power hardware architecture for BNNs enables deep learning on mobile embedded platforms and achieves a power efficiency of 20 TOp/s/W, far exceeding most mainstream CNN chips.
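When activations are binarized as well, as in a BNN, the multiplier-free idea goes one step further: pack the signs as bits and a dot product collapses into XNOR plus popcount. A minimal sketch of that standard trick (the ISCAS design's actual datapath is not modeled):

```python
def xnor_popcount_dot(a_bits, w_bits, n):
    """Fully binary dot product: with activations and weights both in {-1, +1}
    and their signs packed as 0/1 bits, multiply-accumulate collapses into
    XNOR followed by a popcount. If m bits match, the dot product over n
    elements is m - (n - m) = 2*m - n.
    """
    matches = bin(~(a_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

# a = [+1, -1, +1] packed as 0b101, w = [-1, +1, +1] packed as 0b110
# (bit i = element i): dot = (+1)(-1) + (-1)(+1) + (+1)(+1) = -1
print(xnor_popcount_dot(0b101, 0b110, 3))   # -> -1
```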

References

Showing 1-10 of 24 references
Accelerating real-time embedded scene labeling with convolutional networks
TLDR
This paper presents an optimized convolutional network implementation suitable for real-time scene labeling on embedded platforms and demonstrates that, for scene labeling, this approach achieves a 1.5× improvement in throughput over a modern desktop CPU at a power budget of only 11 W.
A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters
TLDR
This work proposes to augment many-core architectures based on shared-memory clusters of power-optimized RISC processors with Hardware Convolution Engines (HWCEs): ultra-low-energy coprocessors for accelerating convolutions, the main building block of many brain-inspired computer vision algorithms.
A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks
TLDR
The nn-X system is presented: a scalable, low-power coprocessor for enabling real-time execution of deep neural networks, able to achieve a peak performance of 227 G-ops/s, which translates to a performance-per-power improvement of 10 to 100 times over conventional mobile and desktop processors.
CNP: An FPGA-based processor for Convolutional Networks
TLDR
The implementation exploits the inherent parallelism of ConvNets and takes full advantage of multiple hardware multiply-accumulate units on the FPGA; it can be used for low-power, lightweight embedded vision systems for micro-UAVs and other small robots.
NeuFlow: A runtime reconfigurable dataflow processor for vision
In this paper we present a scalable dataflow hardware architecture optimized for the computation of general-purpose vision algorithms — neuFlow — and a dataflow compiler — luaFlow — that transforms…
DRAM or no-DRAM? Exploring linear solver architectures for image domain warping in 28 nm CMOS
TLDR
The results emphasize that DRAM-free accelerators are an attractive choice in terms of power consumption and overall system complexity, even though they require more logic silicon area than accelerators that make use of external DRAM.
Caffe: Convolutional Architecture for Fast Feature Embedding
TLDR
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
NeuFlow: Dataflow vision processing system-on-a-chip
TLDR
The neuFlow SoC was designed to accelerate neural networks and other complex vision algorithms based on large numbers of convolutions and matrix-to-matrix operations; post-layout characterization shows that the system delivers up to 320 GOPS with an average power consumption of 0.6 W.
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition…
cuDNN: Efficient Primitives for Deep Learning
TLDR
A library similar in intent to BLAS, with optimized routines for deep learning workloads, is presented; it contains routines for GPUs and, like the BLAS library, could be implemented for other platforms.
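The BLAS analogy is concrete: via im2col, a convolution lowers to a single matrix multiplication, the basic form of the GEMM approach that libraries such as cuDNN build on (cuDNN itself favors a more memory-efficient implicit variant). A minimal sketch:

```python
import numpy as np

def im2col(x, k):
    """Unfold KxK input patches into columns so that an entire convolutional
    layer becomes one matrix multiplication, which a BLAS-style GEMM routine
    can then execute at near-peak throughput.
    x: input of shape (C, H, W); returns a (C*K*K, num_patches) matrix.
    """
    c, h, w = x.shape
    cols = [x[:, i:i + k, j:j + k].ravel()
            for i in range(h - k + 1) for j in range(w - k + 1)]
    return np.stack(cols, axis=1)

# convolution as GEMM: (C_out, C*K*K) @ (C*K*K, P) -> (C_out, P) output maps
x = np.random.randn(3, 8, 8)                 # 3 input channels, 8x8
filters = np.random.randn(16, 3 * 3 * 3)     # 16 filters of size 3x3x3
y = filters @ im2col(x, 3)                   # a single GEMM call
print(y.shape)                               # (16, 36) -> 16 maps of 6x6
```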