This work introduces "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
This work proposes a small DNN architecture called SqueezeNet, which achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters and is able to compress to less than 0.5MB (510x smaller than AlexNet).
A method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections, and prunes redundant connections using a three-step method.
This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
An energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing and is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression.
A deadlock-free routing algorithm can be generated for arbitrary interconnection networks using the concept of virtual channels, which is used to develop deadlocked routing algorithms for k-ary n-cubes, for cube-connected cycles, and for shuffle-exchange networks.
The concept of on-chip networks is introduced, a simple network is sketched, and some challenges in the architecture and design of these networks are discussed.
This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.
The Sparse CNN (SCNN) accelerator architecture is introduced, which improves performance and energy efficiency by exploiting thezero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator.
The dragonfly topology is introduced which uses a group of high-radix routers as a virtual router to increase the effective radix of the network and the use of selective virtual-channel discrimination and theUse of credit round-trip latency to both sense and signal channel congestion gives throughput and latency that approaches that of an ideal adaptive routing algorithm.