Publications
Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
TLDR
This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding that together reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
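The three stages compose naturally: pruning leaves a sparse set of weights, quantization replaces each survivor with an index into a small shared codebook, and Huffman coding shrinks the index stream. A minimal numpy sketch of the pipeline, not the authors' code: the layer shape, pruning threshold, and 16-entry codebook are illustrative, and the reported ratio counts only the Huffman-coded index bits.

```python
import heapq
from collections import Counter

import numpy as np

def prune(weights, threshold):
    """Stage 1: zero out connections whose magnitude falls below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def quantize(nonzero, n_clusters=16):
    """Stage 2: k-means weight sharing -- surviving weights share a small codebook."""
    codebook = np.linspace(nonzero.min(), nonzero.max(), n_clusters)  # linear init
    for _ in range(10):                                   # a few Lloyd iterations
        idx = np.argmin(np.abs(nonzero[:, None] - codebook[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                codebook[k] = nonzero[idx == k].mean()
    idx = np.argmin(np.abs(nonzero[:, None] - codebook[None, :]), axis=1)
    return codebook, idx

def huffman_lengths(symbols):
    """Stage 3: Huffman-code the cluster indices; returns code length per symbol."""
    heap = [[f, [s, 0]] for s, f in Counter(symbols).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:] + hi[1:]:
            pair[1] += 1                                  # one level deeper in the tree
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heap[0][1:])

w = np.random.randn(256, 256).astype(np.float32)
pruned, mask = prune(w, threshold=0.5)
codebook, idx = quantize(pruned[mask])
lengths = huffman_lengths(idx.tolist())
index_bits = sum(lengths[s] for s in idx.tolist())
# Codebook and sparse-index overhead omitted; index bits vs. dense fp32 only.
print(f"~{w.size * 32 / index_bits:.0f}x smaller, counting index bits only")
```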
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
TLDR
This work proposes a small DNN architecture called SqueezeNet, which achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters and compresses to less than 0.5MB (510x smaller than AlexNet).
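Most of the parameter savings come from the paper's Fire module: a 1x1 "squeeze" layer that throttles the channel count feeding a mix of 1x1 and 3x3 "expand" filters. A minimal PyTorch rendering; the channel sizes follow the paper's fire2 configuration, and the 55x55 input is illustrative.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: 1x1 'squeeze' conv feeding parallel 1x1 and 3x3
    'expand' convs whose outputs are concatenated along channels."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

x = torch.randn(1, 96, 55, 55)
print(Fire(96, 16, 64, 64)(x).shape)   # torch.Size([1, 128, 55, 55])
```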
Learning both Weights and Connections for Efficient Neural Network
TLDR
A method that reduces the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, learning only the important connections and pruning redundant ones with a three-step train, prune, and retrain procedure.
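The three steps are: train normally to learn which connections matter, prune the small-magnitude weights, then retrain the survivors while holding the pruned weights at zero. A hedged PyTorch sketch of steps 2 and 3; the layer size, 90% sparsity, and single retraining step are illustrative.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer, sparsity=0.9):
    """Step 2: zero out the smallest-magnitude weights; returns the keep-mask."""
    w = layer.weight.data
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()
    w.mul_(mask)
    return mask

def retrain_step(layer, mask, loss, lr=1e-3):
    """Step 3: fine-tune survivors; masking the gradient keeps pruned weights zero."""
    loss.backward()
    layer.weight.data -= lr * layer.weight.grad * mask
    layer.weight.grad = None

layer = nn.Linear(512, 512)
mask = magnitude_prune(layer, sparsity=0.9)
x, target = torch.randn(8, 512), torch.randn(8, 512)
retrain_step(layer, mask, nn.functional.mse_loss(layer(x), target))
print(f"kept {int(mask.sum())}/{mask.numel()} weights")
```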
Principles and Practices of Interconnection Networks
TLDR
This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
EIE: Efficient Inference Engine on Compressed Deep Neural Network
TLDR
An energy-efficient inference engine (EIE) that performs inference directly on a compressed network model, accelerating the resulting sparse matrix-vector multiplication with weight sharing; it is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression.
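The core kernel is a sparse matrix-vector product in which stored weights are small codes into a shared table and zero input activations skip their entire column. A toy numpy sketch of that storage format and skip logic; the real engine uses 4-bit codes, relative row indices, and many parallel processing elements.

```python
import numpy as np

def spmv_shared(n_rows, col_ptr, row_idx, codes, codebook, x):
    """y = W @ x with W stored column-wise (CSC-style): each nonzero is a
    code indexing a shared weight table (weight sharing). Columns whose
    input activation is zero are skipped outright."""
    y = np.zeros(n_rows)
    for j, xj in enumerate(x):
        if xj == 0.0:                         # dynamic activation sparsity
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[p]] += codebook[codes[p]] * xj
    return y

# A 3x4 matrix with four nonzeros and a four-entry codebook.
codebook = np.array([-0.5, 0.25, 1.0, 2.0])
col_ptr = [0, 1, 1, 3, 4]        # column j owns entries col_ptr[j]:col_ptr[j+1]
row_idx = [0, 0, 2, 1]
codes   = [2, 3, 0, 1]
x = np.array([1.0, 0.0, 0.0, 2.0])
print(spmv_shared(3, col_ptr, row_idx, codes, codebook, x))   # [1.  0.5 0. ]
```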
Deadlock-Free Message Routing in Multiprocessor Interconnection Networks
TLDR
A deadlock-free routing algorithm can be generated for arbitrary interconnection networks using the concept of virtual channels, which is used to develop deadlock-free routing algorithms for k-ary n-cubes, for cube-connected cycles, and for shuffle-exchange networks.
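The idea is that splitting each physical link into virtual channels, and constraining which channel a packet may use, makes the channel-dependency graph acyclic. A simplified Python rendering for the k-ary n-cube case (dimension-order routing, always travelling in the increasing direction, switching virtual channel once a packet has wrapped around a ring); the paper's full construction numbers the channels more carefully.

```python
def next_hop(cur, dst, k):
    """One routing decision on a k-ary n-cube torus with two virtual channels
    per link: a packet uses VC1 until the wraparound in the current dimension
    is behind it, and VC0 afterwards, breaking the cycle around each ring.
    Returns (next_node, dim, vc), or None when cur == dst."""
    for dim, (c, d) in enumerate(zip(cur, dst)):
        if c == d:
            continue                      # this dimension is done; try the next
        nxt = list(cur)
        nxt[dim] = (c + 1) % k
        vc = 1 if c > d else 0            # c > d: the wraparound is still ahead
        return tuple(nxt), dim, vc
    return None

# Walk a packet from (2, 1) to (0, 2) on a 4-ary 2-cube, printing each hop.
cur, dst = (2, 1), (0, 2)
hop = next_hop(cur, dst, 4)
while hop:
    cur, dim, vc = hop
    print(f"-> node {cur} via dim {dim}, VC{vc}")
    hop = next_hop(cur, dst, 4)
```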
Route packets, not wires: on-chip interconnection networks
TLDR
The concept of on-chip networks is introduced, a simple network is sketched, and some challenges in the architecture and design of these networks are discussed.
Memory access scheduling
TLDR
This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure (bank, row, and column).
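The gain comes from issuing DRAM accesses out of order so that requests hitting a bank's already-open row complete without a precharge/activate cycle. A toy Python sketch of one such first-ready policy; the paper evaluates a family of schedulers, and this simplification ignores DRAM timing constraints.

```python
from collections import deque

def schedule(queue, open_rows):
    """Pick the next (bank, row, col) request: prefer one hitting a bank's
    open row (a column access only), else fall back to the oldest request."""
    for i in range(len(queue)):
        bank, row, col = queue[i]
        if open_rows.get(bank) == row:   # row hit: no precharge/activate needed
            return queue[i]
    return queue[0]                      # no hit: first-come-first-served

queue = deque([(0, 7, 3), (1, 4, 0), (0, 2, 5), (1, 4, 9)])
open_rows = {0: 2, 1: 4}                 # row currently latched in each bank
while queue:
    req = schedule(queue, open_rows)
    queue.remove(req)
    open_rows[req[0]] = req[1]           # accessing a row leaves it open
    print("issue", req)                  # row hits are issued ahead of (0, 7, 3)
```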
SCNN: An accelerator for compressed-sparse convolutional neural networks
TLDR
The Sparse CNN (SCNN) accelerator architecture is introduced, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator.
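The dataflow multiplies only nonzero weights by nonzero activations, scattering each product to the output coordinate it contributes to. A one-dimensional numpy illustration of that Cartesian-product idea; SCNN itself operates on full convolutional layers with tiled processing elements.

```python
import numpy as np

def sparse_conv1d(activations, weights):
    """'Valid' 1-D convolution computed only over nonzero pairs: form the
    Cartesian product of the nonzero activation and weight lists, and
    scatter each partial product to its output coordinate."""
    nz_a = [(i, a) for i, a in enumerate(activations) if a != 0.0]
    nz_w = [(j, w) for j, w in enumerate(weights) if w != 0.0]
    out = np.zeros(len(activations) - len(weights) + 1)
    for i, a in nz_a:
        for j, w in nz_w:
            o = i - j                    # output index for this pair
            if 0 <= o < len(out):
                out[o] += a * w
    return out

a = np.array([0.0, 1.0, 0.0, 0.0, 2.0, 0.0])
w = np.array([0.5, 0.0, -1.0])
print(sparse_conv1d(a, w))               # matches np.convolve(a, w[::-1], 'valid')
```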
Technology-Driven, Highly-Scalable Dragonfly Topology
TLDR
The dragonfly topology is introduced, which uses a group of high-radix routers as a virtual router to increase the effective radix of the network; selective virtual-channel discrimination and credit round-trip latency are used to both sense and signal channel congestion, yielding throughput and latency that approach those of an ideal adaptive routing algorithm.
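The group construction also fixes the achievable scale: with a routers per group, p terminals and h global channels per router, up to g = ah + 1 groups can be directly connected, for N = ap(ah + 1) terminals. A short worked sketch of that sizing arithmetic; the balanced a = 2p = 2h configuration follows the paper, and p = 4 is illustrative.

```python
def dragonfly_size(p, a, h):
    """Maximum terminals of a dragonfly: 'a' routers per group act as one
    virtual high-radix router; h global channels per router connect up to
    g = a*h + 1 groups, so N = a*p*(a*h + 1)."""
    g = a * h + 1
    return g, a * p * g

# Balanced configuration a = 2p = 2h, with p = 4; router radix k = p + (a-1) + h.
g, n = dragonfly_size(p=4, a=8, h=4)
print(f"{g} groups, {n} terminals from radix-{4 + (8 - 1) + 4} routers")
```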
...