• Publications
SCNN: An accelerator for compressed-sparse convolutional neural networks
TLDR
This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and the zero-valued activations that arise from the common ReLU operator.
vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
TLDR
The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU.
Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks
TLDR
We introduce a high-performance virtualization strategy based on a "compressing DMA engine" (cDMA) that drastically reduces the size of the data structures that are targeted for CPU-side allocations, improving the performance of virtualized DNNs by an average of 53% when evaluated on an NVIDIA Titan Xp.
Priority-based cache allocation in throughput processors
TLDR
We propose priority-based cache allocation (PCAL), which provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower-priority threads to execute without contending for the cache.
A locality-aware memory hierarchy for energy-efficient GPU architectures
TLDR
Our locality-aware memory hierarchy improves GPU performance, energy efficiency, and memory throughput for a large range of applications with irregular control flow and memory access patterns.
The dual-path execution model for efficient GPU control flow
  • Minsoo Rhu, M. Erez
  • Computer Science
  • IEEE 19th International Symposium on High…
  • 23 February 2013
TLDR
We show that dual-path execution can be implemented with only modest changes to current hardware, and that parallelism is increased without sacrificing optimal (structured) control-flow reconvergence.
Architecting an Energy-Efficient DRAM System for GPUs
TLDR
This paper proposes an energy-efficient, high-throughput DRAM architecture for GPUs and throughput processors.
TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning
TLDR
We present our vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-memory processing cores tailored for DL tensor operations.
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
TLDR
This paper presents an in-depth analysis of the causes of ineffective compaction.
CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures
  • Minsoo Rhu, M. Erez
  • Computer Science
  • 39th Annual International Symposium on Computer…
  • 9 June 2012
TLDR
This paper proposes a fundamentally new approach to branch compaction that avoids the unnecessary synchronization required by previous techniques and stalls only those threads that are likely to benefit from compaction.