SCNN: An accelerator for compressed-sparse convolutional neural networks
The Sparse CNN (SCNN) accelerator architecture is introduced; it improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and the zero-valued activations that arise from the common ReLU operator.
vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction
Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks
A high-performance virtualization strategy based on a "compressing DMA engine" (cDMA) that drastically reduces the size of the data structures that are targeted for CPU-side allocations is introduced.
TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning
This paper presents a vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-memory processing cores tailored for DL tensor operations, populated inside a GPU-centric system interconnect as a remote memory pool.
PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units
A case is made for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput.
Priority-based cache allocation in throughput processors
A priority-based cache allocation (PCAL) that provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower priority threads to execute without contending for the cache is proposed.
A locality-aware memory hierarchy for energy-efficient GPU architectures
This work designs and evaluates a locality-aware memory hierarchy for throughput processors, such as GPUs, that retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory.
Architecting an Energy-Efficient DRAM System for GPUs
This paper proposes an energy-efficient, high-throughput DRAM architecture for GPUs and throughput processors that exploits the hierarchical organization of a DRAM bank to reduce the minimum row activation granularity and can support parallel operations across the semi-independent subchannels.
The dual-path execution model for efficient GPU control flow
  • Minsoo Rhu, M. Erez
  • Computer Science
    IEEE 19th International Symposium on High…
  • 23 February 2013
This work proposes a change to the stack hardware in which the execution of two different paths can be interleaved and shows how dual-path execution can be implemented with only modest changes to current hardware and that parallelism is increased without sacrificing optimal (structured) control-flow reconvergence.
Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations
Centaur, a chiplet-based hybrid sparse-dense accelerator that addresses both the memory throughput challenges of embedding layers and the compute limitations of MLP layers, is presented.