Publications
Learning Structured Sparsity in Deep Neural Networks
TLDR
The results show that for CIFAR-10, regularization on layer depth can reduce a 20-layer Deep Residual Network to 18 layers while improving accuracy from 91.25% to 92.60%, which is still slightly higher than that of the original 32-layer ResNet.
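The structured-sparsity idea summarized above rests on a group Lasso penalty applied to whole structures (filters, channels, or entire layers). The following is a minimal NumPy sketch of that penalty, not the paper's implementation; the function name, the filter-wise grouping, and the regularization strength `lam` are illustrative assumptions.

```python
import numpy as np

def group_lasso_penalty(weight, axis):
    """Group Lasso over one grouping axis: sum of per-group L2 norms.

    For depth-wise structured sparsity, each group would be the full weight
    tensor of one residual block, so driving a group to zero removes a layer.
    Here the grouping is filter-wise on a conv weight of shape
    (out_channels, in_channels, kH, kW).
    """
    # Flatten everything except the grouping axis, then take per-group L2 norms.
    groups = np.moveaxis(weight, axis, 0).reshape(weight.shape[axis], -1)
    return np.sqrt((groups ** 2).sum(axis=1)).sum()

# Hypothetical usage: add the penalty to the task loss with strength lam.
w = np.random.randn(64, 3, 3, 3)                 # one conv layer's weights
lam = 1e-4                                        # assumed regularization strength
reg_loss = lam * group_lasso_penalty(w, axis=0)  # filter-wise grouping
```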
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
TLDR
This work mathematically proves the convergence of TernGrad under the assumption of a bound on gradients, and proposes layer-wise ternarizing and gradient clipping to improve its convergence.
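As a rough illustration of the ternarization and clipping steps mentioned above, here is a minimal per-layer sketch in NumPy. The clipping threshold (`clip_sigma` standard deviations) and the use of the layer-wise max as the scaler are assumptions for this example, not the paper's exact settings.

```python
import numpy as np

def ternarize_gradient(grad, clip_sigma=2.5, rng=None):
    """Stochastically ternarize a per-layer gradient to {-s, 0, +s}.

    Clip the gradient, use the layer-wise max magnitude as the scaler s,
    and keep each element with probability |g|/s so that the ternary
    gradient is an unbiased estimate of the clipped gradient.
    """
    rng = rng or np.random.default_rng()
    g = np.clip(grad, -clip_sigma * grad.std(), clip_sigma * grad.std())
    s = np.abs(g).max()
    if s == 0:
        return np.zeros_like(g)
    keep = rng.random(g.shape) < (np.abs(g) / s)   # Bernoulli(|g|/s) mask
    return s * np.sign(g) * keep

# Each worker would then communicate only the scaler s and the signs of the
# kept elements, instead of full-precision gradients.
g = np.random.randn(256, 128) * 0.01
g_tern = ternarize_gradient(g)
```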
PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning
TLDR
PipeLayer is presented, a ReRAM-based processing-in-memory (PIM) accelerator for CNNs that supports both training and testing. It proposes a highly parallel design based on the notions of parallelism granularity and weight replication, which enables highly pipelined execution of both training and testing without introducing the potential stalls of previous work.
A novel architecture of the 3D stacked MRAM L2 cache for CMPs
TLDR
This paper stacks MRAM-based L2 caches directly atop CMPs and compares them against SRAM counterparts in terms of performance and energy, and proposes two architectural techniques: a read-preemptive write buffer and an SRAM-MRAM hybrid L2 cache.
Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement
TLDR
The experimental results show that MRAM stacking offers competitive IPC performance with a large reduction in power consumption compared to SRAM and DRAM counterparts.
Learning Intrinsic Sparse Structures within Long Short-term Memory
TLDR
This work proposes Intrinsic Sparse Structures (ISS) in LSTMs to learn structurally sparse Long Short-Term Memory by reducing the sizes of basic structures within LSTM units, including input updates, gates, hidden states, cell states, and outputs.
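To make the ISS idea concrete, here is a simplified NumPy sketch of how one sparse component spans the LSTM's basic structures. The weight layout (four stacked gate blocks) and the grouping details are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def iss_group_norms(w_ih, w_hh, hidden_size):
    """Per-component L2 norms of (simplified) Intrinsic Sparse Structures.

    Component k collects the rows of the four gate blocks that produce hidden
    unit k, plus the columns of w_hh that consume it. Zeroing component k
    shrinks the gates, cell state, hidden state, and outputs by one unit.
    Assumed layouts: w_ih is (4*H, input_dim), w_hh is (4*H, H).
    """
    H = hidden_size
    norms = np.zeros(H)
    for k in range(H):
        parts = [w_ih[g * H + k, :] for g in range(4)]   # produce unit k
        parts += [w_hh[g * H + k, :] for g in range(4)]
        parts.append(w_hh[:, k])                          # consume unit k
        norms[k] = np.sqrt(sum((v ** 2).sum() for v in parts))
    return norms

# A group-Lasso penalty on these norms drives whole components to zero,
# letting the trained LSTM use a smaller effective hidden size.
```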
MoDNN: Local distributed mobile computing system for Deep Neural Network
TLDR
MoDNN is proposed, a local distributed mobile computing system for DNN applications that partitions already-trained DNN models onto several mobile devices to accelerate DNN computations by alleviating per-device computing cost and memory usage.
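A minimal sketch of the partitioning idea follows: splitting one fully connected layer's weight matrix across simulated devices so that no single device stores or computes the whole layer. This is only an illustration of layer partitioning in general, not MoDNN's actual scheme, which also balances heterogeneous device capabilities.

```python
import numpy as np

def partition_fc_layer(weight, num_devices):
    """Split an FC layer's weight matrix row-wise into per-device shards;
    each shard covers a slice of the output neurons."""
    return np.array_split(weight, num_devices, axis=0)

def distributed_forward(shards, x):
    # In a real deployment each shard would run on a different phone;
    # here the 'devices' are simulated sequentially and their partial
    # outputs are concatenated.
    return np.concatenate([w @ x for w in shards])

W = np.random.randn(1024, 512)
x = np.random.randn(512)
shards = partition_fc_layer(W, num_devices=4)
assert np.allclose(distributed_forward(shards, x), W @ x)
```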
Faster CNNs with Direct Sparse Convolutions and Guided Pruning
TLDR
This work develops an efficient, general sparse-with-dense matrix multiplication implementation applicable to convolution of feature maps with kernels of arbitrary sparsity patterns, along with a performance model that predicts the sweet spots of sparsity levels for different layers and different computer architectures.
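The sketch below illustrates the sparse-with-dense multiplication view of convolution in NumPy: only the nonzero (unpruned) kernel weights touch the lowered feature map. It is a didactic version under assumed layouts; the paper's direct sparse convolution avoids the explicit `im2col` lowering used here.

```python
import numpy as np

def im2col(x, kh, kw):
    """Lower a (C, H, W) feature map to a (C*kh*kw, oh*ow) dense matrix."""
    C, H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, oh * ow))
    idx = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[c, i:i + oh, j:j + ow].reshape(-1)
                idx += 1
    return cols, oh, ow

def sparse_dense_conv(kernel, x):
    """Valid convolution as sparse-with-dense matmul over a pruned kernel
    of shape (out_ch, in_ch, kh, kw)."""
    O, C, kh, kw = kernel.shape
    cols, oh, ow = im2col(x, kh, kw)
    flat = kernel.reshape(O, -1)
    out = np.zeros((O, cols.shape[1]))
    for o in range(O):
        nz = np.flatnonzero(flat[o])          # skip zero (pruned) weights
        out[o] = flat[o, nz] @ cols[nz]
    return out.reshape(O, oh, ow)
```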
GraphR: Accelerating Graph Processing Using ReRAM
TLDR
The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be performed efficiently by a ReRAM crossbar, and it is shown that this assumption holds for a large set of graph algorithms.
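To illustrate the SpMV formulation, here is PageRank written as repeated matrix-vector multiplication in NumPy. The example uses a small dense adjacency matrix for clarity; in the accelerator setting, blocks of this matrix would be mapped onto ReRAM crossbars that perform the multiply in analog. The graph, damping factor, and iteration count are illustrative.

```python
import numpy as np

def pagerank_spmv(adj, num_iters=20, d=0.85):
    """PageRank as repeated (Sp)MV: r <- (1-d)/n + d * M @ r."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    M = (adj / np.maximum(out_deg, 1)).T        # column-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(num_iters):
        r = (1 - d) / n + d * (M @ r)           # the matrix-vector step
    return r

# Toy 4-vertex graph; every row has at least one outgoing edge.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(pagerank_spmv(adj))
```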
Vortex: Variation-aware training for memristor X-bar
TLDR
A novel variation-aware training scheme, Vortex, is invented to enhance the training robustness of memristor crossbar-based neuromorphic computing systems (NCS) by actively compensating for the impact of device variations and optimizing the mapping scheme from computations to crossbars.
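The core intuition of variation-aware training can be sketched as follows: perturb the weights with a device-variation model during the forward pass so the learned parameters stay accurate after mapping to imperfect crossbars. The lognormal-style noise model and `sigma` below are assumptions for illustration, not Vortex's exact variation model or compensation scheme.

```python
import numpy as np

def forward_with_variation(weight, x, sigma=0.1, rng=None):
    """Linear forward pass with multiplicative device-variation noise
    injected into the weights, simulating imperfect memristor conductances."""
    rng = rng or np.random.default_rng()
    noisy_w = weight * np.exp(rng.normal(0.0, sigma, size=weight.shape))
    return noisy_w @ x

# During training, the loss (and its gradients) would be computed through
# this noisy forward pass, so the learned weights remain robust when the
# network is finally mapped onto variation-prone crossbars.
W = np.random.randn(128, 64)
x = np.random.randn(64)
y = forward_with_variation(W, x)
```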
...