Federated Learning with Non-IID Data
- Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, V. Chandra
- Computer Science, ArXiv
- 2 June 2018
This work presents a strategy to improve training on non-IID data by creating a small subset of data that is globally shared among all edge devices, and shows that accuracy on the CIFAR-10 dataset can be increased by 30% with only 5% globally shared data.
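The globally shared subset idea can be sketched in plain Python; all names, class counts, and sample sizes below are hypothetical illustrations, not values from the paper:

```python
import random

random.seed(0)

# Hypothetical non-IID setup: each edge device holds samples of a single class.
num_classes = 10
local_data = {c: [(c, random.random()) for _ in range(100)] for c in range(num_classes)}

# Small globally shared subset containing a few samples of every class,
# distributed once to all devices before federated training begins.
shared = [(c, random.random()) for c in range(num_classes) for _ in range(5)]

def augment(device_data, shared_subset, fraction=1.0):
    """Append a fraction of the shared subset to one device's local data."""
    k = int(fraction * len(shared_subset))
    return device_data + shared_subset[:k]

device0 = augment(local_data[0], shared)
print(len(device0))  # 150: 100 single-class local samples + 50 shared samples
```

Each device then trains on its augmented local set, which reduces the divergence between client updates caused by skewed label distributions.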
CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs
This work presents CMSIS-NN, efficient kernels developed to maximize the performance and minimize the memory footprint of neural network (NN) applications on Arm Cortex-M processors targeted at intelligent IoT edge devices.
Hello Edge: Keyword Spotting on Microcontrollers
It is shown that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy, and the depthwise separable convolutional neural network (DS-CNN) is explored and compared against other neural network architectures.
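The memory advantage of the DS-CNN comes from factoring a standard convolution into a depthwise and a pointwise step. A quick parameter-count comparison (bias terms omitted; the layer sizes are illustrative, not taken from the paper):

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution: one k x k x c_in filter per output channel."""
    return k * k * c_in * c_out

def ds_conv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution."""
    depthwise = k * k * c_in          # one k x k spatial filter per input channel
    pointwise = 1 * 1 * c_in * c_out  # 1x1 convolution to mix channels
    return depthwise + pointwise

k, c_in, c_out = 3, 64, 64
print(standard_conv_params(k, c_in, c_out))  # 36864
print(ds_conv_params(k, c_in, c_out))        # 4672
```

For this hypothetical 3x3, 64-to-64-channel layer, the separable form needs roughly 8x fewer weights, which is what makes such models fit in microcontroller-scale memory.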
Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks
- Hardik Sharma, Jongse Park, H. Esmaeilzadeh
- Computer Science, International Symposium on Computer Architecture
- 5 December 2017
This work designs Bit Fusion, a bit-flexible accelerator that constitutes an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers, and compares it to two state-of-the-art DNN accelerators, Eyeriss and Stripes.
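The core idea of composing narrow multipliers into wider ones can be sketched in software; this is a generic decomposition of a multiply into 2-bit partial products, not the paper's hardware design:

```python
def split(x, chunk_bits, n_chunks):
    """Split an unsigned integer into little-endian chunks of chunk_bits bits."""
    mask = (1 << chunk_bits) - 1
    return [(x >> (i * chunk_bits)) & mask for i in range(n_chunks)]

def fused_multiply(a, b, chunk_bits=2, n_chunks=4):
    """Multiply two unsigned 8-bit values using only 2-bit x 2-bit partial products,
    shifted and accumulated -- a software analogue of fusing bit-level PEs."""
    acc = 0
    for i, ai in enumerate(split(a, chunk_bits, n_chunks)):
        for j, bj in enumerate(split(b, chunk_bits, n_chunks)):
            acc += (ai * bj) << ((i + j) * chunk_bits)
    return acc

print(fused_multiply(200, 123) == 200 * 123)  # True
```

In hardware, the same pool of narrow processing elements can be regrouped per layer, so a 2-bit layer uses each element independently while an 8-bit layer fuses sixteen of them, matching each DNN layer's bitwidth without wasting multiplier width.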
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
- Naveen Suda, V. Chandra, Yu Cao
- Computer Science, Symposium on Field Programmable Gate Arrays
- 21 February 2016
This work presents a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA's resource constraints such as on-chip memory, registers, computational resources, and external memory bandwidth.
Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
- Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, S. Vrudhula
- Computer Science, International Conference on Field-Programmable…
- 1 August 2016
This work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints, and demonstrates the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
PrivyNet: A Flexible Framework for Privacy-Preserving Deep Neural Network Training with A Fine-Grained Privacy Control
PrivyNet, a flexible framework to enable DNN training in the cloud while simultaneously protecting data privacy, is proposed and validated, demonstrating that PrivyNet is efficient and can help explore and optimize the trade-off between privacy loss and accuracy.
Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations
It is shown that using floating-point numbers for weights is more efficient than fixed-point representation for the same bit-width and enables compact hardware multiply-and-accumulate (MAC) unit design.
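A minimal sketch of the mixed-representation idea, with floating-point weights and activations quantized to signed 8-bit fixed point; the Q-format and bit widths are assumptions for illustration, not the paper's specific design:

```python
def quantize_activation(x, frac_bits=7, bits=8):
    """Quantize a value to signed fixed point with frac_bits fractional bits,
    saturating to the representable 8-bit range."""
    scale = 1 << frac_bits
    q = round(x * scale)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, q))

def mac(weights_fp, activations, frac_bits=7):
    """Multiply-and-accumulate: float weights x fixed-point activations."""
    scale = 1 << frac_bits
    acc = 0.0
    for w, a in zip(weights_fp, activations):
        acc += w * (quantize_activation(a, frac_bits) / scale)
    return acc
```

Keeping the weights in floating point preserves their wide dynamic range at a given bit-width, while the fixed-point activations keep the datapath narrow; only the activation side pays a quantization cost.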
Enabling Deep Learning at the IoT Edge
- Liangzhen Lai, Naveen Suda
- Computer Science, IEEE/ACM International Conference on Computer…
- 5 November 2018
This work introduces CMSIS-NN, a library of optimized software kernels that enables deployment of NNs on Cortex-M cores, and presents techniques for NN algorithm exploration to develop lightweight models suitable for resource-constrained systems.
Dream Distillation: A Data-Independent Model Compression Framework
Model compression is eminently suited for deploying deep learning on IoT devices. However, existing model compression techniques rely on access to the original or some alternate dataset. In this…