Federated Learning with Non-IID Data
TLDR
This work presents a strategy to improve training on non-IID data by creating a small subset of data which is globally shared between all the edge devices, and shows that accuracy can be increased by 30% for the CIFAR-10 dataset with only 5% globally shared data.
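The data-sharing idea in the TLDR above can be illustrated with a small sketch. This is not the paper's code: the shard construction, the 5% share fraction, and the toy labeled dataset are assumptions chosen to mirror the described setup, where each client holds a pathologically non-IID shard and the same small, uniformly drawn subset is appended to every client's local data.

```python
import random

def make_shards(data, num_clients):
    # Sort by label to create pathologically non-IID shards:
    # each client ends up seeing only a narrow slice of the label space.
    data = sorted(data, key=lambda x: x[1])
    size = len(data) // num_clients
    return [data[i * size:(i + 1) * size] for i in range(num_clients)]

def add_shared_subset(shards, data, share_frac, seed=0):
    # Globally shared subset: the same small uniform sample of the
    # full dataset is appended to every client's local shard.
    rng = random.Random(seed)
    shared = rng.sample(data, int(share_frac * len(data)))
    return [shard + shared for shard in shards], shared

# Toy dataset: (feature, label) pairs with 10 classes, CIFAR-10-like.
data = [(i, i % 10) for i in range(1000)]
shards = make_shards(data, num_clients=10)
augmented, shared = add_shared_subset(shards, data, share_frac=0.05)

# Before sharing each client covers one label; afterwards its local
# data covers (almost) the whole label space.
print(len({lbl for _, lbl in shards[0]}))
print(len({lbl for _, lbl in augmented[0]}))
```

The broadened label coverage per client is what reduces the weight divergence that hurts federated averaging on non-IID data.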
CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs
TLDR
This work presents CMSIS-NN, efficient kernels developed to maximize the performance and minimize the memory footprint of neural network (NN) applications on Arm Cortex-M processors targeted for intelligent IoT edge devices.
Hello Edge: Keyword Spotting on Microcontrollers
TLDR
It is shown that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy, and the depthwise separable convolutional neural network (DS-CNN) is explored and compared against other neural network architectures.
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
TLDR
This work presents a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA's resource constraints such as on-chip memory, registers, computational resources and external memory bandwidth.
Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks
TLDR
This work designs Bit Fusion, a bit-flexible accelerator that constitutes an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers, and compares it to two state-of-the-art DNN accelerators, Eyeriss and Stripes.
Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
TLDR
This work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model with the FPGA resource constraints, and demonstrates the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
PrivyNet: A Flexible Framework for Privacy-Preserving Deep Neural Network Training with A Fine-Grained Privacy Control
TLDR
PrivyNet, a flexible framework to enable DNN training on the cloud while simultaneously protecting data privacy, is proposed and validated, demonstrating that PrivyNet is efficient and can help explore and optimize the trade-off between privacy loss and accuracy.
Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations
TLDR
It is shown that using floating-point numbers for weights is more efficient than fixed-point representation for the same bit-width and enables compact hardware multiply-and-accumulate (MAC) unit design.
ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler
TLDR
The proposed RTL compiler, named ALAMO, is demonstrated on Altera Stratix-V GXA7 FPGA for the inference tasks of AlexNet and NiN CNN models, achieving 114.5 GOPS and 117.3 GOPS, a 1.9X improvement in throughput when compared to the OpenCL-based design. Expand
Dream Distillation: A Data-Independent Model Compression Framework
Model compression is eminently suited for deploying deep learning on IoT devices. However, existing model compression techniques rely on access to the original or some alternate dataset. In this …