• Publications
Federated Learning with Non-IID Data
This work presents a strategy to improve training on non-IID data by creating a small subset of data which is globally shared between all the edge devices, and shows that accuracy can be increased by 30% for the CIFAR-10 dataset with only 5% globally shared data.
CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs
This work presents CMSIS-NN, a set of efficient kernels developed to maximize the performance and minimize the memory footprint of neural network (NN) applications on Arm Cortex-M processors targeted at intelligent IoT edge devices.
Hello Edge: Keyword Spotting on Microcontrollers
It is shown that neural network architectures for keyword spotting can be optimized to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy, and the depthwise separable convolutional neural network (DS-CNN) is explored and compared against other neural network architectures.
Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network
This work designs Bit Fusion, a bit-flexible accelerator comprising an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers, and compares it to two state-of-the-art DNN accelerators, Eyeriss and Stripes.
PrivyNet: A Flexible Framework for Privacy-Preserving Deep Neural Network Training with A Fine-Grained Privacy Control
PrivyNet, a flexible framework that enables DNN training on the cloud while simultaneously protecting data privacy, is proposed and validated, demonstrating that PrivyNet is efficient and can help explore and optimize the trade-off between privacy loss and accuracy.
Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations
It is shown that using floating-point numbers for weights is more efficient than fixed-point representation for the same bit-width and enables compact hardware multiply-and-accumulate (MAC) unit design.
SlackProbe: A Flexible and Efficient In Situ Timing Slack Monitoring Methodology
The SlackProbe methodology is proposed, which inserts timing slack monitors like probes at a selected set of nets, including intermediate nets along critical paths, to detect impending delay failures arising from various causes, and which can be used with various preventive actions.
DDRO: A novel performance monitoring methodology based on design-dependent ring oscillators
This work develops a systematic approach to the synthesis of multiple design-dependent monitors, as well as a corresponding delay estimation method that reduces overestimation (timing margin) by up to 25% compared to the use of a single DDRO.
Synthesis and Analysis of Design-Dependent Ring Oscillator (DDRO) Performance Monitors
This work develops a systematic approach for the synthesis of multiple design-dependent monitors, as well as the corresponding calibration and delay estimation methods for design-dependent ring oscillators (DDROs), using standard-cell library gates and conventional physical implementation flows.
Co-Exploration of Neural Architectures and Heterogeneous ASIC Accelerator Designs Targeting Multiple Tasks
  • Lei Yang, Zheyu Yan, +6 authors Yiyu Shi
  • Computer Science, Engineering
  • 57th ACM/IEEE Design Automation Conference (DAC)
  • 10 February 2020
This paper builds an ASIC template set based on existing successful designs, described by their unique dataflows, so that the design space is significantly reduced, and proposes a framework, namely ASICNAS, which can simultaneously identify multiple DNN architectures and the associated heterogeneous ASIC accelerator design.