3U-EdgeAI: Ultra-Low Memory Training, Ultra-Low Bitwidth Quantization, and Ultra-Low Latency Acceleration

Yaoxing Chen, Cole Hawkins, Kaiqi Zhang, Zheng Zhang, Cong Hao
Proceedings of the 2021 on Great Lakes Symposium on VLSI
Deep neural network (DNN) based AI applications on the edge require both low-cost computing platforms and high-quality services. However, the limited memory, computing resources, and power budget of edge devices constrain the effectiveness of DNN algorithms. Developing edge-oriented AI algorithms and implementations (e.g., accelerators) is challenging. In this paper, we summarize our recent efforts for efficient on-device AI development from three aspects, including both training…

Two-Step Quantization for Low-bit Neural Networks
A simple yet effective Two-Step Quantization (TSQ) framework is proposed that decomposes the network quantization problem into two steps: code learning, for which a sparse quantization method is used, and transformation function learning based on the learned codes.
µL2Q: An Ultra-Low Loss Quantization Method for DNN Compression
This work proposes an effective method, called ultra-low loss quantization (µL2Q), that derives DNN quantization schemes from comprehensive quantitative data analysis: it transforms the original data to a data space with standard normal distribution and finds the optimal parameters to minimize the quantization loss for a targeted bit width.
Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM
This paper focuses on compressing and accelerating deep models with network weights represented by very small numbers of bits, referred to as extremely low bit neural networks, and proposes to solve this problem using extragradient and iterative quantization algorithms that lead to considerably faster convergence than conventional optimization methods.
DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
DNNBuilder, an automatic design space exploration tool to generate optimized parallelism guidelines by considering external memory access bandwidth, data reuse behaviors, FPGA resource availability, and DNN complexity, is designed and demonstrated.
Ternary neural networks for resource-efficient AI applications
This paper proposes ternary neural networks (TNNs) in order to make deep learning more resource-efficient, and designs a purpose-built hardware architecture for TNNs and implements it on FPGA and ASIC.
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture, is presented; it implements fully connected, convolutional, and pooling layers, with per-layer compute resources tailored to user-provided throughput requirements.
Scalable high-performance architecture for convolutional ternary neural networks on FPGA
This work presents a highly versatile, FPGA-friendly architecture for TNNs in which both the number of bits of the input data and the level of parallelism can be varied at synthesis time, allowing throughput to be traded for hardware resources and power consumption.
E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings
This paper attempts to conduct more energy-efficient training of CNNs, so as to enable on-device training, by dropping unnecessary computations at three complementary levels: stochastic mini-batch dropping on the data level; selective layer update on the model level; and sign prediction for low-cost, low-precision back-propagation on the algorithm level.
BinaryConnect: Training Deep Neural Networks with binary weights during propagations
BinaryConnect is introduced, a method that trains a DNN with binary weights during the forward and backward propagations while retaining the precision of the stored weights in which gradients are accumulated; it obtains near state-of-the-art results on the permutation-invariant MNIST, CIFAR-10, and SVHN.
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
The design of a BNN accelerator is presented that is synthesized from C++ to FPGA-targeted Verilog and outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.
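
The BinaryConnect entry in the list above describes propagating with binary weights while accumulating gradients in stored real-valued weights. The sketch below illustrates that idea only; it is not the authors' implementation. The linear model, the squared-error loss, the clipping of stored weights to [-1, 1], and the names `binarize` and `binaryconnect_step` are all assumptions made for illustration.

```python
import numpy as np

def binarize(w):
    """Deterministic binarization: map each real-valued weight to -1 or +1."""
    return np.where(w >= 0, 1.0, -1.0)

def binaryconnect_step(w_real, x, y, lr=0.01):
    """One gradient step for a linear model with squared-error loss (assumed setup)."""
    w_bin = binarize(w_real)             # binary weights used for propagation
    pred = x @ w_bin                     # forward pass with binary weights
    grad = x.T @ (pred - y) / len(x)     # gradient taken at the binary weights
    w_real = w_real - lr * grad          # update accumulates in the real-valued weights
    return np.clip(w_real, -1.0, 1.0)    # assumption: keep stored weights bounded

# Usage: a toy regression problem whose target weights are already binary.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
w_true = np.array([1, -1, 1, 1, -1, -1, 1, -1], dtype=float)
y = x @ w_true
w = rng.uniform(-0.1, 0.1, size=8)
for _ in range(200):
    w = binaryconnect_step(w, x, y, lr=0.05)
print(binarize(w))
```

The point of the design is that small gradient updates, which would be lost if applied directly to a ±1 weight, can accumulate in `w_real` until they are large enough to flip a sign.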