Corpus ID: 233181508

On-FPGA Training with Ultra Memory Reduction: A Low-Precision Tensor Method

@article{Zhang2021OnFPGATW,
  title={On-FPGA Training with Ultra Memory Reduction: A Low-Precision Tensor Method},
  author={Kaiqi Zhang and Cole Hawkins and Xiyuan Zhang and Cong Hao and Zheng Zhang},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.03420}
}
Various hardware accelerators have been developed for energy-efficient and real-time inference of neural networks on edge devices. However, most training is done on high-performance GPUs or servers, and the huge memory and computing costs prevent training neural networks on edge devices. This paper proposes a novel tensor-based training framework, which offers orders-of-magnitude memory reduction in the training process. We propose a novel rank-adaptive tensorized neural network model, and… 
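
As a rough, self-contained illustration of the tensorized-layer idea behind this work, the NumPy sketch below builds a small tensor-train (TT) factorized fully-connected layer, counts its parameters against the dense equivalent, and reconstructs the dense matrix only to run a forward pass. The dimensions, ranks, and the reconstruction-based matvec are illustrative choices, not details taken from the paper.

```python
import numpy as np

# Sketch of a TT-matrix factorized fully-connected layer; shapes and ranks are illustrative.
m_dims = [4, 4, 4]        # output size 4*4*4 = 64
n_dims = [8, 8, 8]        # input  size 8*8*8 = 512
ranks  = [1, 4, 4, 1]     # TT-ranks r_0..r_3 (boundary ranks are 1)

rng = np.random.default_rng(0)
# Core k has shape (r_{k-1}, m_k, n_k, r_k); these cores are the only stored weights.
cores = [rng.standard_normal((ranks[k], m_dims[k], n_dims[k], ranks[k + 1])) * 0.1
         for k in range(3)]

dense_params = np.prod(m_dims) * np.prod(n_dims)          # 64 * 512 = 32768
tt_params = sum(c.size for c in cores)                    # 128 + 512 + 128 = 768
print(f"dense: {dense_params} params, TT: {tt_params} params")

# Reconstruct the dense matrix only for demonstration; a real layer would contract
# the cores with the input activations directly and never form W.
W = cores[0]
for c in cores[1:]:
    # Merge the accumulated row/column indices with those of the next core.
    W = np.einsum('aijb,bklc->aikjlc', W, c)
    W = W.reshape(1, W.shape[1] * W.shape[2], W.shape[3] * W.shape[4], W.shape[5])
W = W.reshape(np.prod(m_dims), np.prod(n_dims))

x = rng.standard_normal(np.prod(n_dims))
y = W @ x                                                  # forward pass of the TT layer
print(y.shape)
```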

Citations

Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination

A Bayesian model is developed that supports various low-rank tensor formats and reduces neural network parameters with automatic rank determination during training, and a customized Bayesian solver is developed to train large-scale tensorized neural networks.
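
The Bayesian machinery of this paper is not reproduced here; the sketch below only illustrates the end effect of automatic rank determination with a crude heuristic that prunes TT-rank slices whose energy is negligible. The shapes, the tolerance, and the slice-norm criterion are assumptions for illustration.

```python
import numpy as np

def prune_tt_rank(core_left, core_right, tol=1e-2):
    """Drop the shared TT-rank slices whose joint energy is negligible.

    core_left: (..., r), core_right: (r, ...). A heuristic stand-in for the
    Bayesian rank determination of the cited paper.
    """
    r = core_left.shape[-1]
    energy = (np.linalg.norm(core_left.reshape(-1, r), axis=0) *
              np.linalg.norm(core_right.reshape(r, -1), axis=1))
    keep = energy > tol * energy.max()
    return core_left[..., keep], core_right[keep, ...], int(keep.sum())

rng = np.random.default_rng(0)
A = rng.standard_normal((1, 4, 8, 6))      # two adjacent TT cores sharing rank 6
B = rng.standard_normal((6, 4, 8, 1))
B[3:] *= 1e-6                              # make three rank slices nearly inactive
A2, B2, new_rank = prune_tt_rank(A, B)
print("rank 6 ->", new_rank)
```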

Hardware-Efficient Mixed-Precision CP Tensor Decomposition

A mixed-precision block stochastic gradient descent (SGD) method is proposed to reduce the costs of CP tensor decomposition, which can remarkably reduce the memory, computing, and energy costs on resource-constrained edge computing devices.
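
A hedged sketch of the block structure of such an update follows: each factor matrix of a rank-R CP model takes one gradient step in turn, with a conservative step size. The stochastic sampling and mixed-precision arithmetic of the cited method are omitted; the dimensions and rank are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 10, 12, 14, 3
A0, B0, C0 = (rng.standard_normal((d, R)) for d in (I, J, K))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)       # exact rank-3 target tensor

A, B, C = (0.1 * rng.standard_normal((d, R)) for d in (I, J, K))
for it in range(500):
    # Block A: gradient of ||T - [[A, B, C]]||^2 w.r.t. A, with a safe 1/L step size.
    G = (B.T @ B) * (C.T @ C)
    A -= (A @ G - np.einsum('ijk,jr,kr->ir', T, B, C)) / np.linalg.norm(G, 2)
    # Block B.
    G = (A.T @ A) * (C.T @ C)
    B -= (B @ G - np.einsum('ijk,ir,kr->jr', T, A, C)) / np.linalg.norm(G, 2)
    # Block C.
    G = (A.T @ A) * (B.T @ B)
    C -= (C @ G - np.einsum('ijk,ir,jr->kr', T, A, B)) / np.linalg.norm(G, 2)

err = np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(T)
print("relative reconstruction error:", err)
```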

Tensor Shape Search for Optimum Data Compression

The proposed optimization model maximizes the compression ratio of the TT decomposition given an error bound, and the method is applied to the compression of RGB images.
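
The sketch below illustrates only the search-space side of this idea: enumerating ways to fold a flattened RGB image into a higher-order tensor and comparing the TT storage each shape implies under a fixed, illustrative rank cap. The actual method selects the shape by solving an optimization subject to an error bound, which is not reproduced here.

```python
import numpy as np

def factorizations(n, max_parts, min_factor=2):
    """Unordered factorizations of n into at most max_parts factors (each >= min_factor)."""
    shapes = [[n]]
    if max_parts > 1:
        f = min_factor
        while f * f <= n:
            if n % f == 0:
                shapes += [[f] + rest for rest in factorizations(n // f, max_parts - 1, f)]
            f += 1
    return shapes

def tt_storage(shape, r):
    """Parameter count of a TT decomposition of this shape with all internal ranks = r."""
    ranks = [1] + [r] * (len(shape) - 1) + [1]
    return sum(ranks[k] * shape[k] * ranks[k + 1] for k in range(len(shape)))

N = 256 * 256 * 3                       # a flattened RGB image, as in the cited example
candidates = factorizations(N, max_parts=8)
best = min(candidates, key=lambda s: tt_storage(s, r=16))
print(len(candidates), "candidate shapes; best:", best,
      "TT storage", tt_storage(best, r=16), "vs dense", N)
```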

Large-scale and energy-efficient tensorized optical neural networks on III–V-on-silicon MOSCAP platform

A road map is laid out for implementing large-scale ONNs that match the number of synapses of electronic ANNs while offering superior energy efficiency.

Low-Rank+Sparse Tensor Compression for Neural Networks

This work proposes to combine low-rank tensor decomposition with sparse pruning in order to take advantage of both coarse and fine structure for compression in state-of-the-art architectures.
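
A minimal sketch of the low-rank + sparse split on a single weight matrix follows: a truncated SVD captures the coarse structure, and the largest-magnitude residual entries form the sparse part. The rank and sparsity budgets are illustrative, and the cited work applies the split jointly during training rather than post hoc as done here.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))         # stand-in weight matrix

r, keep = 16, 2000                          # rank budget and number of sparse entries
U, s, Vt = np.linalg.svd(W, full_matrices=False)
L = (U[:, :r] * s[:r]) @ Vt[:r]             # low-rank part (coarse structure)
R = W - L                                   # residual
thresh = np.sort(np.abs(R), axis=None)[-keep]
S = np.where(np.abs(R) >= thresh, R, 0.0)   # sparse part: largest-magnitude residuals

stored = U[:, :r].size + Vt[:r].size + r + np.count_nonzero(S)
print("dense:", W.size, "low-rank + sparse:", stored,
      "rel. error:", np.linalg.norm(W - L - S) / np.linalg.norm(W))
```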

3U-EdgeAI: Ultra-Low Memory Training, Ultra-Low Bitwidth Quantization, and Ultra-Low Latency Acceleration

A novel rank-adaptive tensorized neural network model is proposed that offers orders-of-magnitude memory reduction during training, together with an ultra-low-bitwidth quantization method for DNN model compression that achieves state-of-the-art accuracy under the same compression ratio.

References

Showing 1-10 of 23 references

BinaryConnect: Training Deep Neural Networks with binary weights during propagations

BinaryConnect is introduced, a method that trains a DNN with binary weights during the forward and backward propagations while retaining the precision of the stored weights in which gradients are accumulated; it obtains near state-of-the-art results on permutation-invariant MNIST, CIFAR-10, and SVHN.
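
A minimal sketch of one BinaryConnect-style step for a single linear layer follows: the forward and backward passes use sign-binarized weights, while the gradient update and clipping are applied to the stored full-precision weights. The layer sizes, learning rate, and upstream-gradient stand-in are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W_real = 0.1 * rng.standard_normal((32, 64))   # full-precision "master" weights
x = rng.standard_normal((8, 64))               # a mini-batch of input activations
g_y = rng.standard_normal((8, 32))             # stand-in for the upstream gradient dL/dy

W_bin = np.sign(W_real)                        # binarize the weights for this step
W_bin[W_bin == 0] = 1.0
y = x @ W_bin.T                                # forward pass uses the binary weights
g_x = g_y @ W_bin                              # gradient to earlier layers also uses them
g_W = g_y.T @ x                                # gradient with respect to the weights
W_real -= 0.01 * g_W                           # ...but it is accumulated into the real weights
W_real = np.clip(W_real, -1.0, 1.0)            # BinaryConnect clips the stored weights
```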

Tensorizing Neural Networks

This paper converts the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a huge factor and at the same time the expressive power of the layer is preserved.

Ultra-Low Precision 4-bit Training of Deep Neural Networks

A novel adaptive Gradient Scaling technique (GradScale) is explored that addresses the challenges of insufficient range and resolution in quantized gradients, and the impact of quantization errors observed during model training is analyzed.
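
The sketch below conveys only the general idea of scaling gradients into the representable range of a low-bit format before quantizing and rescaling; the uniform INT4 quantizer is a stand-in for the radix-4 FP4 format and per-layer adaptive scaling used in the cited paper.

```python
import numpy as np

def quantize_grad_int4(g):
    scale = np.max(np.abs(g)) / 7.0 + 1e-12      # map the largest value to the int4 limit
    q = np.clip(np.round(g / scale), -8, 7)      # 4-bit signed integer codes
    return q * scale                             # dequantized gradient used by the update

rng = np.random.default_rng(0)
g = 1e-3 * rng.standard_normal(10000)            # gradients are typically tiny...
print("scaled error:", np.linalg.norm(g - quantize_grad_int4(g)) / np.linalg.norm(g))
naive = np.clip(np.round(g), -8, 7)              # ...so quantizing without scaling loses them
print("no-scale error:", np.linalg.norm(g - naive) / np.linalg.norm(g))
```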

Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination

A Bayesian model is developed that supports various low-rank tensor formats and reduces neural network parameters with automatic rank determination during training, and a customized Bayesian solver is developed to train large-scale tensorized neural networks.

AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs

The proposed AutoDNNchip is a DNN chip generator that can automatically produce both FPGA- and ASIC-based DNN chip implementations from DNNs developed by machine learning frameworks, without humans in the loop, and can achieve better performance than expert-crafted state-of-the-art FPGA and ASIC designs.

T-DLA: An Open-source Deep Learning Accelerator for Ternarized DNN Models on Embedded FPGA

This paper proposes a systematic solution to deploy DNNs on embedded FPGAs, which includes a ternarized hardware Deep Learning Accelerator (T-DLA), and a framework for ternary neural network (TNN) training that can significantly compress the DNN parameters down to two bits with little accuracy drop.
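
A hedged sketch of threshold-based weight ternarization (values in {-α, 0, +α}, storable in two bits) follows; the threshold and scaling heuristic are illustrative and may differ from the exact T-DLA training recipe.

```python
import numpy as np

def ternarize(W, t=0.05):
    """Map weights to {-alpha, 0, +alpha}; alpha is a per-tensor scale (heuristic)."""
    mask = np.abs(W) > t
    alpha = np.abs(W[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(W) * mask

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((64, 64))
Wt = ternarize(W)
print(np.unique(np.sign(Wt)))        # ternary codes need only 2 bits per weight
```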

FPGA/DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge

Results show that the proposed DNN model and accelerator outperform the state-of-the-art FPGA designs in all aspects including Intersection-over-Union (IoU) and energy efficiency.

Compression and Interpretability of Deep Neural Networks via Tucker Tensor Layer: From First Principles to Tensor Valued Back-Propagation

This work introduces a novel and efficient framework for exploiting the multi-way nature of the weight tensor in order to dramatically reduce the number of DNN parameters, and derives the tensor-valued back-propagation algorithm within the TTL framework by extending the notion of matrix derivatives to tensors.
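
The sketch below shows a truncated Tucker decomposition of a convolution weight tensor via the higher-order SVD, which conveys where the parameter savings come from; the cited Tucker Tensor Layer instead trains the core and factors directly with its tensor-valued back-propagation, which is not reproduced here. Shapes and ranks are illustrative.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_hosvd(T, ranks):
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])                      # leading left singular vectors
    core = T
    for mode, U in enumerate(factors):                # core = T x_1 U1^T x_2 U2^T ...
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32, 3, 3))               # (out, in, kH, kW) conv kernel
core, factors = tucker_hosvd(W, ranks=(16, 16, 3, 3))
stored = core.size + sum(U.size for U in factors)
print("dense:", W.size, "Tucker:", stored)
```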

Tensorized Embedding Layers for Efficient Model Compression

This work introduces a novel way of parametrizing embedding layers based on the Tensor Train (TT) decomposition, which allows compressing the model significantly at the cost of a negligible drop or even a slight gain in performance.
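
A minimal sketch of the TT-embedding idea follows: the vocabulary and embedding dimensions are factorized, the table is stored as TT-matrix cores, and one row is recovered by multiplying a single slice from each core. The factorized dimensions and ranks are illustrative.

```python
import numpy as np

v_dims, e_dims = [50, 40, 50], [4, 8, 4]        # vocab 100000, embedding dim 128
ranks = [1, 8, 8, 1]
rng = np.random.default_rng(0)
cores = [0.1 * rng.standard_normal((ranks[k], v_dims[k], e_dims[k], ranks[k + 1]))
         for k in range(3)]

def embed(token_id):
    # Convert the flat row index into a multi-index over v_dims (mixed radix).
    idx, rest = [], token_id
    for v in reversed(v_dims):
        idx.append(rest % v)
        rest //= v
    idx = idx[::-1]
    # Multiply one slice from each core; the result carries the factorized embedding axes.
    vec = cores[0][:, idx[0]]                              # shape (1, e1, r1)
    for k in range(1, 3):
        vec = np.tensordot(vec, cores[k][:, idx[k]], axes=(vec.ndim - 1, 0))
    return vec.reshape(-1)                                 # the 128-dim embedding row

print(embed(12345).shape, "params:", sum(c.size for c in cores), "vs dense:", 100000 * 128)
```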

Tensor-Train Decomposition

The new form gives a clear and convenient way to implement all basic operations efficiently, and the efficiency is demonstrated by the computation of the smallest eigenvalue of a 19-dimensional operator.
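
A compact sketch of the TT-SVD procedure described in this paper follows: sweep over the modes, reshaping the remainder into a matrix and taking a truncated SVD at each step. For simplicity the truncation rank is fixed here rather than chosen from an error bound, and the test tensor is random.

```python
import numpy as np

def tt_svd(T, max_rank):
    dims, cores, r_prev = T.shape, [], 1
    M = T.reshape(1, -1)
    for n in dims[:-1]:
        M = M.reshape(r_prev * n, -1)                 # (r_{k-1} * n_k, rest of the modes)
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, n, r))  # k-th TT core
        M = s[:r, None] * Vt[:r]                      # carry the remainder to the next mode
        r_prev = r
    cores.append(M.reshape(r_prev, dims[-1], 1))      # last core
    return cores

rng = np.random.default_rng(0)
T = rng.standard_normal((8, 9, 10, 11))
cores = tt_svd(T, max_rank=5)

# Rebuild the tensor from the cores to check shapes and approximation quality.
R = cores[0]
for c in cores[1:]:
    R = np.tensordot(R, c, axes=(R.ndim - 1, 0))
R = R.reshape(T.shape)
print([c.shape for c in cores], np.linalg.norm(T - R) / np.linalg.norm(T))
```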