• Corpus ID: 16349374

Training deep neural networks with low precision multiplications

@article{Courbariaux2014TrainingDN,
  title={Training deep neural networks with low precision multiplications},
  author={Matthieu Courbariaux and Yoshua Bengio and Jean-Pierre David},
  journal={arXiv: Learning},
  year={2014}
}
Multipliers are the most space- and power-hungry arithmetic operators in digital implementations of deep neural networks. […] For example, it is possible to train Maxout networks with 10-bit multiplications.
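To make the headline result concrete, here is a minimal NumPy sketch of what a low-precision multiplication can look like: both operands are rounded to a fixed-point grid before multiplying. The 10-bit width, 7 fractional bits and round-to-nearest mode below are illustrative assumptions, not the paper's exact training recipe.

```python
import numpy as np

def quantize_fixed_point(x, total_bits=10, frac_bits=7):
    """Round x to a signed fixed-point grid with `frac_bits` fractional bits
    and saturate to the range representable with `total_bits` bits."""
    scale = 2.0 ** frac_bits
    max_int = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), -max_int - 1, max_int)
    return q / scale

def low_precision_multiply(a, b, total_bits=10, frac_bits=7):
    """Emulate a low-precision multiplier: quantize both operands first."""
    return quantize_fixed_point(a, total_bits, frac_bits) * \
           quantize_fixed_point(b, total_bits, frac_bits)

# Example: the quantized product stays close to the full-precision one.
a, b = np.random.randn(4), np.random.randn(4)
print(a * b)
print(low_precision_multiply(a, b))
```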

Deep Neural Network Training without Multiplications
TLDR
It is shown that ResNet can be trained using an integer-add instruction in place of a floating-point multiplication instruction with competitive classification accuracy, which will enable eliminating multiplications in deep neural-network training and inference.
Low-Precision Floating-Point Schemes for Neural Network Training
TLDR
A simplified model is introduced in which both the outputs and the gradients of the neural network are constrained to power-of-two values, using just 7 bits for their representation, which significantly reduces the training time as well as the energy consumption and memory requirements during the training and inference phases.
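As a hedged illustration of constraining values to powers of two, the sketch below snaps each element to a signed power of two with a clipped integer exponent; the exponent range is an assumed example, not the 7-bit scheme of the paper.

```python
import numpy as np

def quantize_power_of_two(x, exp_min=-8, exp_max=0):
    """Snap each element to sign(x) * 2**e, where e is the nearest integer
    exponent clipped to [exp_min, exp_max]; zeros stay zero."""
    mag = np.abs(x)
    e = np.round(np.log2(np.maximum(mag, 2.0 ** exp_min)))
    e = np.clip(e, exp_min, exp_max)
    return np.where(mag == 0, 0.0, np.sign(x) * 2.0 ** e)

print(quantize_power_of_two(np.array([0.3, -0.05, 1.7, 0.0])))
# 0.3 -> 0.25, -0.05 -> -0.0625, 1.7 -> 1.0 (clipped to exp_max), 0.0 -> 0.0
```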
Hardware-software codesign of accurate, multiplier-free Deep Neural Networks
TLDR
This work proposes a novel approach to map floating-point-based DNNs to 8-bit dynamic fixed-point networks with integer power-of-two weights, with no change in network architecture, and proposes a hardware accelerator design that achieves low-power, low-latency inference with insignificant degradation in accuracy.
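With integer power-of-two weights, each multiplication can collapse to an arithmetic shift of an integer activation. A rough sketch under the assumption of 8-bit integer activations (the function and dtypes are illustrative, not the accelerator's actual data path):

```python
import numpy as np

def shift_multiply(acts_q, weight_exps):
    """Multiply integer activations by power-of-two weights 2**e using
    arithmetic shifts (left for e >= 0, right for e < 0) instead of multipliers.
    Note: the right shift floors toward negative infinity."""
    acts_q = acts_q.astype(np.int32)
    return np.where(weight_exps >= 0,
                    np.left_shift(acts_q, np.maximum(weight_exps, 0)),
                    np.right_shift(acts_q, np.maximum(-weight_exps, 0)))

acts = np.array([12, -7, 100, 3], dtype=np.int8)   # assumed 8-bit activations
exps = np.array([1, -2, 0, 3])                      # weights 2, 0.25, 1, 8
print(shift_multiply(acts, exps))                   # -> [ 24  -2 100  24]
```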
BinaryConnect: Training Deep Neural Networks with binary weights during propagations
TLDR
BinaryConnect is introduced, a method that trains a DNN with binary weights during the forward and backward propagations while retaining the precision of the stored weights in which the gradients are accumulated; near state-of-the-art results are obtained with BinaryConnect on permutation-invariant MNIST, CIFAR-10 and SVHN.
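The mechanism described in this TLDR can be sketched in a few lines: binary weights are used for the propagations, while updates accumulate in real-valued weights. The single linear layer, squared loss, plain SGD step and deterministic sign binarization below are simplifying assumptions.

```python
import numpy as np

def binarize(w):
    """Deterministic sign binarization used during forward/backward propagation."""
    return np.where(w >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
W_real = rng.normal(scale=0.1, size=(4, 3))   # real-valued weights accumulate the updates
x = rng.normal(size=(8, 4))
y = rng.normal(size=(8, 3))
lr = 0.01

for step in range(100):
    Wb = binarize(W_real)                # binary weights for the propagations
    out = x @ Wb                         # forward pass with binary weights
    grad_out = 2.0 * (out - y) / len(x)  # gradient of a mean-squared loss
    grad_W = x.T @ grad_out              # backward pass through the binary forward
    W_real -= lr * grad_W                # ...but the update lands on the real weights
    W_real = np.clip(W_real, -1.0, 1.0)  # keep the real weights inside [-1, 1]
```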
Deep Learning with Limited Numerical Precision
TLDR
The results show that deep networks can be trained using only a 16-bit wide fixed-point number representation when stochastic rounding is used, and incur little to no degradation in classification accuracy.
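Stochastic rounding rounds a value up or down with probability proportional to its distance from the two neighbouring grid points, so the quantization error is zero in expectation. A small sketch on an assumed 16-bit fixed-point grid with 8 fractional bits:

```python
import numpy as np

def stochastic_round_fixed(x, frac_bits=8, total_bits=16, rng=np.random.default_rng(0)):
    """Stochastically round x onto a fixed-point grid with 2**-frac_bits spacing."""
    scale = 2.0 ** frac_bits
    scaled = x * scale
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    up = rng.random(np.shape(x)) < (scaled - floor)
    q = floor + up
    max_int = 2 ** (total_bits - 1) - 1
    return np.clip(q, -max_int - 1, max_int) / scale

x = np.full(100000, 0.3)
print(stochastic_round_fixed(x).mean())   # close to 0.3: the rounding is unbiased
```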
Low-Precision Batch-Normalized Activations
TLDR
This work introduces a quantization scheme that is compatible with training very deep neural networks, and shows how quantizing the network activations in the middle of each batch-normalization module can greatly reduce the amount of memory and computational power needed.
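A hedged sketch of quantizing activations inside a batch-normalization module: normalize per feature, quantize the normalized value on a uniform grid, then apply the affine scale and shift. The 4-bit quantizer, the clipping range and the exact insertion point are assumptions for illustration, not the paper's scheme.

```python
import numpy as np

def bn_with_quantized_activation(x, gamma, beta, bits=4, eps=1e-5):
    """Batch-normalize x per feature, quantize the normalized activation to
    `bits` bits on a uniform grid over [-4, 4], then apply scale and shift."""
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    levels = 2 ** bits - 1
    x_clipped = np.clip(x_hat, -4.0, 4.0)
    x_q = np.round((x_clipped + 4.0) / 8.0 * levels) / levels * 8.0 - 4.0
    return gamma * x_q + beta

x = np.random.randn(32, 16)                      # a batch of 32 examples, 16 features
out = bn_with_quantized_activation(x, gamma=np.ones(16), beta=np.zeros(16))
print(out.shape, np.unique(out[:, 0]).size)      # at most 2**bits distinct values per feature
```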
Handwritten Digit Classification using 8-bit Floating Point based Convolutional Neural Networks
TLDR
This paper presents an approach that uses reduced-precision (8-bit) floating points for training the handwritten-character classifier LeNet-5, achieving 97.10% accuracy while reducing the overall space complexity by 75% in comparison to a model using single-precision floating points.
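The entry does not spell out the 8-bit float format; one common picture is a 1-4-3 split (sign, exponent, mantissa). The sketch below emulates that kind of precision by rounding the mantissa to 3 bits and bounding the exponent, purely as an assumed illustration.

```python
import numpy as np

def to_float8(x, mant_bits=3, exp_min=-6, exp_max=7):
    """Emulate a reduced-precision float (assumed 1-4-3-style split):
    keep `mant_bits` mantissa bits, saturate large exponents, flush tiny values."""
    mant, exp = np.frexp(x)                          # x = mant * 2**exp, 0.5 <= |mant| < 1
    mant = np.round(mant * 2.0 ** mant_bits) / 2.0 ** mant_bits
    y = np.ldexp(mant, np.minimum(exp, exp_max))     # saturate the exponent from above
    return np.where(exp < exp_min, 0.0, y)           # underflow: flush to zero

x = np.array([0.1234, -3.7, 42.0, 1e-9])
print(to_float8(x))   # coarse approximations of the inputs; 1e-9 flushes to 0
```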
Minimizing Power for Neural Network Training with Logarithm-Approximate Floating-Point Multiplier
This paper proposes to adopt a logarithm-approximate multiplier (LAM) for multiply-accumulate (MAC) computation in a neural network (NN) training engine, where LAM approximates a floating-point multiplication […]
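A logarithm-approximate multiplier replaces the multiplication by an addition in the log domain. Mitchell's classic approximation (log2(1+f) ≈ f) is one common instance, sketched below as an assumed stand-in rather than the exact LAM of the paper.

```python
import numpy as np

def mitchell_approx_multiply(a, b):
    """Approximate a*b by adding Mitchell-style log2 approximations and
    converting back with the same piecewise-linear approximation."""
    def approx_log2(x):
        m, e = np.frexp(x)            # x = m * 2**e with 0.5 <= m < 1
        return (e - 1) + (2 * m - 1)  # rewrite as (2m) * 2**(e-1), so log2 ~ (e-1) + (2m-1)
    def approx_exp2(l):
        e = np.floor(l)
        return (1.0 + (l - e)) * 2.0 ** e
    return np.sign(a) * np.sign(b) * approx_exp2(approx_log2(np.abs(a)) + approx_log2(np.abs(b)))

print(mitchell_approx_multiply(np.array([3.0]), np.array([5.0])))  # ~14, vs exact 15
```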
Quantization of Constrained Processor Data Paths Applied to Convolutional Neural Networks
TLDR
A layer-wise quantization heuristic to find a good fixed-point network approximation for platforms without wide accumulation registers is proposed, and it is demonstrated that 16-bit accumulators are able to obtain a Top-1 classification accuracy within 1% of the floating-point baselines on the CIFAR-10 and ILSVRC2012 image classification benchmarks.
...

References

SHOWING 1-10 OF 36 REFERENCES
Deep Learning with Limited Numerical Precision
TLDR
The results show that deep networks can be trained using only a 16-bit wide fixed-point number representation when stochastic rounding is used, and incur little to no degradation in classification accuracy.
Improving the speed of neural networks on CPUs
TLDR
This paper uses speech recognition as an example task and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large-vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.
The Impact of Arithmetic Representation on Implementing MLP-BP on FPGAs: A Study
TLDR
The results show that an MLP-BP network uses fewer clock cycles and consumes less real estate when compiled in a fixed-point (FXP) format than a larger, slower compilation in a floating-point (FLP) format of similar data-representation width in bits, or of similar precision and range.
Backpropagation without Multiplication
The backpropagation algorithm has been modified to work without any multiplications and to tolerate computations with a low resolution, which makes it more attractive for a hardware implementation.
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
DaDianNao: A Machine-Learning Supercomputer
  • Yunji Chen, Tao Luo, O. Temam · Computer Science · 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
TLDR
This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.
A highly scalable Restricted Boltzmann Machine FPGA implementation
TLDR
This paper describes a novel architecture and FPGA implementation that accelerates the training of general RBMs in a scalable manner, with the goal of producing a system that machine learning researchers can use to investigate ever-larger networks.
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning
TLDR
This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.
A fixed point implementation of the backpropagation learning algorithm
TLDR
The convergence results for a test example using fixed point, floating point and hardware implementations of the backpropagation algorithm are presented.
Stochastic Pooling for Regularization of Deep Convolutional Neural Networks
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution given by the activities within the pooling region.
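Stochastic pooling samples one activation per pooling region with probability proportional to its (non-negative) value, instead of taking the max or the mean. A minimal 1-D sketch (the region size and the zero-region fallback are illustrative choices):

```python
import numpy as np

def stochastic_pool_1d(acts, region=2, rng=np.random.default_rng(0)):
    """Pool non-negative activations by sampling one index per region from a
    multinomial with probabilities proportional to the activations."""
    acts = acts.reshape(-1, region)
    sums = acts.sum(axis=1, keepdims=True)
    # Probabilities proportional to the activations; uniform if a region is all zero.
    probs = np.where(sums > 0, acts / np.where(sums > 0, sums, 1.0), 1.0 / region)
    idx = np.array([rng.choice(region, p=p) for p in probs])
    return acts[np.arange(len(acts)), idx]

a = np.array([0.1, 0.9, 0.0, 0.0, 2.0, 2.0])   # e.g. post-ReLU activations
print(stochastic_pool_1d(a))   # first region returns 0.9 with probability 0.9
```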
...