8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi
Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the advantages of floating-point over fixed-point representation, and present an in-depth study on the use of 8-bit floating-point number formats for activations, weights, and gradients for both training and inference. We explore the effect of different bit… 

FP8 Formats for Deep Learning

This paper proposes an 8-bit FP8 binary interchange format consisting of two encodings - E4M3 and E5M2 - and demonstrates the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
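
As an illustration of how the two encodings split the 8 bits, here is a minimal Python sketch that decodes only the finite values of a sign/exponent/mantissa bit pattern. The function names are hypothetical, and the exponent biases (7 for E4M3, 15 for E5M2) are assumptions not stated in the summary above. Infinities and NaNs are deliberately left out, since the two encodings do not handle them the same way.

```python
def decode_minifloat(byte, exp_bits, man_bits, bias):
    """Decode a finite 8-bit sign/exponent/mantissa pattern to a Python float.

    Illustrative only: special values (infinity, NaN) are NOT handled.
    """
    assert 1 + exp_bits + man_bits == 8
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading one, exponent fixed at 1 - bias.
        return sign * frac * 2.0 ** (1 - bias)
    # Normal: implicit leading one.
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


def decode_e4m3(byte):
    # Assumed layout: 1 sign, 4 exponent, 3 mantissa bits, bias 7.
    return decode_minifloat(byte, exp_bits=4, man_bits=3, bias=7)


def decode_e5m2(byte):
    # Assumed layout: 1 sign, 5 exponent, 2 mantissa bits, bias 15.
    return decode_minifloat(byte, exp_bits=5, man_bits=2, bias=15)


print(decode_e4m3(0x7E))  # 448.0
print(decode_e5m2(0x7B))  # 57344.0
```

Under these assumptions the trade-off is visible directly: E4M3 spends a bit on precision and tops out at 448, while E5M2 spends it on range and reaches 57344.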

Climate Change Modelling at Reduced Float Precision with Stochastic Rounding

Reduced precision floating point arithmetic is now routinely deployed in numerical weather forecasting over short timescales. However, the applicability of these reduced precision techniques to longer…

Mixed Precision Training With 8-bit Floating Point

This paper proposes a method to train deep neural networks using an 8-bit floating point representation for weights, activations, errors, and gradients, together with an enhanced loss scaling method that compensates for the reduced subnormal range of 8-bit floating point and improves error propagation.
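
To show why loss scaling helps when small gradients fall below the representable range, here is a toy Python sketch of plain constant loss scaling. It is not the enhanced method the paper proposes. The low-precision format is mimicked by a hypothetical `flush_to_zero` helper, and the threshold `MIN_MAG` is an arbitrary assumed value rather than a property of any real 8-bit format.

```python
def flush_to_zero(x, min_magnitude):
    """Toy stand-in for a low-precision format: tiny values become zero."""
    return 0.0 if abs(x) < min_magnitude else x


def scaled_gradients(grads, loss_scale, min_magnitude):
    """Scale gradients, pass them through the toy format, then unscale."""
    low_precision = [flush_to_zero(g * loss_scale, min_magnitude) for g in grads]
    return [g / loss_scale for g in low_precision]


MIN_MAG = 2.0 ** -16          # hypothetical smallest representable magnitude
grads = [1e-3, 1e-6, 1e-7]    # made-up gradient values

# Without scaling, the two small gradients are flushed to zero.
unscaled = scaled_gradients(grads, 1.0, MIN_MAG)
# A power-of-two scale moves them into range; dividing it out afterwards is exact.
scaled = scaled_gradients(grads, 1024.0, MIN_MAG)
```

In this sketch `unscaled` keeps only the first gradient, while `scaled` recovers all three.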

Rethinking floating point for deep learning

This work improves floating point to be more energy efficient than equivalent bit width integer hardware on a 28 nm ASIC process while retaining accuracy in 8 bits with a novel hybrid log multiply/linear add, Kulisch accumulation and tapered encodings from Gustafson's posit format.

DLFloat: A 16-b Floating Point Format Designed for Deep Learning Training and Inference

DLFloat is an optimized 16-bit format with 6 exponent bits and 9 fraction bits, derived from a study of the range of values encountered in DL applications. It preserves the accuracy of DL networks and enables a compact, power-efficient computation engine.

Scalable Methods for 8-bit Training of Neural Networks

This work is the first to quantize the weights, activations, and a substantial volume of the gradient stream in all layers (including batch normalization) to 8 bits while showing state-of-the-art results on the ImageNet-1K dataset.

Mixed Precision Training

This work introduces a technique to train deep neural networks using half precision floating point numbers, and demonstrates that this approach works for a wide variety of models, including convolutional neural networks, recurrent neural networks, and generative adversarial networks.

BinaryConnect: Training Deep Neural Networks with binary weights during propagations

This work introduces BinaryConnect, a method that trains a DNN with binary weights during the forward and backward propagations while retaining the precision of the stored weights in which gradients are accumulated, and obtains near state-of-the-art results on permutation-invariant MNIST, CIFAR-10, and SVHN.
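
The idea of binary weights in propagation with full-precision accumulation can be sketched on a single linear unit. The Python code below is a minimal illustration under stated assumptions: deterministic sign binarization, a squared-error loss, and clipping of the stored weights to [-1, 1]. The function names and the learning rate are hypothetical, not taken from the paper.

```python
def binarize(weights):
    """Deterministic binarization: map each real weight to +1 or -1."""
    return [1.0 if w >= 0.0 else -1.0 for w in weights]


def binaryconnect_step(weights, x, target, lr=0.1):
    """One SGD step on a linear unit with loss 0.5 * (y - target) ** 2."""
    wb = binarize(weights)
    # Forward pass uses the binary weights only.
    y = sum(w * xi for w, xi in zip(wb, x))
    err = y - target
    # Backward pass: gradient with respect to the binary weights.
    grads = [err * xi for xi in x]
    # The update accumulates into the stored real-valued weights.
    # Clipping to [-1, 1] is an assumption of this sketch.
    return [max(-1.0, min(1.0, w - lr * g)) for w, g in zip(weights, grads)]


real_weights = [0.2, -0.7]
real_weights = binaryconnect_step(real_weights, x=[1.0, 2.0], target=0.0)
```

Small updates that would never flip a binary weight on their own still accumulate in `real_weights`, which is why the stored copy keeps its precision.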

Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks

This work proposes a hybrid FP8 (HFP8) format and DNN end-to-end distributed training procedure and demonstrates, using HFP8, the successful training of deep learning models across a whole spectrum of applications including Image Classification, Object Detection, Language and Speech without accuracy degradation.

Training deep neural networks with low precision multiplications

This work finds that very low precision is sufficient not just for running trained networks but also for training them, and that Maxout networks can be trained with 10-bit multiplications.

Mixed Precision Training of Convolutional Neural Networks using Integer Operations

This work trains state-of-the-art visual understanding neural networks on the ImageNet-1K dataset using integer operations on General Purpose (GP) hardware. It proposes a shared exponent representation of tensors and develops a Dynamic Fixed Point (DFP) scheme suitable for common neural network operations.
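
A shared exponent representation can be illustrated with a small Python sketch: every element of a tensor is stored as a signed integer, and one exponent is stored for the whole tensor. The function names, the 8-bit mantissa width, and the rule for picking the exponent from the largest magnitude are assumptions of this sketch, not details of the paper's DFP scheme.

```python
import math


def to_shared_exponent(values, mantissa_bits=8):
    """Quantize floats to signed integers that share a single exponent."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    limit = (1 << (mantissa_bits - 1)) - 1      # e.g. 127 for 8 bits
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so max_abs < 2**e.
    shared_exp = math.frexp(max_abs)[1] - (mantissa_bits - 1)
    scale = math.ldexp(1.0, shared_exp)         # 2 ** shared_exp
    ints = [max(-limit, min(limit, round(v / scale))) for v in values]
    return ints, shared_exp


def from_shared_exponent(ints, shared_exp):
    """Reconstruct floats from integers and their shared exponent."""
    return [math.ldexp(float(i), shared_exp) for i in ints]


ints, exp = to_shared_exponent([1.0, 0.5, -0.25, 0.0])
restored = from_shared_exponent(ints, exp)
```

Because the exponent is set by the largest element, small elements in a tensor with a wide dynamic range lose precision, which is the trade-off a dynamic scheme has to manage.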

Deep Learning with Limited Numerical Precision

The results show that deep networks can be trained using only a 16-bit wide fixed-point number representation with stochastic rounding, with little to no degradation in classification accuracy.
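
Stochastic rounding to a fixed-point grid can be sketched in a few lines of Python. A value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounding is unbiased in expectation. The function name, the split into `frac_bits` and `total_bits`, and the saturation behaviour are assumptions of this sketch.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits, total_bits=16, rng=random):
    """Stochastically round x to a signed fixed-point value.

    The grid spacing is 2**-frac_bits; results saturate to the range
    of a signed integer with total_bits bits.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance from the lower point.
    rounded = lower + (1 if rng.random() < scaled - lower else 0)
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    rounded = max(lo, min(hi, rounded))
    return rounded / scale


rng = random.Random(0)
samples = [stochastic_round_fixed(0.3, 8, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)   # close to 0.3 on average
```

Each sample is either 76/256 or 77/256, yet the mean stays close to 0.3, whereas round-to-nearest would always return 77/256 and carry a fixed bias.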