Corpus ID: 222310619

Revisiting BFloat16 Training

@article{Zamirai2020RevisitingBT,
  title={Revisiting BFloat16 Training},
  author={Pedram Zamirai and Jian Zhang and Christopher R. Aberger and Christopher De Sa},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.06192}
}
State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, creating the folklore that 16-bit precision alone is not enough to maximize model accuracy. As a result, deep learning accelerators are forced to support both 16-bit and 32-bit compute units, which is more costly for hardware design than using only 16-bit units. We ask whether we can do pure 16-bit training, which requires only 16-bit compute units, while still matching the model accuracy attained by 32…
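For intuition about the distinction the abstract draws, here is a minimal PyTorch sketch contrasting the usual mixed 16/32-bit recipe with pure bfloat16 training. It is illustrative only, not the paper's method; the toy model, data shapes, and hyperparameters are made up.

```python
import torch
import torch.nn as nn

# Hypothetical toy model and data, for illustration only.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(8, 32), torch.randn(8, 10)

# Mixed 16/32-bit: float32 master weights and optimizer state,
# bfloat16 compute via autocast.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = (model(x) - y).pow(2).mean()
loss.backward()        # gradients accumulate into the float32 parameters
opt.step()

# "Pure" 16-bit: weights, activations, and gradients all in bfloat16,
# which is what would let an accelerator drop its 32-bit compute units.
model_bf16 = model.to(torch.bfloat16)
loss_bf16 = (model_bf16(x.to(torch.bfloat16)) - y.to(torch.bfloat16)).pow(2).mean()
loss_bf16.backward()
```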

FPnew: An Open-Source Multiformat Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing

FPnew is presented, a highly configurable open-source transprecision floating-point unit (TP-FPU) capable of supporting a wide range of standard and custom FP formats; it is integrated into a 64-bit RISC-V core, supporting five FP formats on scalars or 2-, 4-, or 8-way SIMD vectors.

Low-Precision Reinforcement Learning

This paper proposes a set of six modifications, all straightforward to implement, that leave the underlying agent unchanged but dramatically improve its numerical stability; the resulting agent has lower memory and compute requirements while matching full-precision rewards, demonstrating the feasibility of low-precision RL.

Mixing Low-Precision Formats in Multiply-Accumulate Units for DNN Training

The most compute-intensive stage of deep neural network (DNN) training is matrix multiplication, where the multiply-accumulate (MAC) operator is key, so the impact of fixed- versus floating-point representations, multiplier rounding, and floating-point exceptional-value support is investigated.

Design of Synthesis-time Vectorized Arithmetic Hardware for Tapered Floating-point Addition and Subtraction

The design of a vectorized floating-point adder/subtractor that supports arbitrary-length floating-point formats with varying exponent and mantissa widths is proposed in this paper.

Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold

This work reimplemented AlphaFold and AlphaFold-Multimer in the PyTorch framework, reproduced their from-scratch training processes with equivalent or better accuracy, and presents Uni-Fold as a thoroughly open-source platform for developing protein folding models beyond AlphaFold.

Low-Precision Arithmetic for Fast Gaussian Processes

This approach improves the numerical stability and practical performance of conjugate gradients in low precision over a wide range of settings, enabling GPs to train on 1.8 million data points in 10 hours on a single GPU without requiring any sparse approximations.

Precision- and Accuracy-Reconfigurable Processor Architectures—An Overview

This tutorial brief gives an overview of existing processor solutions that are reconfigurable or tunable in precision or accuracy of computations, and investigates several application domains, including neural network processing, linear algebra, and approximate computing, where such emerging processor architectures can be beneficially used.

Stochastic rounding: implementation, error analysis and applications

This survey discusses the mathematical properties and probabilistic error analysis of stochastic rounding (SR), its implementation, and its use in applications, with a focus on machine learning and the numerical solution of differential equations.
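As a worked illustration of the unbiasedness property the survey analyzes, here is a small NumPy sketch of stochastic rounding onto a uniform grid with step `delta` (the function name and grid are illustrative, not taken from the survey):

```python
import numpy as np

def stochastic_round(x, delta, rng=np.random.default_rng(0)):
    """Round x onto the grid {k * delta} stochastically: round up with
    probability equal to the fractional position between grid points,
    so that E[stochastic_round(x)] == x (unbiased, unlike round-to-nearest)."""
    scaled = np.asarray(x, dtype=np.float64) / delta
    low = np.floor(scaled)
    prob_up = scaled - low                    # distance to the lower grid point
    round_up = rng.random(size=scaled.shape) < prob_up
    return (low + round_up) * delta

x = np.full(100_000, 0.3)
print(stochastic_round(x, delta=1.0).mean())  # ~0.3; round-to-nearest would give 0.0
```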

Quantization of Weights of Neural Networks with Negligible Decreasing of Prediction Accuracy

A design approach for the memoryless Laplacian source with zero mean and unit variance is presented, which is based on an iterative rule and uses minimal mean-squared error distortion as the performance criterion.

Design of a 2-Bit Neural Network Quantizer for Laplacian Source

A 2-bit uniform quantization model for the Laplacian source is designed; it is competitive with other quantization solutions of almost optimal precision while leading to shorter processing time and faster inference.
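To make the design problem in the two entries above concrete, here is a back-of-the-envelope NumPy sketch (not the authors' procedure) that grid-searches the step size of a symmetric 4-level (2-bit) uniform quantizer minimizing mean-squared error for a zero-mean, unit-variance Laplacian source:

```python
import numpy as np

rng = np.random.default_rng(0)
# Zero-mean, unit-variance Laplacian: scale b = 1/sqrt(2) since Var = 2 * b**2.
samples = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2), size=500_000)

def quantize_2bit_uniform(x, step):
    """Symmetric 4-level uniform quantizer with levels +-step/2 and +-3*step/2."""
    cell = np.clip(np.floor(np.abs(x) / step), 0, 1)   # inner or outer magnitude cell
    return np.sign(x) * (cell + 0.5) * step

# Grid-search the step size that minimizes the mean-squared error (distortion).
steps = np.linspace(0.5, 2.5, 201)
mse = [np.mean((samples - quantize_2bit_uniform(samples, s)) ** 2) for s in steps]
best = steps[int(np.argmin(mse))]
print(f"best step ~ {best:.3f}, estimated MSE ~ {min(mse):.4f}")
```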

References


Mixed Precision Training With 8-bit Floating Point

This paper proposes a method to train deep neural networks using an 8-bit floating point representation for weights, activations, errors, and gradients, together with an enhanced loss scaling method that augments the reduced subnormal range of 8-bit floating point for improved error propagation.

Training Deep Neural Networks with 8-bit Floating Point Numbers

This work demonstrates, for the first time, the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets.

ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning

The ZipML framework is able to execute training at low precision with no bias, guaranteeing convergence, whereas naive quantization would introduce significant bias, and it enables an FPGA prototype that is up to 6.5× faster than an implementation using full 32-bit precision.

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

This paper demonstrates how a decomposition into multiple smaller datatypes can be used to assemble a high-precision result, leveraging the higher-precision accumulation of the FMA unit, and examines the solution of linear equations formulated in residual form, which allows for iterative refinement.
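The decomposition idea can be emulated in software: split each float32 value into a high and a low bfloat16 term and accumulate the partial products in float32. The sketch below is a toy emulation of that idea, not the paper's FMA-based kernels:

```python
import torch

def split_bf16(x):
    """Split a float32 tensor into two bfloat16 terms with x ~ hi + lo."""
    hi = x.to(torch.bfloat16)
    lo = (x - hi.to(torch.float32)).to(torch.bfloat16)
    return hi, lo

torch.manual_seed(0)
a, b = torch.randn(4096), torch.randn(4096)
exact = torch.dot(a.double(), b.double())

# Naive bfloat16 dot product: each input rounded once to bfloat16.
naive = torch.dot(a.to(torch.bfloat16).float(), b.to(torch.bfloat16).float())

# Two-term decomposition: three partial products accumulated in float32
# recover most of the precision lost to the single bfloat16 rounding.
a_hi, a_lo = split_bf16(a)
b_hi, b_lo = split_bf16(b)
compensated = (torch.dot(a_hi.float(), b_hi.float())
               + torch.dot(a_hi.float(), b_lo.float())
               + torch.dot(a_lo.float(), b_hi.float()))

print(abs(naive - exact).item(), abs(compensated - exact).item())
```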

DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

DoReFa-Net, a method to train convolutional neural networks that have low-bitwidth weights and activations using low-bitwidth parameter gradients, is proposed and can achieve prediction accuracy comparable to 32-bit counterparts.

Mixed Precision Training

This work introduces a technique to train deep neural networks using half precision floating point numbers, and demonstrates that this approach works for a wide variety of models including convolution neural networks, recurrent neural networks and generative adversarial networks.
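The recipe this entry summarizes (a float32 master copy of the weights plus loss scaling) can be sketched in a few lines of plain PyTorch. This is an illustrative reimplementation, not the authors' code, and it uses bfloat16 for the working copy so it runs on CPU, whereas the paper targets IEEE half precision; the shapes, learning rate, and loss scale are made up.

```python
import torch

torch.manual_seed(0)
master_w = torch.randn(64, 32)                 # float32 master weights
x, y = torch.randn(8, 32), torch.randn(8, 64)  # toy data
lr, loss_scale = 0.1, 1024.0

for _ in range(10):
    # Low-precision working copy of the weights for the forward/backward pass.
    w16 = master_w.to(torch.bfloat16).requires_grad_()
    out = x.to(torch.bfloat16) @ w16.t()
    loss = (out.float() - y).pow(2).mean()
    (loss * loss_scale).backward()             # scale up so tiny gradients survive 16-bit
    grad = w16.grad.float() / loss_scale       # unscale in float32
    master_w -= lr * grad                      # update the float32 master copy
```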

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

A binary matrix multiplication GPU kernel is programmed with which the MNIST QNN runs 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.
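The arithmetic trick behind such a kernel is that a dot product of ±1 vectors reduces to XNOR plus popcount: encoding +1 as bit 1 and -1 as bit 0, matching bits contribute +1 and mismatching bits -1, so the dot product equals n - 2·popcount(a XOR b). A tiny NumPy check of that identity (not the authors' GPU kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
a = rng.integers(0, 2, n)                        # bit 1 encodes +1, bit 0 encodes -1
b = rng.integers(0, 2, n)

dot_pm1 = np.dot(2 * a - 1, 2 * b - 1)           # ordinary dot product over {-1, +1}
dot_xnor = n - 2 * int(np.count_nonzero(a ^ b))  # XNOR/popcount form
print(dot_pm1, dot_xnor)                         # the two values agree
```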

Deep Learning with Limited Numerical Precision

The results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy.

QPyTorch: A Low-Precision Arithmetic Simulation Framework

QPyTorch is a low-precision arithmetic simulation framework built natively in PyTorch that leverages an efficient fused-kernel approach to reduce simulator overhead, which enables simulation of large-scale, realistic problems.
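The basic simulation idea, snapping intermediate tensors back onto a low-precision grid while the actual arithmetic runs in float32, can be hand-rolled without the library. The sketch below emulates bfloat16 storage via a dtype round trip; it illustrates the concept only and does not use QPyTorch's actual API:

```python
import torch

def simulate_bf16(x):
    """Emulate bfloat16 storage by a round trip while computing in float32."""
    return x.to(torch.bfloat16).to(torch.float32)

# Simulated low-precision forward pass: every intermediate is quantized back
# to bfloat16-representable values; a dedicated simulator fuses this step
# into its kernels to keep the overhead low.
torch.manual_seed(0)
w = simulate_bf16(torch.randn(64, 32))
x = simulate_bf16(torch.randn(8, 32))
h = simulate_bf16(torch.relu(x @ w.t()))
print(h.dtype)   # storage stays float32, but the values lie on the bfloat16 grid
```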

Bfloat16 Processing for Neural Networks

This paper proposes a possible implementation of a BF16 multiply-accumulation operation that relaxes several IEEE Floating-Point Standard features to afford low-cost hardware implementations and shows that this approach achieves the same network-level accuracy as using IEEE single-precision arithmetic ("FP32") for less than half the datapath area cost and with greater throughput.
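The numerical behaviour being claimed, bfloat16 products feeding a wider accumulator, can be imitated in a few lines (a sketch of the arithmetic only, not of the hardware datapath):

```python
import torch

torch.manual_seed(0)
a = torch.randn(4096).to(torch.bfloat16)
b = torch.randn(4096).to(torch.bfloat16)

# Accumulate bfloat16 products in float32, as a BF16 MAC with a wider
# accumulator would, and compare with accumulating in bfloat16 itself.
acc32 = torch.zeros((), dtype=torch.float32)
acc16 = torch.zeros((), dtype=torch.bfloat16)
for ai, bi in zip(a, b):
    acc32 += ai.float() * bi.float()
    acc16 += ai * bi

reference = torch.dot(a.float(), b.float())
print((acc32 - reference).abs().item(), (acc16.float() - reference).abs().item())
```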