LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for the feed-forward and attention projection layers in transformers, which cuts the memory needed for inference in half while retaining full-precision performance. With our method, a 175B-parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and…
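The mixed-precision decomposition at the heart of this method can be sketched in a few lines: feature dimensions containing outliers are multiplied in floating point, while the remaining dimensions use simulated Int8 with row- and column-wise absmax scaling. This is a minimal NumPy illustration under my own naming, not the paper's CUDA implementation; the `threshold` default echoes the outlier criterion the paper discusses, but the function and variable names are assumptions.

```python
import numpy as np

def int8_matmul_with_outliers(X, W, threshold=6.0):
    """Sketch of mixed-precision decomposition: columns of X whose
    magnitude exceeds `threshold` stay in floating point, the rest
    are multiplied in (simulated) Int8 with absmax scaling."""
    outlier_cols = np.any(np.abs(X) > threshold, axis=0)  # feature dims with outliers
    # Floating-point path for the outlier dimensions
    out_fp = X[:, outlier_cols] @ W[outlier_cols, :]
    # Simulated Int8 path for the remaining dimensions
    Xs, Ws = X[:, ~outlier_cols], W[~outlier_cols, :]
    sx = np.abs(Xs).max(axis=1, keepdims=True) / 127.0 + 1e-12  # per-row scale of X
    sw = np.abs(Ws).max(axis=0, keepdims=True) / 127.0 + 1e-12  # per-column scale of W
    Xq = np.round(Xs / sx).astype(np.int8)
    Wq = np.round(Ws / sw).astype(np.int8)
    out_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
    return out_fp + out_int8
```

Vector-wise scaling (one scale per row of X and per column of W) keeps the Int8 path accurate; the handful of outlier columns that would otherwise dominate those scales are routed through the floating-point path instead.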

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

The proposed SmoothQuant solution enables INT8 quantization of both weights and activations for all the GEMMs in LLMs, including OPT-175B, BLOOM-176B, and GLM-130B, and has better hardware efficiency than existing techniques that use mixed-precision activation quantization or weight-only quantization.
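The smoothing idea can be illustrated with a small sketch: each activation channel is divided by a per-channel factor s_j and the matching weight row is multiplied by s_j, leaving the product mathematically unchanged while flattening activation outliers. A hedged NumPy sketch, assuming the commonly described factor s_j = max|X_j|^α / max|W_j|^(1−α); `alpha` is the migration-strength knob, everything else is illustrative naming.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Sketch of SmoothQuant's difficulty migration: divide activation
    channel j by s_j and multiply the matching weight row by s_j,
    which leaves X @ W unchanged but shrinks activation outliers."""
    ax = np.abs(X).max(axis=0)                 # per-channel activation range
    aw = np.abs(W).max(axis=1)                 # per-channel weight range
    s = (ax ** alpha) / (aw ** (1 - alpha) + 1e-12)
    s = np.maximum(s, 1e-5)                    # guard against near-zero factors
    return X / s, W * s[:, None]
```

Because the transform is exact, it can be folded offline into the preceding layer's parameters; quantization then sees a much flatter activation distribution.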

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

GPTQ is proposed, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient: it can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight with negligible accuracy degradation relative to the uncompressed baseline.
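The column-by-column quantize-and-compensate loop at the heart of GPTQ can be sketched as follows. This is a deliberately simplified toy: it uses a static inverse Hessian and a single global scale, whereas the actual method uses a Cholesky reformulation, lazy batched updates, and finer-grained scales; all names here are my own.

```python
import numpy as np

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Toy GPTQ-style quantizer: quantize W (out_features x in_features)
    one input column at a time, spreading each column's rounding error
    onto the not-yet-quantized columns via the inverse Hessian of the
    layer reconstruction loss, H = X^T X over calibration inputs X."""
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    levels = 2 ** (bits - 1) - 1                   # symmetric grid, e.g. +/-7 for 4 bits
    H = X.T @ X
    H += damp * (np.trace(H) / d) * np.eye(d)      # damping for numerical stability
    Hinv = np.linalg.inv(H)
    scale = np.abs(W).max() / levels               # one global scale, for simplicity
    Q = np.zeros_like(W)
    for i in range(d):
        Q[:, i] = np.clip(np.round(W[:, i] / scale), -levels, levels) * scale
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])  # compensate later columns
    return Q, scale
```

The compensation step is what separates this family of methods from plain round-to-nearest: each column's rounding error is absorbed by the columns still awaiting quantization, keeping the layer's output close to the full-precision one.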

nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models

An efficient inference framework for large-scale generative language models, in which weights are quantized by a non-uniform quantization method and quantized matrix multiplications are accelerated by the proposed kernel, called nuQmm, which allows a wide trade-off between compression ratio and accuracy.

Efficiently Scaling Transformer Inference

A simple analytical model for inference efficiency is developed to select the best multi-dimensional partitioning techniques for TPU v4 slices based on application requirements; combined with a suite of low-level optimizations, this achieves a new Pareto frontier on the latency and model-FLOPS-utilization trade-off for 500B+ parameter models, outperforming the FasterTransformer suite of benchmarks.

Accuracy Boosters: Epoch-Driven Mixed-Mantissa Block Floating-Point for DNN Training

A full-scale exploration of the HBFP design space, including minimal mantissa encoding, varying block sizes, and mixed mantissa bit-widths across layers and epochs, which proposes Accuracy Boosters, an epoch-driven mixed-mantissa HBFP that uses a 6-bit mantissa only in the last epoch and converts 99…
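Block floating point, the representation HBFP builds on, can be sketched briefly: each block of values shares one exponent and keeps only a short per-value mantissa. A minimal NumPy illustration, assuming simple rounding to a shared power-of-two step; the real HBFP encoding and its hardware mapping are more involved.

```python
import numpy as np

def to_bfp(x, mantissa_bits=6, block=16):
    """Sketch of block floating point: each block of values shares one
    exponent (sized to its largest magnitude) and keeps only a short
    mantissa per value, here via rounding to the shared scale."""
    flat = x.ravel()
    pad = (-len(flat)) % block
    blocks = np.concatenate([flat, np.zeros(pad)]).reshape(-1, block)
    # Shared exponent: power of two covering the block's max magnitude
    exp = np.ceil(np.log2(np.abs(blocks).max(axis=1, keepdims=True) + 1e-30))
    step = 2.0 ** (exp - (mantissa_bits - 1))   # quantization step per block
    q = np.round(blocks / step) * step
    out = q.ravel()
    return (out[:-pad] if pad else out).reshape(x.shape)
```

Sharing one exponent per block is what makes the format cheap in hardware: within a block, multiply-accumulate reduces to fixed-point arithmetic on the short mantissas.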

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

BLOOM is a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers and achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning.

Outlier Dimensions that Disrupt Transformers Are Driven by Frequency

It is found that in both BERT and RoBERTa the token frequency, known to contribute to anisotropicity, also contributes to the outlier phenomenon, which in turn contributes to the “vertical” self-attention pattern that enables the model to focus on the special tokens.

GLM-130B: An Open Bilingual Pre-trained Model

An attempt to open-source a 100B-scale model at least as good as GPT-3 is introduced, unveiling how models of such a scale can be successfully pre-trained, including its design choices, training strategies for both efficiency and stability, and engineering efforts.

Lo-fi: Distributed Fine-tuning without Communication

By removing the communication requirement, lo-fi reduces resource barriers for fine-tuning large models and enables fine-tuning in settings with prohibitive communication costs.

On the Influence of Tokenizers in NLP Pipelines (2022)

This thesis examines the influence of tokenization in NLP pipelines by analyzing, reproducing, and quantifying claims from the token-free NLP literature, using the example of NER, and concludes that token-free models, like ByT5, offer significant advantages over their tokenizer-based alternatives.

Towards Fully 8-bit Integer Inference for the Transformer Model

It is shown that after a principled modification of the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm, Scale Propagation, can be derived that achieves comparable performance with the floating-point baseline while requiring nearly 4x less memory footprint.

Understanding and Overcoming the Challenges of Efficient Transformer Quantization

This work introduces a novel quantization scheme – per-embedding-group quantization, and shows that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss.

F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization

This work presents F8Net, a novel quantization framework consisting of only fixed-point 8-bit multiplication, which achieves comparable or better performance when compared not only to existing quantization techniques with INT32 multiplication or floating-point arithmetic, but also to the full-precision counterparts, achieving state-of-the-art results.

Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks

This work introduces two learnable statistics of the DNN tensors, shifted and squeezed factors, that are used to optimally adjust the range of the tensors in 8 bits, thus minimizing the loss in information due to quantization.

Mixed Precision Training With 8-bit Floating Point

This paper proposes a method to train deep neural networks using 8-bit floating point representation for weights, activations, errors, and gradients, and proposes an enhanced loss scaling method to augment the reduced subnormal range of 8-bit floating point for improved error propagation.

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

This work is able to show that ZeroQuant can reduce the precision for weights and activations to INT8 in a cost-free way for both BERT and GPT-3-style models with minimal accuracy impact, which leads to up to 5.19x/4.16x speedup on those models compared to FP16 inference.

Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks

This work proposes a hybrid FP8 (HFP8) format and DNN end-to-end distributed training procedure and demonstrates, using HFP8, the successful training of deep learning models across a whole spectrum of applications including Image Classification, Object Detection, Language and Speech without accuracy degradation.

Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

It is discovered that γ in LayerNorm (LN) acts as a sinful amplifier for the outliers, and that the importance of outliers varies greatly: some outliers, provided by a few tokens, cover a large area but can be clipped sharply without negative impact.

8-bit Optimizers via Block-wise Quantization

This paper develops a fast, high-precision non-linear quantization method – block-wise dynamic quantization – that enables stable 8-bit optimizers which maintain 32-bit performance at a fraction of the memory footprint and without any changes to the original hyperparameters.
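The block-wise idea can be sketched as follows: the tensor is split into small independent blocks, each with its own absmax scale, so a single outlier only degrades its own block. Note this sketch uses a linear absmax mapping for simplicity, whereas the paper's method is a non-linear dynamic quantization; names and the `block` default are illustrative.

```python
import numpy as np

def blockwise_quantize(x, block=64):
    """Sketch of block-wise quantization: the flattened tensor is split
    into independent blocks, each normalized by its own absolute
    maximum, so one outlier only affects its own block."""
    flat = x.ravel()
    pad = (-len(flat)) % block
    flat = np.concatenate([flat, np.zeros(pad)])
    blocks = flat.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales, x.shape, pad

def blockwise_dequantize(q, scales, shape, pad):
    """Invert blockwise_quantize: rescale, drop padding, restore shape."""
    flat = ((q.astype(np.float32) / 127) * scales).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)
```

Per-block scales are also what makes the scheme stable for optimizer states: a rare large gradient statistic cannot wash out the resolution of the rest of the tensor.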