Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Authors: Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, Anand Raghunathan
Venue: 2021 58th ACM/IEEE Design Automation Conference (DAC)
Transformers have transformed the field of natural language processing. Their superior performance is largely attributed to the use of stacked “self-attention” layers, each of which consists of matrix multiplies as well as softmax operations. As a result, unlike other neural networks, the softmax operation accounts for a significant fraction of the total run-time of Transformers. To address this, we propose Softermax, a hardware-friendly softmax design. Softermax consists of base replacement… 
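The abstract is cut off after "base replacement", so the following is only a sketch. It assumes that base replacement means swapping the natural base e in the softmax exponential for base 2, which is cheaper in hardware because the integer part of a power-of-two exponent reduces to a bit shift. The function name `softmax` and its `base` parameter are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores, base=math.e):
    """Numerically stable softmax with a configurable exponent base.

    `base` is a hypothetical parameter for illustration: base=math.e gives
    the classical softmax, base=2.0 gives a base-2 variant (assumed here to
    be what "base replacement" refers to).
    """
    # Subtract the maximum so the largest exponent is 0 and nothing overflows.
    m = max(scores)
    weights = [base ** (s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


scores = [2.0, 1.0, 0.0]
probs_e = softmax(scores)            # classical softmax
probs_2 = softmax(scores, base=2.0)  # unnormalized weights are 1, 0.5, 0.25
```

Both variants return a probability distribution with the same ranking of scores. The base-2 distribution is flatter, so a model would generally need to be trained or fine-tuned with the same base it uses at inference.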


Efficient Softmax Approximation for Deep Neural Networks with Attention Mechanism
This paper proposes two methods for approximating softmax computation, both based on lookup tables (LUTs), and shows that an 8-bit approximation keeps the accuracy loss below 1.0%.
SimA: Simple Softmax-free Attention for Vision Transformers
SimA, a simple but effective Softmax-free attention block, normalizes the query and key matrices with a simple ℓ1-norm instead of a Softmax layer, achieving accuracy on par with SOTA models without any Softmax layer.
LeOPArd is a bit-serial architecture for transformer language models with a bit-level early-termination microarchitectural mechanism. It piggybacks on back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning.
SALO: An Efficient Spatial Accelerator Enabling Hybrid Sparse Attention Mechanisms for Long Sequences
SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator to perform the efficient attention computation and achieves 17.
I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference
I-ViT, an integer-only quantization scheme for ViTs, is proposed to enable ViTs to perform the entire computational graph of inference with integer operations and bit-shifting and no floating-point operations.
TrimBERT: Tailoring BERT for Trade-offs
This work shows that reducing the number of intermediate layers in BERT-Base results in minimal fine-tuning accuracy loss on downstream tasks while significantly decreasing model size and training time.
NN-LUT: Neural Approximation of Non-Linear Operations for Efficient Transformer Inference
The proposed framework called NN-LUT can accurately replace all the non-linear operations in popular BERT models with significant reductions in area, power consumption, and latency.


Efficient Precision-Adjustable Architecture for Softmax Function in Deep Learning
A hardware-friendly and precision-adjustable calculation method for softmax is proposed, which can meet different precision requirements in various deep learning (DL) tasks, and results show that the architecture significantly outperforms other works in speed and area.
Online normalizer calculation for softmax
A way to compute classical Softmax with fewer memory accesses is proposed, and it is hypothesized that this reduction in memory accesses should improve Softmax performance on actual hardware.
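As a rough illustration of how a softmax can get by with fewer passes over its input, the sketch below keeps a running maximum and a running normalizer in a single loop, rescaling the normalizer whenever the maximum changes. This is a minimal sketch of the general online-normalizer idea under that assumption, not the cited paper's exact formulation; the name `online_softmax` is hypothetical.

```python
import math


def online_softmax(scores):
    """Softmax whose max and normalizer are computed in one pass (sketch)."""
    running_max = float("-inf")
    normalizer = 0.0
    for s in scores:
        new_max = max(running_max, s)
        # Rescale the sum accumulated so far to the new maximum, then add
        # the current term. On the first step exp(-inf) is 0.0, so the
        # empty sum stays 0.
        normalizer = normalizer * math.exp(running_max - new_max) + math.exp(s - new_max)
        running_max = new_max
    # A second pass over the scores produces the normalized outputs.
    return [math.exp(s - running_max) / normalizer for s in scores]


result = online_softmax([1.0, 3.0, 2.0])
```

A conventional stable softmax reads the input three times (maximum, normalizer, outputs). Here the first two reads are fused, which is where the saving in memory accesses comes from.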
Design and Implementation of an Approximate Softmax Layer for Deep Neural Networks
Compared with the state-of-the-art designs, the proposed approximate softmax design consumes significantly less resources and also achieves high performance while maintaining a very high accuracy.
Efficient Softmax Hardware Architecture for Deep Neural Networks
The classification rules of neural networks are summarized, and a natural-logarithm calculation unit based on the Maclaurin series, together with a matching data preprocessing scheme, is improved to achieve proper accuracy, a good trade-off, and strong expansibility.
Towards Fully 8-bit Integer Inference for the Transformer Model
It is shown that, after a principled modification of the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm, Scale Propagation, can be derived. It achieves performance comparable to the floating-point baseline with a nearly 4x smaller memory footprint.
Optimizing Transformers with Approximate Computing for Faster, Smaller and more Accurate NLP Models
This work applies approximate computing to Transformers in NLP tasks and proposes a framework for creating models that are smaller, faster, and in some cases more accurate, depending on the user's constraints.
MAGNet: A Modular Accelerator Generator for Neural Networks
MAGNet, a modular accelerator generator for neural networks, is proposed, and an inference accelerator optimized for image classification is designed using three different neural networks: AlexNet, ResNet, and DriveNet.
Q8BERT: Quantized 8Bit BERT
This work shows how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4x with minimal accuracy loss. The resulting quantized model can accelerate inference on hardware that supports 8-bit integer operations.
HuggingFace's Transformers: State-of-the-art Natural Language Processing
The Transformers library is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
In-datacenter performance analysis of a tensor processing unit
N. Jouppi, C. Young, D. Yoon. Computer Science. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 2017.
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), that has been deployed in datacenters since 2015 to accelerate the inference phase of neural networks (NN). It compares the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.