NPE: An FPGA-based Overlay Processor for Natural Language Processing

  @inproceedings{khan2021npe,
    title={NPE: An FPGA-based Overlay Processor for Natural Language Processing},
    author={Hamza Mustafa Khan and Asma Khan and Zainab F. Khan and Lun Bin Huang and Kun Wang and Lei He},
    booktitle={The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
    year={2021},
  }
  • Published 17 February 2021
In recent years, transformer-based models have shown state-of-the-art results for Natural Language Processing (NLP). In particular, the introduction of the BERT language model brought with it breakthroughs in tasks such as question answering and natural language inference, advancing applications that allow humans to interact naturally with embedded devices. FPGA-based overlay processors have been shown as effective solutions for edge image and video processing applications, which mostly rely on… 
Vis-TOP: Visual Transformer Overlay Processor
Vis-TOP (Visual Transformer Overlay Processor) is an overlay processor for various visual Transformer models that provides a cost- and power-efficient solution on reconfigurable devices for computer vision at the edge, in terms of both resource consumption and inference speed.
Low-precision Floating-point Arithmetic for High-performance FPGA-based CNN Acceleration
A low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration that finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss.
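As a rough illustration of the idea, low-precision floating-point quantization maps each fp32 value onto a coarse sign/exponent/mantissa grid. The 1-sign/4-exponent/3-mantissa split below is an assumed example, not necessarily the representation selected in the paper:

```python
import numpy as np

def quantize_lpfp(x, exp_bits=4, man_bits=3):
    """Round float32 values to a low-precision floating-point grid.

    Illustrative sketch only: the bit split (1 sign + exp_bits exponent
    + man_bits mantissa = 8 bits) is an assumption for demonstration.
    """
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    mag = np.abs(x)
    bias = 2 ** (exp_bits - 1) - 1          # exponent range: [-bias, bias]
    out = np.zeros_like(mag)
    nz = mag > 0
    e = np.floor(np.log2(mag[nz]))          # unbiased exponent of each value
    e = np.clip(e, -bias, bias)             # saturate out-of-range exponents
    # Round the normalized significand to man_bits fractional bits.
    m = np.round(mag[nz] / 2.0 ** e * 2 ** man_bits) / 2 ** man_bits
    out[nz] = m * 2.0 ** e
    return sign * out
```

With 3 mantissa bits, values between 1 and 2 snap to a grid of spacing 1/8, so 1.1 rounds to 1.125.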


FTRANS: energy-efficient acceleration of transformers using FPGA
This paper proposes an efficient acceleration framework, FTrans, for transformer-based large-scale language representations, which includes an enhanced block-circulant matrix (BCM)-based weight representation that enables model compression at the algorithm level with little accuracy degradation.
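The BCM idea is compact enough to sketch: each b×b weight block is replaced by a circulant block defined by a single length-b vector, so storage drops from b² to b values, and the block's matrix-vector product reduces to circular convolution computed via FFT. A minimal sketch of the generic BCM arithmetic (not FTRANS's specific implementation):

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply a circulant matrix by x in O(b log b) via FFT.

    c is the first column of the circulant block; the full b x b block
    (b^2 values) is never materialized, only its b defining parameters.
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))
```

The FFT route replaces a dense O(b²) block multiply with three length-b transforms, which is where both the compression and the speedup come from.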
OPU: An FPGA-Based Overlay Processor for Convolutional Neural Networks
A domain-specific FPGA overlay processor, named OPU, is proposed to accelerate CNN networks. It offers software-like programmability for CNN end users: CNN algorithms are automatically compiled into executable codes, which are loaded and executed by OPU without FPGA reconfiguration when switching or updating CNN networks.
Light-OPU: An FPGA-based Overlay Processor for Lightweight Convolutional Neural Networks
This paper proposes Light-OPU, an FPGA-based overlay processor with a corresponding compilation flow for general lightweight CNN (LW-CNN) acceleration, evaluated on all major LW-CNNs including the newly released MobileNetV3.
Uni-OPU: An FPGA-Based Uniform Accelerator for Convolutional and Transposed Convolutional Networks
This is the first in-depth study to completely unify the computation process of zero-TCONV, NN-TCONV, and CONV layers; high acceleration performance is also achieved on NN-TCONV networks, whose acceleration had not been explored before.
Q8BERT: Quantized 8Bit BERT
This work shows how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4x with minimal accuracy loss; the produced quantized model can accelerate inference speed if it is optimized for hardware with 8-bit integer support.
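The core trick of quantization-aware training is "fake quantization": simulate the 8-bit rounding in the forward pass during fine-tuning so the weights adapt to the quantized grid. A minimal sketch of symmetric linear fake quantization (a generic scheme; Q8BERT's exact details, such as per-channel scales and the straight-through gradient estimator, are not reproduced here):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulate symmetric linear quantization: round to an integer grid,
    then map back to floats so downstream computation stays in fp32."""
    x = np.asarray(x, dtype=np.float32)
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    scale = np.abs(x).max() / qmax          # per-tensor scale (an assumption)
    if scale == 0:
        return x.copy()                     # all-zero tensor: nothing to round
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return (q * scale).astype(np.float32)
```

After this pass a tensor takes at most 256 distinct values, and the rounding error per element is bounded by half the quantization step.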
Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model
This work quantizes a trained Transformer machine language translation model leveraging INT8/VNNI instructions in the latest Intel Cascade Lake processors to improve inference performance while maintaining less than 0.5% drop in accuracy.
OPTIMUS: OPTImized matrix MUltiplication Structure for Transformer neural network accelerator
We present a high-performance Transformer neural network inference accelerator named OPTIMUS. OPTIMUS has several features for performance enhancement such as the redundant computation skipping…
A^3: Accelerating Attention Mechanisms in Neural Networks with Approximation
A3, which accelerates attention mechanisms in neural networks with algorithmic approximation and hardware specialization, is designed and architected; it achieves multiple orders of magnitude improvement in energy efficiency (performance/watt) as well as substantial speedup over state-of-the-art conventional hardware.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
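The attention operation at the heart of the Transformer is compact enough to state directly. A minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, for a single head with no masking:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: each query attends to all keys, and the
    output is the attention-weighted average of the value rows."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```

One sanity check: with all-zero queries the softmax is uniform, so every output row is simply the mean of the value rows.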
TinyBERT: Distilling BERT for Natural Language Understanding
A novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models is proposed; by leveraging this new KD method, the abundant knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT.
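For reference, the generic soft-label distillation loss that such teacher-student methods build on can be written in a few lines. TinyBERT's full Transformer distillation additionally matches embeddings, hidden states, and attention maps layer by layer, which this sketch omits:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiation
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target KD loss: KL(teacher_T || student_T) scaled by T^2.

    Generic logit distillation in the style of Hinton et al.; the T^2
    factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, T)          # softened teacher distribution
    q = softmax(student_logits, T)          # softened student distribution
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)
```

The loss is zero exactly when the student reproduces the teacher's (softened) output distribution, and positive otherwise.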