BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing
@article{Umuroglu2018BISMOAS,
  title   = {BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing},
  author  = {Yaman Umuroglu and Lahiru Rasnayake and Magnus Sj{\"a}lander},
  journal = {2018 28th International Conference on Field Programmable Logic and Applications (FPL)},
  year    = {2018},
  pages   = {307-3077}
}
Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different…
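The core identity behind the overlay is the bit-serial decomposition of integer matrix multiplication: an unsigned matrix product is a weighted sum of binary matrix products over the operands' bit planes. A minimal Python sketch of this identity (function names are illustrative, not from the paper):

```python
import numpy as np

def bit_serial_matmul(A, B, a_bits, b_bits):
    """Compute A @ B for unsigned integer matrices as a weighted sum
    of binary (0/1) matrix products, one pair of bit planes at a time."""
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(a_bits):
        Ai = ((A >> i) & 1).astype(np.int64)       # i-th bit plane of A
        for j in range(b_bits):
            Bj = ((B >> j) & 1).astype(np.int64)   # j-th bit plane of B
            acc += (Ai @ Bj) << (i + j)            # binary product, shifted by bit weight
    return acc

rng = np.random.default_rng(0)
A = rng.integers(0, 2**4, size=(8, 16))   # 4-bit unsigned operands
B = rng.integers(0, 2**3, size=(16, 8))   # 3-bit unsigned operands
assert np.array_equal(bit_serial_matmul(A, B, 4, 3), A @ B)
```

Each term Ai @ Bj is a product of two binary matrices, which maps to AND gates and popcounts in hardware; precision becomes a runtime parameter because the loop bounds a_bits and b_bits can vary per call. Signed operands can be handled by giving the most significant bit plane a negative weight.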
63 Citations
Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing
- Computer ScienceACM Trans. Reconfigurable Technol. Syst.
- 2019
It is shown how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes six-input LUTs, achieving a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC.
Improving Memory Access Locality for Vectorized Bit-Serial Matrix Multiplication in Reconfigurable Computing
- Computer Science2019 International Conference on Field-Programmable Technology (ICFPT)
- 2019
This work studies the inherent locality of bit-serial matrix multiplication and proposes a locality-aware scheduling algorithm that eliminates redundant data fetches from memory, improving performance by up to 76% compared to a schedule that computes each binary matrix multiplication in sequence.
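A toy fetch-count model (an illustration, not the paper's scheduling algorithm) shows why pass ordering matters: assuming a buffer that holds one bit plane per operand, a serpentine pass order already removes redundant fetches compared to a naive row-major order over the bit-plane pairs.

```python
def fetches(schedule):
    """Count operand bit-plane fetches for a pass schedule, assuming a
    buffer that holds one A plane and one B plane at a time."""
    held_a = held_b = None
    n = 0
    for i, j in schedule:
        if i != held_a:
            held_a, n = i, n + 1
        if j != held_b:
            held_b, n = j, n + 1
    return n

A_BITS, B_BITS = 4, 4
naive = [(i, j) for i in range(A_BITS) for j in range(B_BITS)]
# serpentine order reuses the buffered B plane across i transitions
aware = [(i, j) for i in range(A_BITS)
         for j in (range(B_BITS) if i % 2 == 0 else reversed(range(B_BITS)))]
print(fetches(naive), fetches(aware))  # 20 vs 17 plane fetches
```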
BiS-KM: Enabling Any-Precision K-Means on FPGAs
- Computer ScienceFPGA
- 2020
Bit-Serial K-Means (BiS-KM) is proposed, combining a hybrid memory layout supporting data retrieval at any level of precision, a novel FPGA design based on bit-serial arithmetic, and a modified K-means algorithm tailored to FPGAs, providing an almost linear speedup as precision decreases.
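The any-precision retrieval idea can be sketched with a plain bit-plane layout (BiS-KM's actual layout is a hybrid; this only illustrates the principle, and the helper names are hypothetical): data is stored one bit plane at a time, most significant first, so reading only the top p planes yields a p-bit approximation without reformatting the stored data.

```python
import numpy as np

def to_bit_planes(X, bits):
    """Store unsigned integer data as bit planes, most significant first."""
    return [((X >> (bits - 1 - p)) & 1).astype(np.uint8) for p in range(bits)]

def read_at_precision(planes, bits, p):
    """Reconstruct a p-bit approximation by reading only the top p planes."""
    acc = np.zeros_like(planes[0], dtype=np.int64)
    for k in range(p):
        acc += planes[k].astype(np.int64) << (bits - 1 - k)
    return acc

X = np.array([[13, 7], [200, 45]], dtype=np.int64)
planes = to_bit_planes(X, 8)
print(read_at_precision(planes, 8, 8))  # exact: all 8 planes read
print(read_at_precision(planes, 8, 4))  # 4-bit approximation, top planes only
```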
Ax-BxP: Approximate Blocked Computation for Precision-reconfigurable Deep Neural Network Acceleration
- Computer ScienceACM Trans. Design Autom. Electr. Syst.
- 2022
A DNN accelerator that embodies approximate blocked computation and a method to determine a suitable approximation configuration for any given DNN are proposed, achieving improvements in system energy and performance over an 8-bit fixed-point (FxP8) baseline.
N3H-Core: Neuron-designed Neural Network Accelerator via FPGA-based Heterogeneous Computing Cores
- Computer ScienceFPGA
- 2022
The proposed accelerator consists of DSP- and LUT-based GEneral Matrix-Multiplication (GEMM) computing cores that form the computing system in a heterogeneous fashion, outperforming the state-of-the-art Mix&Match design with reduced latency and higher inference accuracy.
MP-OPU: A Mixed Precision FPGA-based Overlay Processor for Convolutional Neural Networks
- Computer Science2021 31st International Conference on Field-Programmable Logic and Applications (FPL)
- 2021
This paper proposes a Mixed Precision FPGA-based Overlay Processor (MP-OPU) to fully leverage the advantages of mixed precision for both conventional and lightweight CNNs.
Hardware-Centric AutoML for Mixed-Precision Quantization
- Computer ScienceInternational Journal of Computer Vision
- 2020
The Hardware-Aware Automated Quantization (HAQ) framework is introduced, which automatically determines the quantization policy by taking the hardware accelerator's feedback into the design loop; the implications of different quantization policies are interpreted, offering insights for both neural network architecture design and hardware architecture design.
Understanding Cache Boundness of ML Operators on ARM Processors
- Computer ScienceArXiv
- 2021
This is the first in-detail analysis of dense and convolution operators, generated with TVM, that compares against the fundamental hardware limits of embedded ARM processors and explains the gap between computational peak performance, both theoretical and measured, and real-world state-of-the-art results.
NengoFPGA: an FPGA Backend for the Nengo Neural Simulator
- Computer Science
- 2019
An embedded Python-capable PYNQ FPGA implementation, supported by a Xilinx Vivado High-Level Synthesis (HLS) workflow, allows sub-millisecond implementation of adaptive neural networks with low-latency, direct I/O access to the physical world, and provides a seamless and user-friendly extension to the neural compiler Python package Nengo.
QuTiBench: Benchmarking Neural Networks on Heterogeneous Hardware
- Computer ScienceACM J. Emerg. Technol. Comput. Syst.
- 2019
QuTiBench is a novel multi-tiered benchmarking methodology that supports algorithmic optimizations such as quantization and helps system developers understand the benefits and limitations of these novel compute architectures in regard to specific neural networks, helping drive future innovation.
References
Generic and universal parallel matrix summation with a flexible compression goal for Xilinx FPGAs
- Computer Science2017 27th International Conference on Field Programmable Logic and Applications (FPL)
- 2017
A generic implementation of a bit matrix compressor for Xilinx FPGAs is presented, which does not require a generator tool, is agnostic of the aspect ratio of the input matrix, and can be used for multiplication in the same way as for single-column population count operations.
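The observation that one compressor serves both use cases follows from viewing both as bit-matrix summation: a population count sums a single-column bit matrix, and a multiplication sums the bit matrix formed by its partial products. A software stand-in (the paper implements this as LUT-based compressor trees; these helper names are illustrative):

```python
def compress(rows):
    """Sum a list of integers; the hardware analogue is a carry-save
    compressor tree over the corresponding bit matrix."""
    total = 0
    for r in rows:
        total += r
    return total

def popcount(x):
    # population count = compressing a single-column bit matrix
    return compress([(x >> i) & 1 for i in range(x.bit_length())])

def multiply(a, b):
    # the partial products of a*b form a bit matrix; compressing it
    # (summing the shifted rows) yields the product
    return compress([a << i for i in range(b.bit_length()) if (b >> i) & 1])

assert popcount(0b1011011) == 5
assert multiply(23, 45) == 23 * 45
```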
The Landscape of Parallel Computing Research: A View from Berkeley
- Computer Science
- 2006
The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: the overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems, and the target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
A Survey of Techniques for Approximate Computing
- Computer ScienceACM Comput. Surv.
- 2016
A survey of techniques for approximate computing (AC) is presented, discussing strategies for finding approximable program portions and monitoring output quality, techniques for using AC in different processing units, processor components, and memory technologies, as well as programming frameworks for AC.
Chisel: Constructing hardware in a Scala embedded language
- Computer ScienceDAC Design Automation Conference 2012
- 2012
Chisel, a new hardware construction language that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages, is introduced; by embedding Chisel in the Scala programming language, it raises the level of hardware design abstraction.
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
- Computer ScienceFPGA
- 2017
FINN, a framework for building fast and flexible FPGA accelerators using a heterogeneous streaming architecture, is presented; it implements fully connected, convolutional, and pooling layers, with per-layer compute resources tailored to user-provided throughput requirements.
Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
- Computer ScienceJ. Mach. Learn. Res.
- 2017
A binary matrix multiplication GPU kernel is programmed with which the MNIST QNN runs 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.
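The arithmetic identity such binary kernels rely on is XNOR-popcount: for bipolar {-1, +1} vectors packed as bit masks, the dot product equals the vector length minus twice the number of differing bits. A Python sketch of the identity only (the actual kernel uses GPU bitwise and population-count instructions; function names here are illustrative):

```python
def pack(vals):
    """Pack a bipolar {-1, +1} vector into a bit mask (+1 -> bit set)."""
    word = 0
    for k, v in enumerate(vals):
        if v == 1:
            word |= 1 << k
    return word

def binary_dot(a_word, b_word, n):
    """Dot product via XNOR-popcount: matching bits contribute +1,
    differing bits contribute -1."""
    differing = bin(a_word ^ b_word).count("1")
    return n - 2 * differing

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
assert binary_dot(pack(a), pack(b), len(a)) == sum(x * y for x, y in zip(a, b))
```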
Streamlined Deployment for Quantized Neural Networks
- Computer ScienceArXiv
- 2017
This work describes a streamlining flow to convert all QNN inference operations to integer ones and provides techniques based on processing one bit position at a time (bit-serial) to show how QNNs can be efficiently deployed using common bitwise operations.
A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study
- Computer ScienceFPGA
- 2018
This work presents a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that includes support for both traditional single precision floating point and reduced precision workloads, and illustrates that reduced precision representations such as binary achieve the best performance.
Weighted-Entropy-Based Quantization for Deep Neural Networks
- Computer Science2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This paper proposes a novel method for quantizing weights and activations based on the concept of weighted entropy, which achieves significant reductions in both the model size and the amount of computation with minimal accuracy loss.
Espresso: Efficient Forward Propagation for BCNNs
- Computer ScienceArXiv
- 2017
Espresso provides special convolutional and dense layers for BCNNs, leveraging bit-packing and bit-wise computations for efficient execution; these provide a speed-up of matrix-multiplication routines and at the same time reduce memory usage when storing parameters and activations.