BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing

Yaman Umuroglu, Lahiru Rasnayake, Magnus Själander
2018 28th International Conference on Field Programmable Logic and Applications (FPL)

Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different… 

Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

It is shown how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes six-input LUTs, achieving a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC.

Improving Memory Access Locality for Vectorized Bit-Serial Matrix Multiplication in Reconfigurable Computing

This work studies the inherent locality of bit-serial matrix multiplications and proposes a locality-aware scheduling algorithm that eliminates redundant data fetches from memory and improves performance by up to 76% compared to a schedule that computes each binary matrix multiplication in sequence.

BiS-KM: Enabling Any-Precision K-Means on FPGAs

Bit-Serial K-Means (BiS-KM), a combination of a hybrid memory layout supporting data retrieval at any level of precision, a novel FPGA design based on bit-serial arithmetic, and a modified K-means algorithm tailored to FPGAs, is proposed, providing an almost linear speedup as precision decreases.

Ax-BxP: Approximate Blocked Computation for Precision-reconfigurable Deep Neural Network Acceleration

A DNN accelerator that embodies approximate blocked computation and a method to determine a suitable approximation configuration for any given DNN are proposed, which achieves improvement in system energy and performance over an 8-bit fixed-point (FxP8) baseline.

N3H-Core: Neuron-designed Neural Network Accelerator via FPGA-based Heterogeneous Computing Cores

The proposed accelerator consists of DSP- and LUT-based GEneral Matrix-Multiplication computing cores, which form the entire computing system in a heterogeneous fashion; it outperforms the state-of-the-art Mix&Match design, with reduced latency and higher inference accuracy.

MP-OPU: A Mixed Precision FPGA-based Overlay Processor for Convolutional Neural Networks

This paper proposes a Mixed Precision FPGA-based Overlay Processor (MP-OPU) to fully leverage the advantages of mixed precision for both conventional and lightweight CNNs.

Hardware-Centric AutoML for Mixed-Precision Quantization

The Hardware-Aware Automated Quantization (HAQ) framework is introduced, which automatically determines the quantization policy and takes the hardware accelerator’s feedback into the design loop; the implications of different quantization policies are interpreted, offering insights for both neural network architecture design and hardware architecture design.

Understanding Cache Boundness of ML Operators on ARM Processors

This is the first in-detail analysis of dense and convolution operators, generated with TVM, that compares to the fundamental hardware limits of embedded ARM processors, and explains the gap between computational peak performance, theoretical and measured, and real-world state-of-the-art results.

NengoFPGA: an FPGA Backend for the Nengo Neural Simulator

NengoFPGA is an embedded, Python-capable PYNQ FPGA implementation, supported by a Xilinx Vivado High-Level Synthesis (HLS) workflow, that allows sub-millisecond implementation of adaptive neural networks with low-latency, direct I/O access to the physical world and a seamless and user-friendly extension to the neural compiler Python package Nengo.

QuTiBench: Benchmarking Neural Networks on Heterogeneous Hardware

QuTiBench is a novel multi-tiered benchmarking methodology that supports algorithmic optimizations such as quantization and helps system developers understand the benefits and limitations of these novel compute architectures in regard to specific neural networks and will help drive future innovation.



Generic and universal parallel matrix summation with a flexible compression goal for Xilinx FPGAs

Thomas B. Preußer
2017 27th International Conference on Field Programmable Logic and Applications (FPL)

A generic implementation of a bit matrix compressor for Xilinx FPGAs, which does not require a generator tool and is agnostic of the aspect ratio of the input matrix and can be used for multiplication the same way as it can be for single-column population count operations.

The Landscape of Parallel Computing Research: A View from Berkeley

The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly:
  • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems.
  • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.

A Survey of Techniques for Approximate Computing

A survey of techniques for approximate computing (AC), which discusses strategies for finding approximable program portions and monitoring output quality, techniques for using AC in different processing units, processor components, memory technologies, and so forth, as well as programming frameworks for AC.

Chisel: Constructing hardware in a Scala embedded language

Chisel, a new hardware construction language that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages, is introduced by embedding Chisel in the Scala programming language, raising the level of hardware design abstraction.

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture that implements fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements, is presented.

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

A binary matrix multiplication GPU kernel is programmed with which it is possible to run the MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.

Streamlined Deployment for Quantized Neural Networks

This work describes a streamlining flow to convert all QNN inference operations to integer ones and provides techniques based on processing one bit position at a time (bit-serial) to show how QNNs can be efficiently deployed using common bitwise operations.
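
The bit-serial idea mentioned above can be illustrated with a minimal software sketch. It assumes unsigned integer operands and uses NumPy; the function name `bit_serial_matmul` and its parameters are hypothetical and are not taken from the paper or from BISMO. Each operand is split into bit planes, every pair of planes is multiplied as a binary matrix product, and the partial results are weighted by the corresponding power of two.

```python
import numpy as np

def bit_serial_matmul(A, B, a_bits, b_bits):
    """Multiply unsigned integer matrices one bit position at a time.

    Hypothetical helper for illustration; assumes every entry of A fits in
    a_bits bits and every entry of B fits in b_bits bits.
    """
    A = np.asarray(A, dtype=np.int64)
    B = np.asarray(B, dtype=np.int64)
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(a_bits):
        A_i = (A >> i) & 1              # bit plane i of A, entries are 0 or 1
        for j in range(b_bits):
            B_j = (B >> j) & 1          # bit plane j of B
            # Binary matrix product: each entry counts the positions where
            # both bits are 1 (an AND followed by a population count).
            acc += (A_i @ B_j) << (i + j)
    return acc

A = np.array([[3, 1], [2, 0]])
B = np.array([[1, 2], [3, 1]])
result = bit_serial_matmul(A, B, a_bits=2, b_bits=2)
```

With 2-bit operands the loop performs four binary matrix products, so the work scales with the product of the two bit widths; `result` can be checked against the ordinary product `A @ B`.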

A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study

This work presents a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that includes support for both traditional single precision floating point and reduced precision workloads, and illustrates that reduced precision representations such as binary achieve the best performance.

Weighted-Entropy-Based Quantization for Deep Neural Networks

This paper proposes a novel method for quantizing weights and activations based on the concept of weighted entropy, which achieves significant reductions in both the model size and the amount of computation with minimal accuracy loss.

Espresso: Efficient Forward Propagation for BCNNs

Espresso provides special convolutional and dense layers for BCNNs, leveraging bit-packing and bit-wise computations for efficient execution, which provide a speed-up of matrix-multiplication routines, and at the same time, reduce memory usage when storing parameters and activations.
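
As a rough illustration of how bit-packing and bitwise computation can replace multiply-accumulate for binary vectors, the following sketch computes a dot product of +1/-1 vectors with XNOR and a population count. It is not Espresso's code; the helper names `pack_bits` and `binary_dot` are hypothetical, and the encoding (bit set for +1, clear for -1) is an assumption made for this example.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int; bit k is set iff values[k] == +1."""
    word = 0
    for k, v in enumerate(values):
        if v == 1:
            word |= 1 << k
    return word

def binary_dot(a_word, b_word, n):
    """Dot product of two length-n +1/-1 vectors given as packed words."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask   # XNOR: 1 where the two signs match
    matches = bin(agree).count("1")     # population count
    # Matching positions contribute +1, mismatching ones contribute -1.
    return 2 * matches - n

a = [1, -1, 1, 1, -1, 1, -1, -1]
b = [1, 1, -1, 1, -1, -1, -1, 1]
packed = binary_dot(pack_bits(a), pack_bits(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

Packing also shrinks storage to one bit per value, which is the memory saving the summary above refers to; `packed` should equal `reference` for any pair of equal-length +1/-1 vectors.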