Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

@article{Umuroglu2019OptimizingBM,
  title={Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing},
  author={Yaman Umuroglu and Davide Conficconi and Lahiru Rasnayake and Thomas B. Preu{\ss}er and Magnus Sj{\"a}lander},
  journal={ACM Transactions on Reconfigurable Technology and Systems (TRETS)},
  year={2019},
  volume={12},
  pages={1--24}
}
Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different… 
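
Since the core idea is decomposing an integer matrix product into weighted binary matrix products, a minimal sketch may help. The NumPy code below is illustrative only (names and sizes are invented, and unsigned operands are assumed), not the paper's hardware design:

```python
import numpy as np

def bit_serial_matmul(A, B, a_bits, b_bits):
    """Multiply unsigned integer matrices via binary (bit-plane) matmuls.

    With A = sum_i 2**i * A_i and B = sum_j 2**j * B_j (A_i, B_j binary),
    the product A @ B equals sum over (i, j) of 2**(i+j) * (A_i @ B_j),
    so precision can be traded for runtime by changing a_bits / b_bits.
    """
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(a_bits):
        Ai = (A >> i) & 1                      # i-th bit plane of A
        for j in range(b_bits):
            Bj = (B >> j) & 1                  # j-th bit plane of B
            acc += (Ai @ Bj) << (i + j)        # binary matmul, then shift
    return acc

# Quick check against a direct product (illustrative values).
A = np.random.randint(0, 16, (4, 8))           # 4-bit operands
B = np.random.randint(0, 16, (8, 3))
assert np.array_equal(bit_serial_matmul(A, B, 4, 4), A @ B)
```

Because the loop performs a_bits * b_bits binary passes, runtime scales directly with the chosen precisions, which is the precision/performance trade-off the abstract describes.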

A Survey of Network-Based Hardware Accelerators

This review paper aims to analyze, compare, and discuss different approaches to implementing network-based hardware accelerators on FPGAs and programmable SoCs (Systems-on-Chip).

Accelerating Population Count with a Hardware Co-Processor for MicroBlaze

  • I. Skliarova
  • Computer Science
    Journal of Low Power Electronics and Applications
  • 2021
This paper proposes a Field-Programmable Gate Array (FPGA)-based hardware accelerator for assisting the embedded MicroBlaze soft-core processor in calculating population count and demonstrates that the best hardware accelerator with DMA (Direct Memory Access) is ~31 times faster than the best software version running on MicroBlaze.
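
As a point of reference for what such a co-processor offloads, here is a standard loop-free software population count (a textbook SWAR routine, not code from the paper):

```python
def popcount32(x: int) -> int:
    """Classic SWAR population count for one 32-bit word.

    This is the kind of software baseline a soft-core processor would
    otherwise run; the FPGA co-processor computes the same reduction in
    hardware across a whole buffer.
    """
    x = x - ((x >> 1) & 0x55555555)                  # 2-bit partial sums
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0F                  # 8-bit partial sums
    return ((x * 0x01010101) & 0xFFFFFFFF) >> 24     # fold bytes into top byte

assert popcount32(0b1011_0001) == 4
```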

On the RTL Implementation of FINN Matrix Vector Unit

It is shown that for smaller design parameters, RTL produces significantly smaller circuits than HLS; the gains in synthesis time, together with some design-dependent resource benefits, make the RTL abstraction an attractive alternative.

Understanding Cache Boundness of ML Operators on ARM Processors

This is the first detailed analysis of dense and convolution operators, generated with TVM, that compares them to the fundamental hardware limits of embedded ARM processors and explains the gap between computational peak performance (theoretical and measured) and real-world state-of-the-art results.
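
A roofline-style bound is the usual way to relate measured operator performance to such hardware limits. The sketch below uses invented, illustrative numbers rather than figures from the paper:

```python
def roofline_gflops(peak_gflops, bw_gbs, flops_per_byte):
    """Attainable throughput under the roofline model:
    min(compute peak, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bw_gbs * flops_per_byte)

# Illustrative (not measured) numbers for a small embedded ARM core:
peak, bandwidth = 8.0, 4.0        # GFLOP/s, GB/s
# A dense layer that streams fp32 weights once does ~2 FLOPs per
# 4-byte weight, i.e. ~0.5 FLOP/byte, so it is memory (cache) bound:
print(roofline_gflops(peak, bandwidth, 0.5))   # -> 2.0 GFLOP/s, not 8.0
```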

EPIC: An Energy-Efficient, High-Performance GPGPU Computing Research Infrastructure

EPIC is a GPGPU-enabled computing research infrastructure at NTNU that enables NTNU's researchers to perform experiments that would otherwise be impossible, as time-to-solution would simply take too long.

Pushing the Level of Abstraction of Digital System Design: A Survey on How to Program FPGAs

This work surveys three leading digital design abstractions, Hardware Description Languages (HDLs), High-Level Synthesis (HLS) tools, and Domain-Specific Languages (DSLs), and proposes a taxonomy for each abstraction trend.

Dovado: An Open-Source Design Space Exploration Framework

This work proposes Dovado, an open-source CAD tool for design space exploration (DSE) tailored to FPGA-based designs, along with an approximation model for the NSGA-II fitness function that decides whether Vivado or a Nadaraya-Watson model should estimate the optimization metrics.
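
For context, a Nadaraya-Watson estimator is just kernel-weighted averaging over previously evaluated design points. The sketch below is a generic implementation with an assumed Gaussian kernel and invented data, not Dovado's actual code:

```python
import numpy as np

def nadaraya_watson(x_query, X, y, bandwidth=1.0):
    """Nadaraya-Watson kernel regression with a Gaussian kernel.

    Estimate: yhat(x) = sum_i K(x, x_i) * y_i / sum_i K(x, x_i).
    A DSE loop can use such a model as a cheap surrogate for a slow
    tool run once enough points have been evaluated.
    """
    d2 = np.sum((X - x_query) ** 2, axis=1)       # squared distances
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))      # Gaussian kernel weights
    return np.dot(w, y) / np.sum(w)

# Illustrative use: predict a metric for a new design point.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])  # explored configurations
y = np.array([100.0, 120.0, 180.0])                 # e.g., measured LUT counts
print(nadaraya_watson(np.array([2.0, 2.0]), X, y))
```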

Towards Hardware-Specific Automatic Compression of Neural Networks

It is shown that a joint search and compression using pruning and quantization is superior to searching for policies that use a single compression method, thus providing automatic compression for neural networks.
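
To make one of the two compression primitives concrete, the snippet below sketches magnitude pruning; the function and its sparsity parameter are illustrative stand-ins for the per-layer knobs such a joint search would tune, not the paper's interface:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    `sparsity` is the kind of per-layer policy parameter a joint search
    would tune together with each layer's quantization bitwidth.
    """
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned
```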

AutoQ: Automated Kernel-Wise Neural Network Quantization

This paper proposes a hierarchical-DRL-based kernel-wise network quantization technique, AutoQ, to automatically search a quantization bitwidth (QBN) for each weight kernel and choose another QBN for each activation layer, which reduces inference latency and energy consumption while achieving the same inference accuracy.
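
Applying a searched QBN ultimately means running a bitwidth-dependent quantizer over each kernel. The fake-quantization sketch below assumes symmetric uniform quantization (my assumption, a common choice) and only shows what one such assignment does:

```python
import numpy as np

def quantize_kernel(w, qbn):
    """Fake-quantize one weight kernel to a signed `qbn`-bit grid.

    `levels` is the largest representable magnitude, e.g. 7 for 4 bits.
    Quantize-then-dequantize keeps everything in float so accuracy can
    still be evaluated, which is what a bitwidth search needs.
    """
    levels = 2 ** (qbn - 1) - 1
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / levels if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -levels, levels)
    return q * scale
```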

References

Showing 1-10 of 25 references.

BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing

BISMO is presented, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing that utilizes the excellent binary-operation performance of FPGAs to offer matrix multiplication performance that scales with the required precision and parallelism.

Generic and universal parallel matrix summation with a flexible compression goal for Xilinx FPGAs

  • Thomas B. Preußer
  • Computer Science
    2017 27th International Conference on Field Programmable Logic and Applications (FPL)
  • 2017
A generic implementation of a bit matrix compressor for Xilinx FPGAs, which does not require a generator tool, is agnostic of the aspect ratio of the input matrix, and can be used for multiplication in the same way as for single-column population count operations.
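
The compression idea itself is easy to state in software: repeatedly apply full-adder (3:2) logic bitwise until only two words remain, then perform one carry-propagate addition. The Python model below is a generic illustration of carry-save compression, not the paper's generator-free Xilinx mapping:

```python
def carry_save_sum(operands):
    """Sum many nonnegative integers via 3:2 compression (carry-save form).

    Each step replaces three operands a, b, c by a bitwise sum word and a
    carry word, using the full-adder identity a + b + c = (a^b^c) +
    2*maj(a, b, c), with no carry propagation. Hardware compressor trees
    apply the same reduction across a whole bit matrix, leaving a single
    carry-propagate adder at the end.
    """
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        s = a ^ b ^ c                                 # bitwise sum, no carries
        carry = ((a & b) | (a & c) | (b & c)) << 1    # majority, shifted up
        ops += [s, carry]
    return sum(ops)                                   # final carry-propagate add

vals = [13, 200, 7, 42, 99, 5]
assert carry_save_sum(vals) == sum(vals)
```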

Energy-efficient large-scale matrix multiplication on FPGAs

This work extends a highly optimized on-chip matrix multiplication architecture to support large matrices in external memory, presents an efficient data layout for storing the input matrices, and proposes a memory activation schedule based on a realistic DRAM model.
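
The on-chip/off-chip split such designs rely on is classic blocking. The sketch below shows only the tiling idea, with an illustrative tile size rather than anything derived from the paper's DRAM model:

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiplication.

    Tiling keeps a tile x tile working set "on chip" so each block of A
    and B is fetched from external memory once per tile of C. A real
    design would size `tile` to the available on-chip RAM.
    """
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C
```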

Pipelined compressor tree optimization using integer linear programming

  • M. Kumm, P. Zipf
  • Computer Science
    2014 24th International Conference on Field Programmable Logic and Applications (FPL)
  • 2014
This work defines pipelined compressor tree synthesis as an optimization problem and proposes a resource-optimal method using integer linear programming (ILP); two new high-efficiency GPC mappings are also proposed for Xilinx FPGAs.

Compressor tree synthesis on commercial high-performance FPGAs

The experimental results show that the use of compressor trees can reduce critical path delay by 33% and 45%, respectively, compared to adder trees synthesized on the Xilinx Virtex-5 and Altera Stratix III FPGAs.

Advanced Compressor Tree Synthesis for FPGAs

Novel methods for the optimization of compressor trees for FPGAs, as required in many arithmetic computations, are presented; these methods provide pipelined compressor trees with about 40 percent fewer LUTs than trees of 2-input adders, at the cost of being about 12-20 percent slower.

Why systolic architectures?

The basic principle of systolic architectures is reviewed, and it is explained why they should result in cost-effective, high-performance special-purpose systems for a wide range of problems.
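
The systolic principle, operands rippling through a grid of processing elements in lockstep, can be modeled in a few lines. The wavefront schedule below, for an output-stationary array, is a textbook illustration rather than anything from Kung's paper:

```python
def systolic_matmul(A, B):
    """Wavefront schedule of an output-stationary systolic array.

    PE (i, j) holds one accumulator; with skewed injection, A[i][k] and
    B[k][j] meet in that PE at cycle t = i + j + k. This software model
    simply replays that schedule cycle by cycle.
    """
    n, K, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for t in range(n + m + K - 2 + 1):          # total wavefront cycles
        for i in range(n):
            for j in range(m):
                k = t - i - j
                if 0 <= k < K:                  # operands arrive this cycle
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert systolic_matmul(A, B) == [[19, 22], [43, 50]]
```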

The Landscape of Parallel Computing Research: A View from Berkeley

The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: the overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems, and the target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.

A Survey of Techniques for Approximate Computing

A survey of techniques for approximate computing (AC) that discusses strategies for finding approximable program portions and monitoring output quality; techniques for using AC in different processing units, processor components, and memory technologies; and programming frameworks for AC.

Stripes: Bit-Serial Deep Neural Network Computing

This work presents Stripes, a hardware accelerator that uses bit-serial computations to improve energy efficiency and performance; its area and power overheads are estimated at 5 percent and 12 percent, respectively.