• Corpus ID: 220936526

High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands

@article{Gope2020HighTM,
  title={High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands},
  author={Dibakar Gope and Jesse G. Beu and Matthew Mattina},
  journal={ArXiv},
  year={2020},
  volume={abs/2008.00638}
}
Matrix multiplications between asymmetric bit-width operands, especially between 8-bit and 4-bit operands, are likely to become a fundamental kernel of many important workloads including neural networks and machine learning. While existing SIMD matrix multiplication instructions for symmetric bit-width operands can support operands of mixed precision by zero- or sign-extending the narrow operand to match the size of the other operands, they cannot exploit the benefit of the narrow bit-width of one of…
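
For context, a minimal scalar sketch (not the instructions proposed in this paper) of how mixed 8-bit/4-bit matrix multiplication is typically handled today: the 4-bit operand is widened to 8 bits and accumulated as if both operands were 8-bit, so the narrow width saves storage and bandwidth but not multiplier work. The packing convention and function name below are illustrative assumptions.

#include <stdint.h>

/* Portable sign-extension of a 4-bit value held in the low nibble. */
static inline int8_t sext4(uint8_t nibble) {
    return (nibble & 0x8) ? (int8_t)(nibble | 0xF0) : (int8_t)nibble;
}

/* C = A * B with A (M x K) in int8 and B (K x N) in 4-bit values packed
 * two per byte (an assumed row-major layout). The 4-bit operand is
 * widened before every multiply, mirroring how symmetric SIMD
 * dot-product instructions are used for mixed-precision operands. */
void matmul_s8_s4(int M, int N, int K,
                  const int8_t *A, const uint8_t *B_packed, int32_t *C) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k) {
                int e = k * N + n;                        /* element index in B */
                uint8_t byte = B_packed[e / 2];
                uint8_t nib  = (e & 1) ? (byte >> 4) : (byte & 0x0F);
                acc += (int32_t)A[m * K + k] * sext4(nib);
            }
            C[m * N + n] = acc;
        }
    }
}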

Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks

A configurable multi-directional systolic array (CMSA) is presented that increases PE utilization by up to 1.6 times; the PE unit is redesigned so that the array supports multiple data-transmission modes and dataflow strategies to speed up the computation of depthwise convolution.

Pushing the Envelope of Dynamic Spatial Gating technologies

This paper focuses on one such technique, Precision Gating (PG), which targets unimportant features in the spatial domain of the output feature map (OFM), and shows that PG loses accuracy when the MAC reduction achieved by a PG network is pushed further.
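
As a rough illustration of the gating mechanism this summary refers to (a sketch of the general idea, not the exact PG formulation or its learned thresholds): each output is first computed from the high-order bits of the activations, and only outputs whose coarse value crosses a threshold are refined with the low-order bits.

#include <stdint.h>

/* Hypothetical precision-gated dot product. The coarse pass uses only the
 * top (8 - lo_bits) activation bits; the cheap result is kept unless it
 * exceeds `threshold` (a stand-in for a learned gate), in which case a
 * correction pass spends the remaining MACs on the low-order bits. */
int32_t gated_dot(const uint8_t *x, const int8_t *w, int n,
                  int lo_bits, int32_t threshold) {
    int32_t coarse = 0;
    for (int i = 0; i < n; ++i)
        coarse += (int32_t)(x[i] >> lo_bits) * w[i];      /* MSB-only pass */

    int32_t approx = coarse * (1 << lo_bits);             /* rescale to full range */
    if (approx <= threshold)
        return approx;                                    /* gated: stop early */

    int32_t fine = 0;                                     /* LSB correction pass */
    for (int i = 0; i < n; ++i)
        fine += (int32_t)(x[i] & ((1 << lo_bits) - 1)) * w[i];
    return approx + fine;                                 /* exact result */
}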

References

SHOWING 1-10 OF 22 REFERENCES

Multi-Precision Quantized Neural Networks via Encoding Decomposition of {-1, +1}

A novel encoding scheme uses {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks, which can be efficiently implemented by bitwise operations (xnor and bitcount) to achieve model compression, computational acceleration, and resource saving.
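
For readers unfamiliar with the xnor/bitcount trick mentioned here, a minimal sketch: when weights and activations are constrained to {-1, +1} and packed one element per bit (+1 as 1, -1 as 0), a length-64 dot product collapses to two bitwise operations and a population count.

#include <stdint.h>
#include <stdio.h>

/* Dot product of two bit-packed {-1, +1} vectors of length 64:
 * positions where the bits agree contribute +1, the rest -1, so
 *   dot = agreements - disagreements = 2 * popcount(xnor(a, b)) - 64.
 * __builtin_popcountll is a GCC/Clang builtin. */
static int binary_dot64(uint64_t a, uint64_t b) {
    return 2 * __builtin_popcountll(~(a ^ b)) - 64;
}

int main(void) {
    uint64_t a = 0xF0F0F0F0F0F0F0F0ull;   /* example packed vectors */
    uint64_t b = 0xFFFF0000FFFF0000ull;
    printf("dot = %d\n", binary_dot64(a, b));
    return 0;
}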

In-datacenter performance analysis of a tensor processing unit

  • N. Jouppi, C. Young, D. Yoon
  • Computer Science
    2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)
  • 2017
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.

StrassenNets: Deep learning with a multiplication budget

It is demonstrated that the proposed framework is able to rediscover Strassen's matrix multiplication algorithm, learning to multiply 2 × 2 matrices using only 7 multiplications instead of 8, a first-of-its-kind reduction in the number of multiplications.
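
The 7-multiplication scheme referred to here is Strassen's classic identity for 2 × 2 blocks; as a reference point, the standard algorithm (not the paper's learned network) looks like this:

#include <stdio.h>

/* Strassen's 2x2 matrix product: 7 multiplications instead of 8.
 * A = [a11 a12; a21 a22], B = [b11 b12; b21 b22], C = A*B. */
void strassen2x2(const double A[2][2], const double B[2][2], double C[2][2]) {
    double m1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
    double m2 = (A[1][0] + A[1][1]) *  B[0][0];
    double m3 =  A[0][0]            * (B[0][1] - B[1][1]);
    double m4 =  A[1][1]            * (B[1][0] - B[0][0]);
    double m5 = (A[0][0] + A[0][1]) *  B[1][1];
    double m6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
    double m7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);

    C[0][0] = m1 + m4 - m5 + m7;
    C[0][1] = m3 + m5;
    C[1][0] = m2 + m4;
    C[1][1] = m1 - m2 + m3 + m6;
}

int main(void) {
    double A[2][2] = {{1, 2}, {3, 4}}, B[2][2] = {{5, 6}, {7, 8}}, C[2][2];
    strassen2x2(A, B, C);
    printf("%g %g\n%g %g\n", C[0][0], C[0][1], C[1][0], C[1][1]);  /* 19 22 / 43 50 */
    return 0;
}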

Ternary MobileNets via Per-Layer Hybrid Filter Banks

A novel quantization method is proposed that generates per-layer hybrid filter banks consisting of full-precision and ternary weight filters for MobileNets, and demonstrates the generalizability and effectiveness of hybrid filter banks on other neural network architectures.

Aggressive Compression of MobileNets Using Hybrid Ternary Layers

Problem to be solved: In a neural network with binary (-1, 1) or ternary (-1, 0, 1) weights, multiplications are replaced by additions. Multipliers consume significantly more area and energy than adders.
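
The replacement of multiplications by additions can be made concrete with a small sketch (illustrative only): with ternary weights, each product degenerates into adding, subtracting, or skipping the activation.

#include <stdint.h>

/* Dot product with ternary weights in {-1, 0, +1}: no multiplier needed,
 * every term is an add, a subtract, or a skip. */
int32_t ternary_dot(const int8_t *x, const int8_t *w, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        if (w[i] > 0)      acc += x[i];   /* weight +1 */
        else if (w[i] < 0) acc -= x[i];   /* weight -1 */
        /* weight 0: skip */
    }
    return acc;
}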

Run-Time Efficient RNN Compression for Inference on Edge Devices

A new compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) is explored that results in faster inference runtime than pruning and better accuracy than matrix factorization for compression factors of 2-4x.
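
A rough sketch of what a hybrid decomposition of a weight matrix might look like (the exact split HMD uses is not spelled out in this summary, so the form below, a dense block of rows plus a low-rank remainder, is an assumption for illustration):

/* Hypothetical hybrid matrix-vector product y = W x where the first
 * `m_dense` rows of W are stored densely and the remaining rows are
 * approximated by a rank-r factorization U ((m - m_dense) x r) * V (r x n).
 * Storage drops from m*n to m_dense*n + (m - m_dense + n)*r. */
void hybrid_matvec(int m, int n, int m_dense, int r,
                   const float *Wd,   /* m_dense x n dense block   */
                   const float *U,    /* (m - m_dense) x r         */
                   const float *V,    /* r x n                     */
                   const float *x, float *y) {
    /* Dense rows: exact. */
    for (int i = 0; i < m_dense; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) acc += Wd[i * n + j] * x[j];
        y[i] = acc;
    }
    /* Low-rank rows: t = V x, then y_lowrank = U t. */
    float t[64];                        /* assumes r <= 64 for this sketch */
    for (int k = 0; k < r; ++k) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) acc += V[k * n + j] * x[j];
        t[k] = acc;
    }
    for (int i = 0; i < m - m_dense; ++i) {
        float acc = 0.0f;
        for (int k = 0; k < r; ++k) acc += U[i * r + k] * t[k];
        y[m_dense + i] = acc;
    }
}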

Quantization Networks

This paper provides a simple and uniform way to quantize weights and activations by formulating quantization as a differentiable non-linear function, which sheds new light on the interpretation of neural network quantization.
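
One common way to make quantization differentiable, in the spirit of the formulation described above (a generic sketch, not necessarily the paper's exact parameterization), is to build a soft staircase from shifted sigmoids and anneal its temperature toward a hard step function:

#include <math.h>
#include <stdio.h>

/* Soft quantizer: a "staircase" built from (levels - 1) shifted sigmoids.
 * As temperature T grows, the function approaches hard uniform
 * quantization of x in [0, 1] to `levels` values; for small T it stays
 * smooth, so gradients can flow during training. Link with -lm. */
static double soft_quantize(double x, int levels, double T) {
    double y = 0.0;
    for (int i = 1; i < levels; ++i) {
        double step = (i - 0.5) / (levels - 1);   /* step location */
        y += 1.0 / (1.0 + exp(-T * (x - step)));  /* one sigmoid step */
    }
    return y / (levels - 1);                      /* map back to [0, 1] */
}

int main(void) {
    for (double x = 0.0; x <= 1.0; x += 0.25)
        printf("x=%.2f  soft=%.3f  hard=%.3f\n",
               x, soft_quantize(x, 4, 20.0), soft_quantize(x, 4, 1e4));
    return 0;
}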

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding, whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.

Speeding up Convolutional Neural Networks with Low Rank Expansions

Two simple schemes for drastically speeding up convolutional neural networks are presented, achieved by exploiting cross-channel or filter redundancy to construct a low-rank basis of filters that are rank-1 in the spatial domain.
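
A small sketch of the rank-1 spatial idea (illustrative, single-channel and separable, rather than the paper's full cross-channel scheme): when a k x k filter is an outer product of a column and a row vector, the 2-D filtering can be applied as a horizontal 1-D pass followed by a vertical 1-D pass, cutting the per-output cost from k*k to 2*k multiplies.

/* Separable ("rank-1 in the spatial domain") filtering, valid padding:
 * filter(k x k) = col(k) * row(k)^T, so filter rows with `row`, then
 * filter columns of the intermediate with `col`. */
void conv_rank1(const float *img, int H, int W,
                const float *row, const float *col, int k,
                float *tmp,   /* H x (W - k + 1) scratch           */
                float *out)   /* (H - k + 1) x (W - k + 1) output  */ {
    int Wo = W - k + 1, Ho = H - k + 1;
    for (int y = 0; y < H; ++y)                 /* horizontal 1-D pass */
        for (int x = 0; x < Wo; ++x) {
            float acc = 0.0f;
            for (int i = 0; i < k; ++i) acc += img[y * W + x + i] * row[i];
            tmp[y * Wo + x] = acc;
        }
    for (int y = 0; y < Ho; ++y)                /* vertical 1-D pass */
        for (int x = 0; x < Wo; ++x) {
            float acc = 0.0f;
            for (int i = 0; i < k; ++i) acc += tmp[(y + i) * Wo + x] * col[i];
            out[y * Wo + x] = acc;
        }
}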

Binary Ensemble Neural Network: More Bits per Network or More Networks per Bit?

  • Shilin Zhu, Xin Dong, Hao Su
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
The Binary Ensemble Neural Network (BENN) is proposed, which leverages ensemble methods to improve the performance of BNNs at limited efficiency cost, and can even surpass the accuracy of a full-precision floating-point network with the same architecture.