# High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands

@article{Gope2020HighTM, title={High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands}, author={Dibakar Gope and Jesse G. Beu and Matthew Mattina}, journal={ArXiv}, year={2020}, volume={abs/2008.00638} }

Matrix multiplications between asymmetric bit-width operands, especially between 8- and 4-bit operands, are likely to become a fundamental kernel of many important workloads, including neural networks and machine learning. While existing SIMD matrix multiplication instructions for symmetric bit-width operands can support operands of mixed precision by zero- or sign-extending the narrow operand to match the size of the other operands, they cannot exploit the benefit of narrow bit-width of one of…
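The baseline the abstract describes, sign-extending the narrow operand so that a symmetric multiply can be used, can be sketched in plain Python. This is only an illustration of that baseline, not the instruction the paper proposes. The packed layout (two 4-bit weights per byte, low nibble first) and the names `sign_extend_4bit`, `unpack_int4`, and `matmul_int8_int4` are assumptions made for the example.

```python
# Illustrative sketch: emulate an 8-bit x 4-bit matrix product by
# sign-extending each 4-bit weight before a conventional multiply-accumulate.

def sign_extend_4bit(nibble):
    """Interpret the low 4 bits of `nibble` as a signed value in [-8, 7]."""
    nibble &= 0xF
    return nibble - 16 if nibble >= 8 else nibble

def unpack_int4(packed_bytes):
    """Split each byte into two signed 4-bit values (assumed layout: low nibble first)."""
    values = []
    for byte in packed_bytes:
        values.append(sign_extend_4bit(byte))
        values.append(sign_extend_4bit(byte >> 4))
    return values

def matmul_int8_int4(a_rows, b_cols_packed):
    """Multiply int8 rows of A by packed int4 columns of B with wide accumulation."""
    result = []
    for row in a_rows:
        out_row = []
        for packed_col in b_cols_packed:
            col = unpack_int4(packed_col)
            out_row.append(sum(x * w for x, w in zip(row, col)))
        result.append(out_row)
    return result

# A holds 8-bit activations; each column of B packs four 4-bit weights into two bytes.
A = [[1, -2, 3, 4], [127, -128, 0, 5]]
B_packed = [bytes([0x21, 0xF7]), bytes([0x88, 0x10])]
C = matmul_int8_int4(A, B_packed)
print(C)
```

In this sketch every 4-bit weight is widened before it is multiplied, so the arithmetic does the same work as an 8-bit by 8-bit product even though the weights occupy half the storage.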

## 2 Citations

### Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks

- Computer Science
- ACM Trans. Archit. Code Optim.
- 2021

Proposes a configurable multi-directional systolic array (CMSA) that can increase the PE utilization rate by up to 1.6 times, with a redesigned PE unit that gives the array multiple data transmission modes and dataflow strategies to speed up the calculation of depthwise convolution.

### Pushing the Envelope of Dynamic Spatial Gating technologies

- Computer Science
- AIChallengeIoT@SenSys
- 2020

This paper focuses on one such technology, Precision Gating (PG), which targets unimportant features in the spatial domain of the output feature map (OFM), and shows that PG leads to a loss in accuracy when the authors push the MAC reduction achieved by a PG network.

## References


### Multi-Precision Quantized Neural Networks via Encoding Decomposition of -1 and +1

- Computer Science
- AAAI
- 2019

A novel encoding scheme of using {-1,+1} to decompose quantized neural networks (QNNs) into multi-branch binary networks, which can be efficiently implemented by bitwise operations (xnor and bitcount) to achieve model compression, computational acceleration and resource saving.
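The bitwise operations named above (xnor and bitcount) can compute a dot product of two {-1,+1} vectors. The following is a minimal sketch of that single binary dot product only, not of the multi-branch decomposition itself. The bit encoding (+1 stored as bit 1, -1 as bit 0) and the function names are assumptions made for the example.

```python
def encode_pm1(values):
    """Pack a list of -1/+1 values into an integer, one bit per element (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1,+1} vectors of length n via xnor and bitcount."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 where the two signs agree
    matches = bin(xnor).count("1")     # bitcount (popcount)
    return 2 * matches - n             # agreements contribute +1, disagreements -1

a = [1, -1, 1, 1]
b = [-1, -1, 1, -1]
result = binary_dot(encode_pm1(a), encode_pm1(b), len(a))
print(result)
```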

### In-datacenter performance analysis of a tensor processing unit

- Computer Science
- 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)
- 2017

This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015, that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.

### StrassenNets: Deep learning with a multiplication budget

- Computer Science
- ICML
- 2018

It is demonstrated that the proposed framework is able to rediscover Strassen's matrix multiplication algorithm, learning to multiply $2 \times 2$ matrices using only 7 multiplications instead of 8, a first-of-a-kind reduction in number of multiplications.
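For reference, the classical Strassen scheme for $2 \times 2$ matrices, which uses 7 scalar multiplications instead of 8, can be written out directly. The sketch below is the textbook algorithm, not the learned network from the paper, and the function name `strassen_2x2` is chosen only for the example.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen's scheme)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B

    # The seven products; everything else is addition or subtraction.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]

product = strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```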

### Ternary MobileNets via Per-Layer Hybrid Filter Banks

- Computer Science
- 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
- 2020

A novel quantization method is proposed that generates per-layer hybrid filter banks consisting of full-precision and ternary weight filters for MobileNets and demonstrates the generalizability and effectiveness of hybrid filter banks to other neural network architectures.

### Aggressive Compression of MobileNets Using Hybrid Ternary Layers

- Computer Science
- 2019

Problem to be solved: In a neural network with binary (-1, 1) or ternary (-1, 0, 1) weights, multiplications are replaced by additions. Multipliers consume significantly more area and energy than…
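The first sentence of this summary, that ternary weights turn multiplications into additions, can be illustrated with a short sketch. It shows only a single dot product under that assumption; the function name `ternary_dot` is hypothetical and is not taken from the paper.

```python
def ternary_dot(activations, weights):
    """Dot product with weights restricted to -1, 0, +1, using only add and subtract."""
    total = 0
    for x, w in zip(activations, weights):
        if w == 1:
            total += x      # +1 weight: add the activation
        elif w == -1:
            total -= x      # -1 weight: subtract the activation
        elif w != 0:
            raise ValueError("weights must be -1, 0, or +1")
        # 0 weight: the activation is skipped entirely
    return total

value = ternary_dot([3, -5, 7, 2], [1, -1, 0, -1])
print(value)
```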

### Run-Time Efficient RNN Compression for Inference on Edge Devices

- Computer Science
- 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2)
- 2019

A new compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) is explored that results in faster inference runtime than pruning and better accuracy than matrix factorization for compression factors of 2-4x.

### Quantization Networks

- Computer Science
- 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019

This paper provides a simple and uniform way to quantize weights and activations by formulating quantization as a differentiable non-linear function, which will shed new light on the interpretation of neural network quantization.

### Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

- Computer Science
- ICLR
- 2016

This work introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding, whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.

### Speeding up Convolutional Neural Networks with Low Rank Expansions

- Computer Science
- BMVC
- 2014

Two simple schemes for drastically speeding up convolutional neural networks are presented, achieved by exploiting cross-channel or filter redundancy to construct a low rank basis of filters that are rank-1 in the spatial domain.

### Binary Ensemble Neural Network: More Bits per Network or More Networks per Bit?

- Computer Science
- 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019

The Binary Ensemble Neural Network (BENN) is proposed, which leverages ensemble methods to improve the performance of BNNs with limited efficiency cost and can even surpass the accuracy of the full-precision floating-point network with the same architecture.