# A framework for practical parallel fast matrix multiplication

@article{Benson2015AFF,
  title={A framework for practical parallel fast matrix multiplication},
  author={Austin R. Benson and Grey Ballard},
  journal={Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  year={2015}
}

Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also on their shape. We develop a code generation tool to automatically implement multiple…
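To make the abstract concrete, here is a minimal sketch of the classic ⟨2,2,2⟩ Strassen scheme, the simplest of the fast algorithms the paper benchmarks: it replaces the 8 recursive block multiplies of the classical algorithm with 7, at the cost of extra additions. The `leaf` cutoff is a hypothetical tuning parameter for illustration; the paper's code generator tunes such choices per algorithm and per matrix shape.

```python
import numpy as np

def strassen_multiply(A, B, leaf=64):
    """Multiply square matrices with one-or-more levels of Strassen recursion.

    Falls back to classical multiplication below the (hypothetical) leaf
    size or when the dimension is odd.
    """
    n = A.shape[0]
    if n <= leaf or n % 2 != 0:
        return A @ B  # classical base case
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Strassen's seven products (7 multiplies instead of 8)
    M1 = strassen_multiply(A11 + A22, B11 + B22, leaf)
    M2 = strassen_multiply(A21 + A22, B11, leaf)
    M3 = strassen_multiply(A11, B12 - B22, leaf)
    M4 = strassen_multiply(A22, B21 - B11, leaf)
    M5 = strassen_multiply(A11 + A12, B22, leaf)
    M6 = strassen_multiply(A21 - A11, B11 + B12, leaf)
    M7 = strassen_multiply(A12 - A22, B21 + B22, leaf)
    # Recombine the products into the four output blocks
    C = np.empty((n, n), dtype=M1.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The paper's point is that many other bilinear schemes besides this ⟨2,2,2⟩ one exist (e.g. ⟨3,2,3⟩), and which one runs fastest in practice depends on whether the matrices are square, tall-and-skinny, or otherwise rectangular.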

## Figures and Tables from this paper

## 66 Citations

Generating Families of Practical Fast Matrix Multiplication Algorithms

- Computer Science, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2017

This study shows that Strassen-like fast matrix multiplication can be incorporated into libraries for practical use and demonstrates a performance benefit over conventional GEMM on single core and multi-core systems.

Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose

- Computer Science, ICPP
- 2021

This paper proposes a new cache-oblivious algorithm (AtA) for computing this product, built upon the classical Strassen algorithm as a subroutine, which reduces the computational cost to the time required by Strassen's algorithm.

Improving the Space-Time Efficiency of Matrix Multiplication Algorithms

- Computer Science, ICPP Workshops
- 2020

This study presents cache-oblivious parallel algorithms with sub-linear time, optimal work, space, and caching for both general matrix multiplication on a semiring and Strassen-like fast algorithms on a ring.

Improving the Numerical Stability of Fast Matrix Multiplication

- Computer Science, SIAM J. Matrix Anal. Appl.
- 2016

It is argued in this paper that the numerical sacrifice of fast algorithms, particularly for the typical use cases of practical algorithms, is not prohibitive, and ways to improve the accuracy both theoretically and empirically are explored.

Accelerating Matrix Processing with GPUs

- Computer Science, 2017 IEEE 24th Symposium on Computer Arithmetic (ARITH)
- 2017

This paper discusses how to map a variety of important matrix computations, including sparse matrix-vector multiplication (SpMV), sparse triangle solve, graph processing, and dense matrix-matrix multiplication, to GPUs.

Sparsifying the Operators of Fast Matrix Multiplication Algorithms

- Computer Science, arXiv
- 2020

Three new methods are obtained for reducing the leading coefficients of sub-cubic matrix multiplication algorithms by sparsifying an algorithm's bilinear operator, and two of them are guaranteed to produce optimal leading coefficients.

Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination

- Computer Science, Parallel Comput.
- 2016

Faster Matrix Multiplication via Sparse Decomposition

- Computer Science, SPAA
- 2019

Lower bounds matching the coefficients of several of the algorithms are obtained, proving them to be optimal, and a few new sub-cubic algorithms with leading coefficient 2, matching that of classical matrix multiplication, are obtained.

Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs

- Computer Science, Int. J. High Perform. Comput. Appl.
- 2021

For a large range of matrix sizes in the domain of interest, this work achieves at least 2/3 of the roofline performance and often substantially outperforms state-of-the-art cuBLAS results on an NVIDIA Volta GPGPU.

Cloud Matrix Machine for Julia and Implicit Parallelization for Matrix Languages

- Computer Science
- 2022

A new framework called cloud matrix machine is presented, which extends the Julia high-performance computing language to automatically parallelize matrix computations for the cloud, achieving speedups of up to a factor of 3.

## References

Showing 1–10 of 70 references

Communication-optimal parallel algorithm for Strassen's matrix multiplication

- Computer Science, SPAA '12
- 2012

A new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication is obtained, and it exhibits perfect strong scaling within the maximum possible range.

Communication-Avoiding Parallel Strassen: Implementation and performance

- Computer Science, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
- 2012

This paper models and analyzes the performance of CAPS, a new Communication-Avoiding Parallel Strassen algorithm that minimizes communication, and demonstrates significant speedups over previous algorithms both for large matrices and for small matrices on large numbers of processors.

A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers

- Computer Science, SAC '95
- 1995

Here a scalable parallel Strassen's matrix multiplication algorithm for distributed-memory, message-passing computers is presented and compared with several other parallel algorithms.

A High Performance Parallel Strassen Implementation

- Computer Science, Parallel Process. Lett.
- 1996

In this paper, we give what we believe to be the first high performance parallel implementation of Strassen's algorithm for matrix multiplication. We show how under restricted conditions, this…

The aggregation and cancellation techniques as a practical tool for faster matrix multiplication

- Computer Science, Theor. Comput. Sci.
- 2004

Strassen's Algorithm for Matrix Multiplication: Modeling, Analysis, and Implementation

- Computer Science
- 1996

This paper reports on the development of an efficient and portable implementation of Strassen's matrix multiplication algorithm for matrices of arbitrary size, designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine.

Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

- Computer Science, TOMS
- 2011

It is investigated how modern Symmetric Multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful, natural exploitation of parallelism at the function level, such as function-call parallelism, function percolation, and function software pipelining.

Implementation of Strassen's Algorithm for Matrix Multiplication

- Computer Science, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing
- 1996

The implementation is designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine, and reconfirms that Strassen's algorithm is practical for realistic-size matrices.

Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms

- Computer Science, Euro-Par
- 2011

A novel lower bound on the latency cost of 2.5D and 3D LU factorization is proved, showing that while communication can be reduced using c copies of the data, the latency must increase by a factor of c^(1/2), so that the 2D LU algorithm (c = 1) in fact minimizes latency.

Benchmarking GPUs to tune dense linear algebra

- Computer Science, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
- 2008

It is argued that modern GPUs should be viewed as multithreaded multicore vector units, that codes should exploit blocking as on vector computers, and that the heterogeneity of the system can be exploited by computing on both the GPU and the CPU.