# Strassen's Algorithm Reloaded

```bibtex
@inproceedings{Huang2016StrassensAR,
  title     = {Strassen's Algorithm Reloaded},
  author    = {Jianyu Huang and Tyler Michael Smith and Greg M. Henry and Robert A. van de Geijn},
  booktitle = {SC16: International Conference for High Performance Computing, Networking, Storage and Analysis},
  year      = {2016},
  pages     = {690--701}
}
```

We dispel “street wisdom” regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it…
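For context, the recursion the paper builds on is Strassen's classical scheme, which replaces the eight block multiplications of a 2×2 partitioned product with seven, at the cost of extra additions. A minimal sketch follows; it is not the paper's optimized implementation, which fuses these additions into the packing and microkernel stages of a BLIS-style GEMM rather than materializing the intermediate blocks:

```python
import numpy as np

def strassen(A, B, leaf=64):
    """One level of Strassen recursion per call; falls back to an
    ordinary matrix product for small blocks. Assumes square operands
    whose dimension halves evenly down to the leaf size (powers of two
    are simplest)."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of eight (Strassen, 1969).
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Reassemble the four quadrants of C from the seven products.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The extra additions are what the "street wisdom" counts against the method; the paper's contribution is showing they can be hidden inside the memory movement a high-performance GEMM already performs.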

## 37 Citations

Strassen’s Algorithm Reloaded on GPUs

- Computer Science · ACM Trans. Math. Softw.
- 2020

A performance model for NVIDIA Volta GPUs is developed to select appropriate blocking parameters and predict the performance of GEMM and Strassen; the resulting implementation achieves up to a 1.11× speedup, with a crossover point as small as 1,536, compared to cublasSgemm on an NVIDIA Tesla V100 GPU.

Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs

- Computer Science · ArXiv
- 2018

These algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from GEMM, fusing additional operations, and avoiding extra workspace to exploit intra- and inter-kernel parallelism.

Making Strassen Matrix Multiplication Safe

- Computer Science · 2018 IEEE 25th International Conference on High Performance Computing (HiPC)
- 2018

This paper presents an efficient technique to obtain rigorous error bounds for floating point computations based on an implementation of unum arithmetic and proposes a novel error-based heuristic rotation scheme for matrix quadrant rotation.

Generating Families of Practical Fast Matrix Multiplication Algorithms

- Computer Science · 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2017

This study shows that Strassen-like fast matrix multiplication can be incorporated into libraries for practical use and demonstrates a performance benefit over conventional GEMM on single core and multi-core systems.

Improved algorithms for Boolean matrix multiplication via opportunistic matrix multiplication

- Computer Science · ArXiv
- 2021

A more efficient way to use a broken matrix multiplication algorithm to solve Boolean matrix multiplication, by forming a new, larger matrix through sampling and running a single iteration of the broken algorithm on it.

Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs

- Computer Science · Int. J. High Perform. Comput. Appl.
- 2021

For a large range of matrix sizes in the domain of interest, this work achieves at least 2/3 of the roofline performance and often substantially outperforms state-of-the-art cuBLAS results on an NVIDIA Volta GPGPU.

Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework

- Computer Science · ACM Trans. Math. Softw.
- 2021

The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.

Strassen's Algorithm for Tensor Contraction

- Computer Science · SIAM J. Sci. Comput.
- 2018

This paper is believed to be the first to demonstrate how one can in practice speed up tensor contraction (TC) with Strassen's algorithm, by adopting a block-scatter-matrix format (a novel matrix-centric tensor layout) and by reducing the incurred overhead of memory movement.

Supporting mixed-datatype matrix multiplication within the BLIS framework

- Computer Science · ArXiv
- 2019

The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.

Implementing High-Performance Complex Matrix Multiplication via the 1m Method

- Computer Science
- 2020

A superior 1m method for expressing complex matrix multiplication is derived, one which addresses virtually all of the shortcomings inherent in the 4m method.
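The core idea behind the 1m method, as summarized above, is to compute one complex GEMM with a single real GEMM over specially packed operands. A toy sketch under that reading (the function name and the explicit expanded matrices are illustrative; the actual BLIS 1m method performs this interleaving inside its packing routines rather than materializing larger matrices):

```python
import numpy as np

def complex_gemm_1m(A, B):
    """Illustrative sketch: one complex matrix product computed by a
    single real GEMM over interleaved real/imaginary packings."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    # Pack A: each complex entry a becomes the 2x2 real block
    # [[Re a, -Im a], [Im a, Re a]]  ->  real matrix of shape (2m, 2k).
    Ap = np.empty((2 * m, 2 * k))
    Ap[0::2, 0::2] = A.real
    Ap[0::2, 1::2] = -A.imag
    Ap[1::2, 0::2] = A.imag
    Ap[1::2, 1::2] = A.real
    # Pack B: interleave real and imaginary rows -> shape (2k, n).
    Bp = np.empty((2 * k, n))
    Bp[0::2] = B.real
    Bp[1::2] = B.imag
    # One real GEMM; even rows of the result hold Re(C), odd rows Im(C).
    Cp = Ap @ Bp
    return Cp[0::2] + 1j * Cp[1::2]
```

This contrasts with 4m-style approaches, which assemble the complex product from four separate real GEMMs over the real and imaginary parts.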

## References

Showing 1–10 of 36 references

Implementation of Strassen's Algorithm for Matrix Multiplication

- Computer Science · Proceedings of the 1996 ACM/IEEE Conference on Supercomputing
- 1996

The implementation is designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine, and reconfirms that Strassen's algorithm is practical for realistic matrix sizes.

Communication-Avoiding Parallel Strassen: Implementation and performance

- Computer Science · 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
- 2012

This paper models and analyzes the performance of CAPS, a new Communication-Avoiding Parallel Strassen algorithm that minimizes communication, and demonstrates significant speedups over previous algorithms both for large matrices and for small matrices on large numbers of processors.

Anatomy of High-Performance Many-Threaded Matrix Multiplication

- Computer Science · 2014 IEEE 28th International Parallel and Distributed Processing Symposium
- 2014

This work describes how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM), and shows that with the advent of many-core architectures such as the IBM PowerPC A2 processor and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability.

Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm

- Computer Science · ISSAC '09
- 2009

We propose several new schedules for Strassen-Winograd's matrix multiplication algorithm; they reduce the extra memory allocation requirements by three different means: by introducing a few…

Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

- Computer Science · TOMS
- 2011

This work investigates how modern Symmetric Multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful, natural parallelism exploitation at the function level, such as function-call parallelism, function percolation, and function software pipelining.

Improving the Numerical Stability of Fast Matrix Multiplication

- Computer Science · SIAM J. Matrix Anal. Appl.
- 2016

This paper argues that the numerical accuracy sacrificed by fast algorithms, particularly in the typical use cases of practical algorithms, is not prohibitive, and explores ways to improve that accuracy both theoretically and empirically.

A High Performance Parallel Strassen Implementation

- Computer Science · Parallel Process. Lett.
- 1996

In this paper, we give what we believe to be the first high performance parallel implementation of Strassen's algorithm for matrix multiplication. We show how under restricted conditions, this…

GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm

- Computer Science · Journal of Computational Physics
- 1994

This work reconsiders Winograd's variant of Strassen's algorithm and offers a highly portable solution based on the Level 3 BLAS interface that offers some relief when huge, well-conditioned matrices are multiplied together.

BLIS: A Framework for Rapidly Instantiating BLAS Functionality

- Computer Science · ACM Trans. Math. Softw.
- 2015

Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).

High-Performance Tensor Contraction without BLAS

- Computer Science · ArXiv
- 2016

This work implements tensor contraction using the much more flexible BLIS framework, which allows for reshaping of the tensor to be fused with internal partitioning and packing operations, requiring no explicit reshaping operations or additional workspace.