Strassen's Algorithm Reloaded

@article{Huang2016StrassensAR,
  title={Strassen's Algorithm Reloaded},
  author={Jianyu Huang and Tyler Michael Smith and Greg M. Henry and Robert A. van de Geijn},
  journal={SC16: International Conference for High Performance Computing, Networking, Storage and Analysis},
  year={2016},
  pages={690-701}
}
  • Jianyu Huang, T. Smith, G. Henry, R. van de Geijn
  • Published 13 November 2016
  • Computer Science
  • SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
We dispel with “street wisdom” regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it… 
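To make the algorithm under discussion concrete: classical Strassen replaces the 8 half-size multiplications of a blocked 2×2 matrix product with 7, at the cost of extra additions and, in the conventional formulation, extra workspace. The sketch below is a minimal, self-contained C rendering of that conventional formulation (row-major storage, power-of-two sizes, an illustrative cutoff of 64). It is precisely the temporary-buffer-heavy approach whose costs the paper shows how to avoid by fusing the additions into the packing and micro-kernel stages of a BLIS-style GEMM; it is not the authors' implementation.

```c
#include <stdlib.h>

/* Fallback kernel below the crossover point: C += A*B, n x n, row-major. */
static void gemm_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int p = 0; p < n; p++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*n + p] * B[p*n + j];
}

/* Copy quadrant (qi, qj) of the n x n matrix X into a contiguous h x h buffer. */
static void quad(int n, const double *X, int qi, int qj, double *Q) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            Q[i*h + j] = X[(qi*h + i)*n + qj*h + j];
}

static void madd(int h, const double *X, const double *Y, double *Z) {
    for (int i = 0; i < h*h; i++) Z[i] = X[i] + Y[i];
}
static void msub(int h, const double *X, const double *Y, double *Z) {
    for (int i = 0; i < h*h; i++) Z[i] = X[i] - Y[i];
}

/* C += A*B by Strassen's recursion; n must be a power of two here. */
void strassen(int n, const double *A, const double *B, double *C) {
    if (n <= 64) { gemm_naive(n, A, B, C); return; }   /* illustrative cutoff */
    int h = n / 2, hh = h * h;
    /* 8 input quadrants + 7 products + 2 temporaries; calloc zeroes the
       product buffers so the recursive calls can accumulate into them. */
    double *w = calloc(17 * (size_t)hh, sizeof(double));
    double *A11 = w,         *A12 = w + hh,    *A21 = w + 2*hh, *A22 = w + 3*hh;
    double *B11 = w + 4*hh,  *B12 = w + 5*hh,  *B21 = w + 6*hh, *B22 = w + 7*hh;
    double *M1  = w + 8*hh,  *M2  = w + 9*hh,  *M3  = w + 10*hh, *M4 = w + 11*hh;
    double *M5  = w + 12*hh, *M6  = w + 13*hh, *M7  = w + 14*hh;
    double *T1  = w + 15*hh, *T2  = w + 16*hh;
    quad(n, A, 0, 0, A11); quad(n, A, 0, 1, A12);
    quad(n, A, 1, 0, A21); quad(n, A, 1, 1, A22);
    quad(n, B, 0, 0, B11); quad(n, B, 0, 1, B12);
    quad(n, B, 1, 0, B21); quad(n, B, 1, 1, B22);

    /* The seven Strassen products. */
    madd(h, A11, A22, T1); madd(h, B11, B22, T2); strassen(h, T1, T2, M1);
    madd(h, A21, A22, T1);                        strassen(h, T1, B11, M2);
    msub(h, B12, B22, T2);                        strassen(h, A11, T2, M3);
    msub(h, B21, B11, T2);                        strassen(h, A22, T2, M4);
    madd(h, A11, A12, T1);                        strassen(h, T1, B22, M5);
    msub(h, A21, A11, T1); madd(h, B11, B12, T2); strassen(h, T1, T2, M6);
    msub(h, A12, A22, T1); madd(h, B21, B22, T2); strassen(h, T1, T2, M7);

    /* Accumulate the products into the four quadrants of C. */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            int q = i*h + j;
            C[i*n + j]         += M1[q] + M4[q] - M5[q] + M7[q]; /* C11 */
            C[i*n + j + h]     += M3[q] + M5[q];                 /* C12 */
            C[(i+h)*n + j]     += M2[q] + M4[q];                 /* C21 */
            C[(i+h)*n + j + h] += M1[q] - M2[q] + M3[q] + M6[q]; /* C22 */
        }
    free(w);
}
```

At the cutoff the recursion falls back to a naive kernel; a production version would call a tuned DGEMM there instead.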

Citations

Strassen’s Algorithm Reloaded on GPUs
TLDR
A performance model for NVIDIA Volta GPUs is developed to select appropriate blocking parameters and predict performance for GEMM and Strassen; the resulting implementation achieves up to a 1.11× speedup, with a crossover point as small as 1,536, compared to cublasSgemm on an NVIDIA Tesla V100 GPU.
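A crossover point exists because one level of Strassen trades 8 half-size multiplications for 7 at the cost of 18 half-size additions; a standard back-of-the-envelope flop count (not this paper's GPU-specific performance model) shows why small matrices do not benefit:

```latex
F_{\text{classical}}(n) = 2n^3, \qquad
F_{\text{one-level Strassen}}(n)
  = 7 \cdot 2\left(\tfrac{n}{2}\right)^3 + 18\left(\tfrac{n}{2}\right)^2
  = \tfrac{7}{4}n^3 + \tfrac{9}{2}n^2 .
```

The multiplication term shrinks by 7/8, but the addition term (and, on real hardware, its memory traffic) dominates until n is large, hence a finite crossover such as the 1,536 reported here.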
Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs
TLDR
These algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from GEMM, fusing additional operations, and avoiding extra workspace to exploit intra- and inter-kernel parallelism.
Making Strassen Matrix Multiplication Safe
TLDR
This paper presents an efficient technique for obtaining rigorous error bounds for floating-point computations, based on an implementation of unum arithmetic, and proposes a novel error-based heuristic scheme for matrix quadrant rotation.
Generating Families of Practical Fast Matrix Multiplication Algorithms
TLDR
This study shows that Strassen-like fast matrix multiplication can be incorporated into libraries for practical use and demonstrates a performance benefit over conventional GEMM on single core and multi-core systems.
Improved algorithms for Boolean matrix multiplication via opportunistic matrix multiplication
TLDR
A more efficient way to use the broken matrix multiplication algorithm to solve Boolean matrix multiplication: form a new, larger matrix by sampling and run a single iteration of the broken algorithm on it.
Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs
TLDR
For a large range of matrix sizes in the domain of interest, this work achieves at least 2/3 of the roofline performance and often substantially outperforms state-of-the-art cuBLAS results on an NVIDIA Volta GPGPU.
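The roofline bound referred to here is the standard model (not specific to this paper): attainable performance is capped by the lesser of peak compute and memory bandwidth times arithmetic intensity,

```latex
P \;\le\; \min\bigl(P_{\text{peak}},\; b_{\text{mem}} \cdot I\bigr),
\qquad I = \frac{\#\text{flops}}{\#\text{bytes moved}} ,
```

and tall-and-skinny multiplies have low intensity I, so the bandwidth term usually binds.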
Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework
TLDR
The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
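The count of 128 follows from simple enumeration, assuming each of A, B, and C may independently be stored in any of the four standard BLAS datatypes (real/complex × single/double precision) while the computation itself is performed in one of two precisions:

```latex
\underbrace{4}_{A} \times \underbrace{4}_{B} \times \underbrace{4}_{C}
\times \underbrace{2}_{\text{computation precision}} = 128 .
```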
Strassen's Algorithm for Tensor Contraction
TLDR
This paper is believed to be the first to demonstrate how one can, in practice, speed up tensor contraction (TC) with Strassen's algorithm, by adopting a block-scatter-matrix format, a novel matrix-centric tensor layout, and by reducing the overhead of the memory movement that is incurred.
Supporting mixed-datatype matrix multiplication within the BLIS framework
TLDR
The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
Implementing High-Performance Complex Matrix Multiplication via the 1m Method
TLDR
A superior 1m method for expressing complex matrix multiplication is derived, one which addresses virtually all of the shortcomings inherent in 4m.
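For context, the 4m method that 1m improves upon casts one complex multiplication as four real ones via the textbook identity

```latex
(A_r + iA_i)(B_r + iB_i)
  = \underbrace{(A_r B_r - A_i B_i)}_{C_r}
  \;+\; i\,\underbrace{(A_r B_i + A_i B_r)}_{C_i},
```

i.e., four real GEMM calls plus the bookkeeping to combine them; 1m instead reorganizes the packed operands so that a single real GEMM carries out the complex arithmetic.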

References

Showing 1-10 of 36 references
Implementation of Strassen's Algorithm for Matrix Multiplication
TLDR
The implementation is designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine, and reconfirms that Strassen's algorithm is practical for matrices of realistic size.
Communication-Avoiding Parallel Strassen: Implementation and performance
TLDR
This paper models and analyzes the performance of CAPS, a new Communication-Avoiding Parallel Strassen algorithm that minimizes communication, and demonstrates significant speedups over previous algorithms, both for large matrices and for small matrices on large numbers of processors.
Anatomy of High-Performance Many-Threaded Matrix Multiplication
TLDR
This work describes how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM), and shows that with the advent of many-core architectures such as the IBM PowerPC A2 processor and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability.
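For readers unfamiliar with the "GotoBLAS approach" this entry refers to, the sketch below is a simplified, single-threaded C rendering of its five-loop structure: cache blocking with packed panels of A and B surrounding a tiny register-blocked micro-kernel (here a scalar stand-in for what BLIS implements in assembly). The block sizes and the requirement that dimensions divide evenly are illustrative simplifications, not BLIS's actual values or interface.

```c
enum { NC = 256, KC = 128, MC = 64, NR = 4, MR = 4 };  /* illustrative blocks */

/* Micro-kernel stand-in: C[MR x NR] += Ap[MR x kc] * Bp[kc x NR], where Ap is
 * packed MR rows at a time and Bp NR columns at a time. In BLIS this is the
 * one hand-optimized assembly kernel. */
static void microkernel(int kc, const double *Ap, const double *Bp,
                        double *C, int ldc) {
    for (int p = 0; p < kc; p++)
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                C[i*ldc + j] += Ap[p*MR + i] * Bp[p*NR + j];
}

/* C += A*B, row-major; m, n, k are assumed multiples of the block sizes to
 * keep the sketch free of edge-case code. */
void gemm_blocked(int m, int n, int k,
                  const double *A, const double *B, double *C) {
    static double Ap[MC*KC], Bp[KC*NC];  /* packing buffers; fine single-threaded */
    for (int jc = 0; jc < n; jc += NC)           /* loop 5: NC columns of C    */
      for (int pc = 0; pc < k; pc += KC) {       /* loop 4: KC slice of k      */
        /* Pack the KC x NC panel of B into Bp, NR columns at a time. */
        for (int jr = 0; jr < NC; jr += NR)
          for (int p = 0; p < KC; p++)
            for (int j = 0; j < NR; j++)
              Bp[jr*KC + p*NR + j] = B[(pc+p)*n + jc+jr+j];
        for (int ic = 0; ic < m; ic += MC) {     /* loop 3: MC rows of C       */
          /* Pack the MC x KC block of A into Ap, MR rows at a time. */
          for (int ir = 0; ir < MC; ir += MR)
            for (int p = 0; p < KC; p++)
              for (int i = 0; i < MR; i++)
                Ap[ir*KC + p*MR + i] = A[(ic+ir+i)*k + pc+p];
          for (int jr = 0; jr < NC; jr += NR)    /* loop 2: NR columns         */
            for (int ir = 0; ir < MC; ir += MR)  /* loop 1: MR rows            */
              microkernel(KC, &Ap[ir*KC], &Bp[jr*KC],
                          &C[(ic+ir)*n + jc+jr], n);
        }
      }
}
```

The Strassen Reloaded paper's key insight is that the operand additions of Strassen can be folded into exactly these packing routines and the micro-kernel, which is why no extra workspace is needed.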
Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm
We propose several new schedules for Strassen-Winograd's matrix multiplication algorithm; they reduce the extra memory allocation requirements by three different means: by introducing a few…
Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation
TLDR
It is investigated how modern Symmetric Multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful, natural parallelism exploitation at the function level, such as function-call parallelism, function percolation, and function software pipelining.
Improving the Numerical Stability of Fast Matrix Multiplication
TLDR
It is argued in this paper that the numerical sacrifice of fast algorithms, particularly for the typical use cases of practical algorithms, is not prohibitive, and ways to improve the accuracy both theoretically and empirically are explored.
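The numerical trade-off at issue can be stated with the classical bounds (due to Higham; quoted here as context, not as results of this paper): conventional GEMM satisfies a componentwise bound, while Strassen-like algorithms satisfy only a normwise bound with faster error growth,

```latex
|\hat{C} - C| \le n\,u\,|A|\,|B| + O(u^2)
\qquad\text{vs.}\qquad
\|\hat{C} - C\| \le c\, n^{\log_2 12}\, u\, \|A\|\,\|B\| + O(u^2),
```

where u is the unit roundoff and \(\log_2 12 \approx 3.585\); the paper's point is that this growth is tolerable in typical use and can be mitigated further.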
A High Performance Parallel Strassen Implementation
In this paper, we give what we believe to be the first high-performance parallel implementation of Strassen's algorithm for matrix multiplication. We show how, under restricted conditions, this…
GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm
TLDR
This work reconsiders Winograd's variant of Strassen's algorithm and offers a highly portable solution based on the Level 3 BLAS interface, providing some relief when huge, well-conditioned matrices are multiplied together.
BLIS: A Framework for Rapidly Instantiating BLAS Functionality
TLDR
Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
High-Performance Tensor Contraction without BLAS
TLDR
This work implements tensor contraction using the much more flexible BLIS framework, which allows for reshaping of the tensor to be fused with internal partitioning and packing operations, requiring no explicit reshaping operations or additional workspace.