A framework for practical parallel fast matrix multiplication

@inproceedings{benson2015framework,
  title={A framework for practical parallel fast matrix multiplication},
  author={Austin R. Benson and Grey Ballard},
  booktitle={Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
}
  • Austin R. Benson, Grey Ballard
  • Published 9 September 2014
  • Computer Science
  • Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also on their shape. We develop a code generation tool to automatically implement multiple…
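For context, the prototypical "fast" algorithm the paper builds on is Strassen's recursion, which trades 8 half-size products for 7 at the cost of extra additions. The sketch below is an illustrative textbook implementation, not the authors' generated code; the `cutoff` parameter is an assumed stand-in for a tuned base-case size.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices whose order is a power of two.

    Recurses with Strassen's 7 products; at or below `cutoff`
    (a stand-in for a tuned base-case size) it falls back to BLAS.
    """
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The 7 recursive products (vs. 8 for the classical algorithm).
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # Recombine the products into the four quadrants of C.
    C = np.empty((n, n), dtype=M1.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

In practice the cutoff is chosen so the recursion stops while the base case is still large enough for an efficient vendor GEMM call, which is one of the tuning decisions the paper's framework addresses.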


Generating Families of Practical Fast Matrix Multiplication Algorithms
This study shows that Strassen-like fast matrix multiplication can be incorporated into libraries for practical use and demonstrates a performance benefit over conventional GEMM on single core and multi-core systems.
Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose
This paper proposes a new cache-oblivious algorithm (AtA) for computing the product of a matrix by its transpose, using the classical Strassen algorithm as a subroutine, which decreases the computational cost to the time required by Strassen's algorithm.
Improving the Space-Time Efficiency of Matrix Multiplication Algorithms
This study gives sublinear-time, work-, space-, and cache-optimal cache-oblivious parallel algorithms for both general matrix multiplication on a semiring and Strassen-like fast algorithms on a ring.
Improving the Numerical Stability of Fast Matrix Multiplication
It is argued in this paper that the numerical sacrifice of fast algorithms, particularly for the typical use cases of practical algorithms, is not prohibitive, and ways to improve the accuracy both theoretically and empirically are explored.
Accelerating Matrix Processing with GPUs
This paper discusses how to map a variety of important matrix computations, including sparse matrix-vector multiplication (SpMV), sparse triangle solve, graph processing, and dense matrix-matrix multiplication, to GPUs.
Sparsifying the Operators of Fast Matrix Multiplication Algorithms
Three new methods for reducing the leading coefficients of sub-cubic matrix multiplication algorithms by sparsifying an algorithm's bilinear operator are obtained; two of them are guaranteed to produce optimal leading coefficients.
Faster Matrix Multiplication via Sparse Decomposition
Lower bounds matching the leading coefficients of several of these algorithms are obtained, proving them optimal, along with a few new sub-cubic algorithms whose leading coefficient of 2 matches that of classical matrix multiplication.
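To make the role of the leading coefficient concrete, a small sketch (assuming n is a power of two and the recursion runs all the way down to 1×1 base cases) compares the classical flop count, 2n³ − n², with the Strassen recurrence M(n) = 7M(n/2) + 18(n/2)², whose solution has leading coefficient 7 rather than the classical 2:

```python
def classical_flops(n):
    # n^3 multiplications plus n^2 (n - 1) additions: 2 n^3 - n^2 total.
    return 2 * n**3 - n**2

def strassen_flops(n):
    # Full recursion to 1x1 base cases: M(n) = 7 M(n/2) + 18 (n/2)^2,
    # M(1) = 1 -- 7 half-size products plus 18 half-size matrix additions.
    if n == 1:
        return 1
    return 7 * strassen_flops(n // 2) + 18 * (n // 2) ** 2
```

Under this full-recursion model, Strassen's larger leading coefficient means it only wins on flop count somewhere between n = 512 and n = 1024, which is why practical implementations (and the coefficient-reduction work summarized above) matter: they cut over to a classical base case or shrink the constant.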
Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs
For a large range of matrix sizes in the domain of interest, this work achieves at least 2/3 of the roofline performance and often substantially outperforms state-of-the-art cuBLAS results on an NVIDIA Volta GPGPU.
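The roofline bound referenced here caps attainable throughput at the lesser of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch follows; the numeric values in the comments are illustrative placeholders, not measured Volta specifications:

```python
def roofline_gflops(peak_gflops, bw_gb_s, flops_per_byte):
    # Attainable performance = min(compute ceiling, memory ceiling),
    # where the memory ceiling is bandwidth times arithmetic intensity.
    return min(peak_gflops, bw_gb_s * flops_per_byte)

# With placeholder figures of 7000 GF/s peak and 900 GB/s bandwidth:
# a kernel at 2 flops/byte is bandwidth-bound (900 * 2 = 1800 GF/s),
# while one at 100 flops/byte is compute-bound (capped at 7000 GF/s).
```

Tall & skinny GEMMs sit at low arithmetic intensity, so their roofline ceiling is set by memory bandwidth rather than peak flops, which is what makes the 2/3-of-roofline result meaningful for that shape.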
Cloud Matrix Machine for Julia and Implicit Parallelization for Matrix Languages
A new framework called cloud matrix machine is presented, which extends the Julia high-performance compute language to automatically parallelize matrix computations for the cloud, achieving speedups of up to a factor of 3.


Communication-optimal parallel algorithm for strassen's matrix multiplication
A new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication is obtained, and it exhibits perfect strong scaling within the maximum possible range.
Communication-Avoiding Parallel Strassen: Implementation and performance
This paper models and analyzes the performance of CAPS, a new Communication-Avoiding Parallel Strassen algorithm that minimizes communication, and demonstrates significant speedups over previous algorithms both for large matrices and for small matrices on large numbers of processors.
A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers
A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory, message-passing computers is presented and compared with several other parallel algorithms.
A High Performance Parallel Strassen Implementation
In this paper, we give what we believe to be the first high performance parallel implementation of Strassen's algorithm for matrix multiplication. We show how, under restricted conditions, this…
The aggregation and cancellation techniques as a practical tool for faster matrix multiplication
Strassen's Algorithm for Matrix Multiplication: Modeling, Analysis, and Implementation
This paper reports on the development of an efficient and portable implementation of Strassen's matrix multiplication algorithm for matrices of arbitrary size, designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine.
Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation
This work investigates how modern Symmetric Multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful, natural parallelism exploitation at the function level, such as function-call parallelism, function percolation, and function software pipelining.
Implementation of Strassen's Algorithm for Matrix Multiplication
The implementation is designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine, and reconfirms that Strassen's algorithm is practical for realistic size matrices.
Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms
A novel lower bound on the latency cost of 2.5D and 3D LU factorization is proved, showing that while keeping c copies of the data can reduce the bandwidth cost, the latency must increase by a factor of c^{1/2}, so that the 2D LU algorithm (c = 1) in fact minimizes latency.
Benchmarking GPUs to tune dense linear algebra
  • V. Volkov, J. Demmel
  • Computer Science
    2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2008
It is argued that modern GPUs should be viewed as multithreaded multicore vector units; blocking is exploited similarly to vector computers, and the heterogeneity of the system is exploited by computing on both GPU and CPU.