Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods

@article{Zee2017ImplementingHC,
  title={Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods},
  author={Field G. Van Zee and Tyler Michael Smith},
  journal={ACM Transactions on Mathematical Software (TOMS)},
  year={2017},
  volume={44},
  pages={1--36}
}
  • F. G. Van Zee, T. M. Smith
  • Published 24 July 2017
  • Computer Science
  • ACM Transactions on Mathematical Software (TOMS)
In this article, we explore the implementation of complex matrix multiplication. We begin by briefly identifying various challenges associated with the conventional approach, which calls for a carefully written kernel that implements complex arithmetic at the lowest possible level (i.e., assembly language). We then set out to develop a method of complex matrix multiplication that avoids the need for complex kernels altogether. This constraint promotes code reuse and portability within libraries… 
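For orientation, the two decompositions named in the title reduce complex matrix multiplication to real matrix multiplications. Below is a minimal sketch of both, assuming a hypothetical split real/imaginary storage format and a naive stand-in for a real GEMM; the paper's actual contribution induces these algorithms inside the packing and kernel layers of BLIS, not at this outer level, and all function names here are illustrative.

```c
#include <stdlib.h>

/* Naive real matrix multiply, C += A*B (row-major, m x k by k x n).
   A stand-in for a high-performance real GEMM. */
static void rgemm(int m, int n, int k,
                  const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*k + p] * B[p*n + j];
}

/* 4m: complex C += A*B via four real GEMMs on split real/imag parts:
     Cr += Ar*Br - Ai*Bi,   Ci += Ar*Bi + Ai*Br. */
static void cgemm_4m(int m, int n, int k,
                     const double *Ar, const double *Ai,
                     const double *Br, const double *Bi,
                     double *Cr, double *Ci)
{
    double *T = calloc((size_t)m * n, sizeof *T);
    rgemm(m, n, k, Ar, Br, Cr);                   /* Cr += Ar*Br */
    rgemm(m, n, k, Ai, Bi, T);                    /* T   = Ai*Bi */
    for (int i = 0; i < m * n; i++) Cr[i] -= T[i];
    rgemm(m, n, k, Ar, Bi, Ci);                   /* Ci += Ar*Bi */
    rgemm(m, n, k, Ai, Br, Ci);                   /* Ci += Ai*Br */
    free(T);
}

/* 3m: one fewer real GEMM at the cost of extra additions:
     P1 = Ar*Br, P2 = Ai*Bi, P3 = (Ar+Ai)*(Br+Bi);
     Cr += P1 - P2,  Ci += P3 - P1 - P2. */
static void cgemm_3m(int m, int n, int k,
                     const double *Ar, const double *Ai,
                     const double *Br, const double *Bi,
                     double *Cr, double *Ci)
{
    double *Sa = malloc((size_t)m * k * sizeof *Sa);
    double *Sb = malloc((size_t)k * n * sizeof *Sb);
    double *P1 = calloc((size_t)m * n, sizeof *P1);
    double *P2 = calloc((size_t)m * n, sizeof *P2);
    double *P3 = calloc((size_t)m * n, sizeof *P3);
    for (int i = 0; i < m * k; i++) Sa[i] = Ar[i] + Ai[i];
    for (int i = 0; i < k * n; i++) Sb[i] = Br[i] + Bi[i];
    rgemm(m, n, k, Ar, Br, P1);
    rgemm(m, n, k, Ai, Bi, P2);
    rgemm(m, n, k, Sa, Sb, P3);
    for (int i = 0; i < m * n; i++) {
        Cr[i] += P1[i] - P2[i];
        Ci[i] += P3[i] - P1[i] - P2[i];
    }
    free(Sa); free(Sb); free(P1); free(P2); free(P3);
}
```

Note that 3m trades one real multiplication for several additions, which is why its numerical behavior receives separate attention in the paper.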
Implementing High-Performance Complex Matrix Multiplication via the 1m Method
TLDR
A superior 1m method for expressing complex matrix multiplication is derived, one which addresses virtually all of the shortcomings inherent in 4m.
Inducing Complex Matrix Multiplication via the 1m Method (FLAME Working Note #85)
TLDR
A superior 1m method for expressing complex matrix multiplication is derived, one which addresses virtually all of the shortcomings inherent in 4m and is actually a special case of a larger family of algorithms based on a 2m method, which is generally well-suited for storage formats that store real and imaginary parts in separate matrices.
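The identity underlying the 1m method is already visible at the scalar level: a complex number a = ar + i·ai acts on the pair (br, bi) exactly as the 2×2 real matrix [[ar, -ai], [ai, ar]] acts on a real 2-vector, so a single real GEMM on suitably packed operands can induce a complex product. A toy check of that identity follows; this is not BLIS's actual 1e/1r packing code, and rgemm is an illustrative stand-in.

```c
#include <stdio.h>

/* Naive real GEMM: C += A*B, row-major, m x k by k x n. */
static void rgemm(int m, int n, int k,
                  const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*k + p] * B[p*n + j];
}

int main(void)
{
    /* One complex scalar product as a 2x2-times-2x1 real product:
       a = 1+2i, b = 3+4i, so a*b = -5+10i. */
    double A[4] = { 1.0, -2.0,    /* [ ar -ai ] */
                    2.0,  1.0 };  /* [ ai  ar ] */
    double B[2] = { 3.0, 4.0 };   /* [ br ; bi ] */
    double C[2] = { 0.0, 0.0 };
    rgemm(2, 1, 2, A, B, C);
    printf("a*b = %g + %gi\n", C[0], C[1]); /* prints -5 + 10i */
    return 0;
}
```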
Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework
TLDR
The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
Mixed data layout kernels for vectorized complex arithmetic
TLDR
This work demonstrates that performance improvements of up to 2× can be attained with a mixed format within the computational routines and describes how existing algorithms can be easily modified to implement the mixed-format complex layout.
Maintaining High Performance Across All Problem Sizes and Parallel Scales Using Microkernel-based Linear Algebra
TLDR
A new approach uses a microkernel framework provided by ATLAS to improve the performance of several linear algebra routines across all problem sizes, overcoming the shortcomings of conventional approaches.
Supporting mixed-datatype matrix multiplication within the BLIS framework
TLDR
The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
Implementing Strassen's Algorithm with BLIS (FLAME Working Note #79)
TLDR
The practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM) requires no workspace beyond buffers already incorporated into conventional high-performance DGEMM implementations and can be plug-compatible with the standard DGEMM interface.
Learning from Optimizing Matrix-Matrix Multiplication
TLDR
A carefully designed and scaffolded set of exercises leads the learner from a naive implementation towards one that extracts parallelism at multiple levels, ranging from instruction level parallelism to multithreaded parallelism via OpenMP to distributed memory parallelism using MPI.
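A minimal sketch of the OpenMP rung of that progression, assuming row-major storage; the loop ordering is illustrative, not the exercise set's tuned version.

```c
/* Multithreading a naive C += A*B over rows of C with OpenMP.
   Rows of C are independent, so no synchronization is needed.
   Compile with e.g. -fopenmp. */
void gemm_omp(int m, int n, int k,
              const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*k + p] * B[p*n + j];
}
```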
On the Efficacy and High-Performance Implementation of Quaternion Matrix Multiplication
TLDR
An optimized software implementation of quaternion matrix multiplication is presented and shown to outperform a vendor-tuned implementation of the analogous complex matrix operation.
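For context, the scalar arithmetic that any quaternion matrix product must induce is the Hamilton product, which costs 16 real multiplications when done naively. A plain rendering of that arithmetic, not the paper's optimized implementation:

```c
#include <stdio.h>

typedef struct { double w, x, y, z; } quat; /* w + x*i + y*j + z*k */

/* Hamilton product r = a*b (noncommutative: a*b != b*a in general). */
static quat qmul(quat a, quat b)
{
    quat r;
    r.w = a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z;
    r.x = a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y;
    r.y = a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x;
    r.z = a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w;
    return r;
}

int main(void)
{
    quat i = {0, 1, 0, 0}, j = {0, 0, 1, 0};
    quat k = qmul(i, j);                         /* i*j = k */
    printf("%g %g %g %g\n", k.w, k.x, k.y, k.z); /* 0 0 0 1 */
    return 0;
}
```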
...

References

SHOWING 1-10 OF 27 REFERENCES
Anatomy of High-Performance Many-Threaded Matrix Multiplication
TLDR
This work describes how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM), and shows that with the advent of many-core architectures such as the IBM PowerPC A2 processor and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability.
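A hedged structural sketch of what "within and around the inner kernel" means in this layered design: three cache-blocking loops wrapped around register-blocking loops that call a small microkernel. Real implementations additionally pack A and B into contiguous buffers and write the microkernel in assembly; the block sizes and names below are illustrative only.

```c
/* Simplified GotoBLAS/BLIS layering. Row-major, C += A*B; for
   brevity, m and n are assumed to be multiples of MR and NR, and
   packing, edge cases, and vectorization are omitted. */
enum { NC = 256, KC = 128, MC = 64, MR = 4, NR = 4 };

static int imin(int a, int b) { return a < b ? a : b; }

/* Innermost piece: accumulate an MR x NR tile of C over kc steps. */
static void microkernel(int kc,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc)
{
    double acc[MR][NR] = {{0.0}};   /* "register" tile */
    for (int p = 0; p < kc; p++)
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                acc[i][j] += A[i*lda + p] * B[p*ldb + j];
    for (int i = 0; i < MR; i++)
        for (int j = 0; j < NR; j++)
            C[i*ldc + j] += acc[i][j];
}

void gemm_blocked(int m, int n, int k,
                  const double *A, const double *B, double *C)
{
    for (int jc = 0; jc < n; jc += NC)           /* NC columns of B, C */
      for (int pc = 0; pc < k; pc += KC)         /* KC-deep panel      */
        for (int ic = 0; ic < m; ic += MC)       /* MC rows of A, C    */
          for (int jr = jc; jr < imin(jc + NC, n); jr += NR)
            for (int ir = ic; ir < imin(ic + MC, m); ir += MR)
              microkernel(imin(KC, k - pc),
                          &A[ir*k + pc], k,
                          &B[pc*n + jr], n,
                          &C[ir*n + jr], n);
}
```

In the Goto/BLIS design, each of these loop levels is sized to a level of the memory hierarchy (L3, L2, L1, registers), which is what makes parallelizing both within and around the inner kernel natural.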
BLIS: A Framework for Rapidly Instantiating BLAS Functionality
TLDR
Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
Exploiting fast matrix multiplication within the level 3 BLAS
TLDR
Algorithms for the BLAS3 operations that are asymptotically faster than the conventional ones are described, based on Strassen's method for fast matrix multiplication, which is now recognized to be a practically useful technique once matrix dimensions exceed about 100.
The BLIS Framework
TLDR
It is shown how, with very little effort, the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS, and commercial vendor implementations such as AMD's ACML, IBM's ESSL, and Intel's MKL libraries.
Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures
TLDR
This extended abstract revisits the computation of the inverse of a symmetric positive definite matrix and demonstrates that, for some variants, nontrivial compiler techniques must then be applied to further increase the parallelism of the application.
Automatically Tuned Linear Algebra Software
TLDR
An approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units, demonstrated using the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS).
Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor
TLDR
A variety of methods were employed to avoid L1 cache misses in single-threaded optimization, including cache and register blocking, the Loongson 3A's 128-bit memory-access extension instructions, software prefetching, and single-precision floating-point SIMD instructions.
Stability of block algorithms with fast level-3 BLAS
TLDR
The numerical stability of the block algorithms in the new linear algebra program library LAPACK is investigated, and it is shown that these algorithms have backward error analyses in which the backward error bounds are commensurate with the error bounds for the underlying level-3 BLAS (BLAS3).
Gaussian elimination is not optimal
Below we will give an algorithm which computes the coefficients of the product of two square matrices A and B of order n from the coefficients of A and B with less than 4.7 · n^(log2 7) arithmetical operations…
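A concrete rendering of the scheme behind that bound: one level of Strassen's construction forms a 2×2 product from seven multiplications instead of eight, and applying it recursively to matrix blocks yields the O(n^(log2 7)) ≈ O(n^2.807) complexity. A scalar sketch of one recursion level:

```c
#include <stdio.h>

/* One level of Strassen: the 2x2 product from 7 multiplications. */
static void strassen2x2(const double A[2][2], const double B[2][2],
                        double C[2][2])
{
    double M1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
    double M2 = (A[1][0] + A[1][1]) *  B[0][0];
    double M3 =  A[0][0]            * (B[0][1] - B[1][1]);
    double M4 =  A[1][1]            * (B[1][0] - B[0][0]);
    double M5 = (A[0][0] + A[0][1]) *  B[1][1];
    double M6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
    double M7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
    C[0][0] = M1 + M4 - M5 + M7;
    C[0][1] = M3 + M5;
    C[1][0] = M2 + M4;
    C[1][1] = M1 - M2 + M3 + M6;
}

int main(void)
{
    double A[2][2] = {{1, 2}, {3, 4}};
    double B[2][2] = {{5, 6}, {7, 8}};
    double C[2][2];
    strassen2x2(A, B, C);
    /* Classical result for comparison: {{19, 22}, {43, 50}} */
    printf("%g %g / %g %g\n", C[0][0], C[0][1], C[1][0], C[1][1]);
    return 0;
}
```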
Anatomy of high-performance matrix multiplication
We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified…
...