# Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods

@article{Zee2017ImplementingHC, title={Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods}, author={Field G. Van Zee and Tyler Michael Smith}, journal={ACM Transactions on Mathematical Software (TOMS)}, year={2017}, volume={44}, pages={1 - 36} }

In this article, we explore the implementation of complex matrix multiplication. We begin by briefly identifying various challenges associated with the conventional approach, which calls for a carefully written kernel that implements complex arithmetic at the lowest possible level (i.e., assembly language). We then set out to develop a method of complex matrix multiplication that avoids the need for complex kernels altogether. This constraint promotes code reuse and portability within libraries…
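
The 4m and 3m formulations discussed in the paper can be sketched directly in numpy (a minimal illustration with variable names of our own choosing, not the paper's BLIS-level implementation): 4m expresses each complex product via four real matrix multiplications, while 3m trades one multiplication for extra additions using a Gauss/Karatsuba-style identity.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 5, 3
A = rng.standard_normal((m, k)) + 1j * rng.standard_normal((m, k))
B = rng.standard_normal((k, n)) + 1j * rng.standard_normal((k, n))
Ar, Ai, Br, Bi = A.real, A.imag, B.real, B.imag

# 4m: four real matrix multiplications per complex multiplication.
Cr4 = Ar @ Br - Ai @ Bi
Ci4 = Ar @ Bi + Ai @ Br
C4 = Cr4 + 1j * Ci4

# 3m: three real multiplications (Gauss/Karatsuba-style), at some
# cost in numerical stability for the imaginary part.
T1 = Ar @ Br
T2 = Ai @ Bi
T3 = (Ar + Ai) @ (Br + Bi)
C3 = (T1 - T2) + 1j * (T3 - T1 - T2)

assert np.allclose(C4, A @ B)
assert np.allclose(C3, A @ B)
```

Both variants reuse existing real-domain matrix multiplication, which is the code-reuse benefit the abstract alludes to.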

## 17 Citations

Implementing High-Performance Complex Matrix Multiplication via the 1m Method

- Computer Science
- 2020

A superior 1m method for expressing complex matrix multiplication is derived, one which addresses virtually all of the shortcomings inherent in 4m.

Implementing high-performance complex matrix multiplication via the 1m method

- Computer Science
- 2017

A superior 1m method for expressing complex matrix multiplication is derived, one which addresses virtually all of the shortcomings inherent in 4m.

Inducing complex matrix multiplication via the 1m method, FLAME Working Note #85

- Computer Science
- 2017

A superior 1m method for expressing complex matrix multiplication is derived, one which addresses virtually all of the shortcomings inherent in 4m and is actually a special case of a larger family of algorithms based on a 2m method, which is generally well-suited for storage formats that separate real and imaginary parts into separate matrices.
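
The core idea behind the cited 1m method can be illustrated in numpy (an assumed sketch of the packing trick, not the paper's actual BLIS packing code): each complex scalar of one operand is packed as the 2×2 real block [[ar, -ai], [ai, ar]], and the other operand as interleaved real/imaginary rows, so that a single real matrix multiplication yields the complex result.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, n = 3, 4, 2
A = rng.standard_normal((m, k)) + 1j * rng.standard_normal((m, k))
B = rng.standard_normal((k, n)) + 1j * rng.standard_normal((k, n))

# Pack A: each complex scalar a becomes the 2x2 real block [[ar, -ai], [ai, ar]].
Ap = np.zeros((2 * m, 2 * k))
Ap[0::2, 0::2] = A.real
Ap[0::2, 1::2] = -A.imag
Ap[1::2, 0::2] = A.imag
Ap[1::2, 1::2] = A.real

# Pack B: each complex scalar b becomes the real column pair [br; bi].
Bp = np.zeros((2 * k, n))
Bp[0::2, :] = B.real
Bp[1::2, :] = B.imag

# One real matrix multiplication produces C with real and imaginary
# parts interleaved by row.
Cp = Ap @ Bp
C = Cp[0::2, :] + 1j * Cp[1::2, :]

assert np.allclose(C, A @ B)
```

The packed real product is roughly twice the work of a complex multiply in each dimension touched, which matches the "1m" naming: one real GEMM replaces the complex GEMM entirely.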

Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework

- Computer Science, ACM Trans. Math. Softw.
- 2021

The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.

Mixed data layout kernels for vectorized complex arithmetic

- Computer Science, 2017 IEEE High Performance Extreme Computing Conference (HPEC)
- 2017

This work demonstrates that performance improvements of up to 2× can be attained with a mixed format within the computational routines, and describes how existing algorithms can easily be modified to implement the mixed-format complex layout.

Maintaining High Performance Across All Problem Sizes and Parallel Scales Using Microkernel-based Linear Algebra

- Computer Science
- 2017

A new approach that uses the microkernel framework provided by ATLAS to improve the performance of several linear algebra routines across all problem sizes, overcoming the shortcomings of conventional approaches.

Supporting mixed-datatype matrix multiplication within the BLIS framework

- Computer Science, ArXiv
- 2019

The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.

Implementing Strassen's Algorithm with BLIS, FLAME Working Note #79

- Computer Science
- 2016

The practical implementation of Strassen’s algorithm for matrix-matrix multiplication (DGEMM) requires no workspace beyond buffers already incorporated into conventional high-performance DGEMM implementations and can be plug-compatible with the standard DGEMM interface.

Learning from Optimizing Matrix-Matrix Multiplication

- Computer Science, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
- 2018

A carefully designed and scaffolded set of exercises leads the learner from a naive implementation towards one that extracts parallelism at multiple levels, ranging from instruction level parallelism to multithreaded parallelism via OpenMP to distributed memory parallelism using MPI.

On the Efficacy and High-Performance Implementation of Quaternion Matrix Multiplication

- Computer Science, ArXiv
- 2019

An optimized software implementation of quaternion matrix multiplication will be presented and will be shown to outperform a vendor tuned implementation for the analogous complex matrix operation.

## References

SHOWING 1-10 OF 27 REFERENCES

Anatomy of High-Performance Many-Threaded Matrix Multiplication

- Computer Science, 2014 IEEE 28th International Parallel and Distributed Processing Symposium
- 2014

This work describes how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM), and shows that with the advent of many-core architectures such as the IBM PowerPC A2 processor and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability.

BLIS: A Framework for Rapidly Instantiating BLAS Functionality

- Computer Science, ACM Trans. Math. Softw.
- 2015

Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).

Exploiting fast matrix multiplication within the level 3 BLAS

- Computer Science, TOMS
- 1990

Algorithms for the BLAS3 operations that are asymptotically faster than the conventional ones are described, based on Strassen's method for fast matrix multiplication, which is now recognized to be a practically useful technique once matrix dimensions exceed about 100.

The BLIS Framework

- Computer Science, ACM Trans. Math. Softw.
- 2016

It is shown, with very little effort, how the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS, and commercial vendor implementations such as AMD's ACML, IBM's ESSL, and Intel’s MKL libraries.

Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures

- Computer Science, VECPAR
- 2010

This extended abstract revisits the computation of the inverse of a symmetric positive definite matrix and demonstrates that, for some variants, nontrivial compiler techniques must then be applied to further increase the parallelism of the application.

Automatically Tuned Linear Algebra Software

- Computer Science, Proceedings of the IEEE/ACM SC98 Conference
- 1998

An approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units using the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS).

Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor

- Computer Science, ICPADS
- 2012

A variety of methods were employed to avoid L1 cache misses in single thread optimization, including cache and register blocking, the Loongson 3A 128-bit memory accessing extension instructions, software prefetching, and single precision floating-point SIMD instructions.

Stability of block algorithms with fast level-3 BLAS

- Computer Science, TOMS
- 1992

The numerical stability of the block algorithms in the new linear algebra program library LAPACK is investigated and it is shown that these algorithms have backward error analyses in which the backward error bounds are commensurate with the error bounds for the underlying level-3 BLAS (BLAS3).

Gaussian elimination is not optimal

- Mathematics
- 1969

Below we will give an algorithm which computes the coefficients of the product of two square matrices A and B of order n from the coefficients of A and B with less than 4.7·n^(log₂ 7) arithmetical…
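
The recursion behind this reference's n^(log₂ 7) bound, Strassen's seven-multiplication scheme for 2×2 blocks, can be sketched as follows (an illustrative recursive implementation assuming the matrix order is a power of two; `cutoff` is a tuning knob of our own, not part of the original algorithm):

```python
import numpy as np

def strassen(A, B, cutoff=32):
    """Strassen's recursion on square matrices whose order is a power of two."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # fall back to conventional multiplication on small blocks
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive block products instead of the conventional eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.default_rng(2).standard_normal((128, 128))
B = np.random.default_rng(3).standard_normal((128, 128))
assert np.allclose(strassen(A, B), A @ B)
```

Each level replaces eight block multiplications with seven at the cost of extra additions, giving the O(n^(log₂ 7)) ≈ O(n^2.81) operation count.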

Anatomy of high-performance matrix multiplication

- Computer Science, TOMS
- 2008

We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified…