Anatomy of high-performance matrix multiplication

@article{Goto2008AnatomyOH,
  title={Anatomy of high-performance matrix multiplication},
  author={Kazushige Goto and Robert A. van de Geijn},
  journal={ACM Trans. Math. Softw.},
  year={2008},
  volume={34},
  pages={12:1-12:25}
}
We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance. 
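To make the layered blocking the abstract alludes to concrete, here is a minimal C sketch of a cache-blocked matrix multiplication. It is an illustrative sketch, not GotoBLAS code: the block sizes MC, KC, NC and the function names gemm_blocked/gebp are assumptions, the inner "block times panel" kernel is plain C rather than a register-blocked assembly micro-kernel, and the packing of the A block and B panel into contiguous buffers is omitted.

/* Sketch of a cache-blocked C += A*B with A (m x k), B (k x n), C (m x n),
 * all column-major with leading dimensions lda, ldb, ldc. Block sizes are
 * illustrative placeholders, not GotoBLAS's tuned values. */
#include <stddef.h>

#define MC 256   /* rows of the A block (sized so the block stays in cache) */
#define KC 256   /* shared dimension of the A block / B panel               */
#define NC 4096  /* columns of the B panel streamed from memory             */

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static void gebp(size_t m, size_t n, size_t k,
                 const double *A, size_t lda,
                 const double *B, size_t ldb,
                 double *C, size_t ldc)
{
    /* "Block times panel" kernel: in a real implementation the A block is
     * packed so it stays resident in cache while B streams through. */
    for (size_t j = 0; j < n; ++j)
        for (size_t p = 0; p < k; ++p) {
            double bpj = B[p + j * ldb];
            for (size_t i = 0; i < m; ++i)
                C[i + j * ldc] += A[i + p * lda] * bpj;
        }
}

void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
{
    for (size_t jc = 0; jc < n; jc += NC)          /* panels of B and C    */
        for (size_t pc = 0; pc < k; pc += KC)      /* rank-KC updates      */
            for (size_t ic = 0; ic < m; ic += MC)  /* blocks of A          */
                gebp(MIN(MC, m - ic), MIN(NC, n - jc), MIN(KC, k - pc),
                     &A[ic + pc * lda], lda,
                     &B[pc + jc * ldb], ldb,
                     &C[ic + jc * ldc], ldc);
}

Even this naive version captures the key point of the paper: the KC x NC panel of B is reused across all row blocks of A, and the MC x KC block of A is reused across all columns of the B panel, which is what keeps the operands resident in the different levels of the memory hierarchy.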
Citations

High-performance implementation of the level-3 BLAS
A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-matrix multiplication into implementations of other commonly used …
Cache Friendly Strategies to Optimize Matrix Multiplication
This work showcases the effect of matrix multiplication strategies that are less time- and processor-intensive because they handle memory accesses effectively, and uses OpenMP, a multiprocessing toolkit, to show the effect of parallelizing matrix multiplication.
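As a hedged illustration of the kind of OpenMP parallelization this entry refers to (not the authors' code), the sketch below distributes row blocks of C across threads so that no two threads update the same elements of C; the block size BS, the row-major layout, and the function name are assumptions.

/* Cache-blocked, OpenMP-parallel C += A*B for square row-major n x n matrices. */
#include <omp.h>

#define BS 64

void matmul_omp(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < n; ii += BS)             /* row blocks of C (parallel) */
        for (int kk = 0; kk < n; kk += BS)         /* shared-dimension blocks    */
            for (int jj = 0; jj < n; jj += BS)     /* column blocks of C         */
                for (int i = ii; i < ii + BS && i < n; ++i)
                    for (int k = kk; k < kk + BS && k < n; ++k) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}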
GotoBLAS - Anatomy of a fast matrix multiplication High performance libraries in computational science
This paper is an attempt to summarize the theoretical and practical approaches that were used to develop high-performance BLAS code, and shows that ideas such as implementation analysis and efficient memory usage are useful for many real-world problems.
Implementing High-Performance Complex Matrix Multiplication via the 1M Method
  • F. V. Zee
  • SIAM J. Sci. Comput.
  • 2020
Almost all efforts to optimize high-performance matrix-matrix multiplication have been focused on the case where matrices contain real elements. The community's collective assumption appears to have …
Implementing high-performance complex matrix multiplication via the 3M and 4M methods
… of these so-called “induced” methods, and observe that the assembly-level method actually resides along the 4M spectrum of algorithmic variants. Implementations are developed within the BLIS …
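For orientation, the 3M idea referenced here can be sketched as follows: a complex product C += A*B stored in split real/imaginary form is induced from three real matrix multiplications instead of the four a conventional formulation needs. The naive rgemm() kernel, the naming, and the workspace handling below are illustrative stand-ins, not the BLIS implementation.

#include <stdlib.h>

/* C += A * B for square row-major n x n real matrices (naive stand-in). */
static void rgemm(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Complex C += A*B with real and imaginary parts stored separately. */
void zgemm_3m(int n,
              const double *Ar, const double *Ai,
              const double *Br, const double *Bi,
              double *Cr, double *Ci)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *T1 = calloc((size_t)n * n, sizeof(double)); /* Ar*Br    */
    double *T2 = calloc((size_t)n * n, sizeof(double)); /* Ai*Bi    */
    double *Sa = malloc(bytes);                         /* Ar + Ai  */
    double *Sb = malloc(bytes);                         /* Br + Bi  */

    for (int i = 0; i < n * n; ++i) { Sa[i] = Ar[i] + Ai[i]; Sb[i] = Br[i] + Bi[i]; }

    rgemm(n, Ar, Br, T1);   /* real multiplication 1                        */
    rgemm(n, Ai, Bi, T2);   /* real multiplication 2                        */
    rgemm(n, Sa, Sb, Ci);   /* real multiplication 3: Ci += (Ar+Ai)*(Br+Bi) */

    for (int i = 0; i < n * n; ++i) {
        Cr[i] += T1[i] - T2[i];  /* real part:      Ar*Br - Ai*Bi             */
        Ci[i] -= T1[i] + T2[i];  /* imaginary part: (Ar+Ai)(Br+Bi) - T1 - T2  */
    }
    free(T1); free(T2); free(Sa); free(Sb);
}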
On Composing Matrix Multiplication from Kernels
Matrix multiplication is often treated as a basic unit of computation in terms of which other operations are implemented, yielding high performance. In this paper initial evidence is provided that …
High Performance and Energy Efficient Integer Matrix Multiplication for Deep Learning
Following standard practice for inference with convolutional neural networks, the GEMM kernel operates with 16-bit integer arithmetic, yielding significant performance acceleration and cutting the memory requirements with respect to IEEE single precision by half, allowing the deployment of larger neural network models on low-power devices with limited storage capacity.
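A minimal sketch of the quantized-arithmetic idea described above (illustrative, not the paper's kernel): 16-bit integer operands halve the storage of 32-bit floats, while the dot products are accumulated in 32-bit integers; the row-major layout and naming are assumptions, and a production kernel would add the blocking and vectorization discussed elsewhere on this page.

#include <stdint.h>

void gemm_i16(int m, int n, int k,
              const int16_t *A,   /* m x k, row-major, quantized operand      */
              const int16_t *B,   /* k x n, row-major, quantized operand      */
              int32_t *C)         /* m x n, 32-bit accumulators               */
{
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            int32_t acc = 0;     /* safe for moderate k; larger k needs int64 */
            for (int p = 0; p < k; ++p)
                acc += (int32_t)A[i * k + p] * (int32_t)B[p * n + j];
            C[i * n + j] = acc;
        }
}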
Towards a High-Performance, Low-Power Linear Algebra Processor
Achieving high performance while reducing power consumption is the key question as technology scaling is reaching its limits. It is well accepted that application-specific custom hardware can achieve …
High-Performance Matrix Multiply on a Massively Multithreaded Fiteng1000 Processor
This paper presents parallel algorithms, with the A or B matrix shared in memory, for the massively multithreaded Fiteng1000 processor, and shows that the algorithms have good parallel performance and achieve near-peak performance.
Design of a massively parallel computing architecture for dense matrix multiplication
This paper starts with a formal analysis of the algorithms considering architectural aspects and then determines the structure of the architecture, which is nearly two orders of magnitude more performance/area efficient than a cutting-edge general-purpose processor, achieving nearly 1 TFLOP in a 100 mm² chip with 65 nm technology.

References

Showing 1-10 of 40 references
High-performance implementation of the level-3 BLAS
A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-matrix multiplication into implementations of other commonly used …
A Family of High-Performance Matrix Multiplication Algorithms
Using a simple model of hierarchical memories, mathematics is employed to determine a locally-optimal strategy for blocking matrices, and the resulting family of algorithms yields performance superior to that of methods that automatically tune such kernels.
A Family of High-Performance Matrix Multiplication Algorithms
Using a simple model of hierarchical memories, mathematics is employed to determine a locally-optimal strategy for blocking matrices, and the resulting family of algorithms yields performance superior to that of methods that automatically tune such kernels.
A Note On Parallel Matrix Inversion
We present one-sweep parallel algorithms for the inversion of general and symmetric positive definite matrices. The algorithms feature simple programming and performance optimization while …
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms
The paper gives two examples that illustrate how the algorithms and architectural features interplay to produce high-performance codes that are included in ESSL (Engineering and Scientific Subroutine Library); an overview of ESSL is also given.
A high-performance SIMD floating point unit for BlueGene/L: architecture, compilation, and algorithm design
Preliminary performance data shows that the algorithm-compiler-hardware combination delivers a significant fraction of peak floating-point performance for compute-bound kernels such as matrix multiplication, and delivers peak memory bandwidth for memory-bound kernels such as daxpy, while being largely insensitive to data alignment.
Families of algorithms related to the inversion of a Symmetric Positive Definite matrix
This work presents different algorithms for each of the sweeps of the inversion of a Symmetric Positive Definite matrix, as well as algorithms that compute the result in a single sweep, and outperforms the current ScaLAPACK implementation by 20-30 percent due to improved load balance on a distributed-memory architecture.
New trends in high performance computing
The automatically tuned linear algebra software (ATLAS) project is described, as well as the fundamental principles that underlie it, with the present emphasis on the basic linear algebra subprograms (BLAS), a widely used, performance-critical linear algebra kernel library.
FLAME: Formal Linear Algebra Methods Environment
This paper illustrates the observations by looking at the Formal Linear Algebra Methods Environment (FLAME), a framework that facilitates the derivation and implementation of linear algebra algorithms on sequential architectures, and demonstrates that lessons learned in the distributed-memory world can guide us toward better approaches even in the sequential world.
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark
This work states that it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation.
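To illustrate the GEMM-based approach this reference describes, here is a hedged C sketch of a symmetric rank-k update (SYRK-like, lower triangle of C := C + A*A^T) composed almost entirely from general matrix multiplications on row blocks of A; only the small diagonal blocks need a special-case loop. The block size, naming, and naive gemm_nt() kernel are illustrative assumptions rather than the paper's model implementation.

#define BS 64
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* C(mb x nb) += X(mb x k) * Y(nb x k)^T, row-major with leading dimensions. */
static void gemm_nt(int mb, int nb, int k,
                    const double *X, int ldx,
                    const double *Y, int ldy,
                    double *C, int ldc)
{
    for (int i = 0; i < mb; ++i)
        for (int j = 0; j < nb; ++j) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += X[i * ldx + p] * Y[j * ldy + p];
            C[i * ldc + j] += acc;
        }
}

/* Lower triangle of C := C + A*A^T, with A n x k and C n x n, row-major. */
void syrk_lower(int n, int k, const double *A, double *C)
{
    for (int jb = 0; jb < n; jb += BS) {
        int nb = MIN(BS, n - jb);
        /* Diagonal block: only its lower triangle is needed. */
        for (int i = 0; i < nb; ++i)
            for (int j = 0; j <= i; ++j) {
                double acc = 0.0;
                for (int p = 0; p < k; ++p)
                    acc += A[(jb + i) * k + p] * A[(jb + j) * k + p];
                C[(jb + i) * n + (jb + j)] += acc;
            }
        /* Everything below the diagonal block is a plain GEMM call. */
        if (jb + nb < n)
            gemm_nt(n - jb - nb, nb, k,
                    &A[(jb + nb) * k], k,
                    &A[jb * k], k,
                    &C[(jb + nb) * n + jb], n);
    }
}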