High-performance implementation of the level-3 BLAS
Kazushige Goto and Robert A. van de Geijn, ACM Trans. Math. Softw.

A simple but highly effective approach is presented for transforming high-performance implementations of matrix-matrix multiplication on cache-based architectures into implementations of the other commonly used matrix-matrix computations (the level-3 BLAS). Exceptional performance is demonstrated on various architectures.
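The core idea, casting another level-3 operation onto an optimized GEMM kernel, can be illustrated with a symmetric rank-k update (SYRK, C := A*Aᵀ + C on the lower triangle). The sketch below is not the paper's implementation; the sizes N, K, NB and the kernel names are assumptions chosen for exposition.

```c
/* Sketch (not the paper's code): casting SYRK, C := A*A^T + C on the
 * lower triangle, onto a GEMM kernel. N, K, NB and the kernel names
 * are illustrative assumptions. Row-major storage throughout. */
#define N  8   /* order of C, rows of A */
#define K  8   /* columns of A          */
#define NB 4   /* block size            */

/* GEMM kernel for the "NT" case: C += A * B^T. */
static void gemm_nt(int m, int n, int k,
                    const double *A, int lda,
                    const double *B, int ldb,
                    double *C, int ldc)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int p = 0; p < k; p++)
                s += A[i*lda + p] * B[j*ldb + p];
            C[i*ldc + j] += s;
        }
}

/* SYRK via GEMM: for each block column, a small special-case loop
 * updates the NB x NB diagonal block (lower triangle only), and the
 * entire panel below it is delegated to a single GEMM call. */
static void syrk_lower(const double *A, double *C)
{
    for (int j = 0; j < N; j += NB) {
        for (int i = j; i < j + NB; i++)        /* diagonal block */
            for (int jj = j; jj <= i; jj++) {
                double s = 0.0;
                for (int p = 0; p < K; p++)
                    s += A[i*K + p] * A[jj*K + p];
                C[i*N + jj] += s;
            }
        if (j + NB < N)                          /* panel below: GEMM */
            gemm_nt(N - j - NB, NB, K,
                    A + (j + NB)*K, K,
                    A + j*K,        K,
                    C + (j + NB)*N + j, N);
    }
}
```

The special-case code shrinks to the diagonal blocks, so almost all of the floating-point work inherits the performance of the GEMM kernel.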
Anatomy of high-performance matrix multiplication
We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified.
Implementing high-performance complex matrix multiplication via the 3m and 4m methods
Induced methods, which cast complex matrix multiplication in terms of real-domain kernels, are explored; the conventional assembly-level method is observed to reside along the 4M spectrum of algorithmic variants. Implementations are developed within the BLIS framework.
GotoBLAS - Anatomy of a fast matrix multiplication: high performance libraries in computational science
This paper summarizes the theoretical and practical approaches used to develop high-performance BLAS code, and shows how ideas such as implementation analysis and efficient memory usage are useful for many real-world problems.
Attaining High Performance in General-Purpose Computations on Current Graphics Processors
This paper evaluates the performance of linear algebra and image processing routines on both classical and unified GPU architectures and on traditional processors (CPUs).
BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing
Issues in existing multi-GPU level-3 BLAS implementations are investigated: improper load balancing, inefficient communication, insufficient GPU stream-level concurrency, and poor data caching impede current implementations from fully harnessing heterogeneous computing resources, and inter-GPU peer-to-peer (P2P) communication remains unexplored.
High-Performance Matrix Multiply on a Massively Multithreaded Fiteng1000 Processor
This paper presents parallel algorithms, with the A or B matrix shared in memory, for the massively multithreaded Fiteng1000 processor, and shows that the algorithms scale well in parallel and achieve near-peak performance.
Implementing High-Performance Complex Matrix Multiplication via the 1M Method
  • F. Van Zee
  • Computer Science, Mathematics
  • SIAM J. Sci. Comput.
  • 2020
Almost all efforts to optimize high-performance matrix-matrix multiplication have focused on the case where matrices contain real elements; this work instead targets the complex case, inducing it from real-domain kernels via the 1M method.
Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures
This paper presents a parallel implementation framework for dense matrix multiplication on multi-socket multi-core architectures that combines the Winograd algorithm and the classical algorithm to achieve dynamic load balancing and enforce data locality.
GEMM Optimization for a Decoupled Access/Execute Architecture Processor
The GEMM kernel for the DAE processor was divided into four levels; several levels of the new algorithm can self-adjust, and the performance of the algorithm was effectively improved.
Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor
A variety of methods were employed to avoid L1 cache misses in single-thread optimization, including cache and register blocking, the Loongson 3A 128-bit memory-access extension instructions, software prefetching, and single-precision floating-point SIMD instructions.
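Cache blocking and register blocking, the two techniques named above, can be sketched together: an outer loop nest walks cache-sized tiles while an inner micro-kernel keeps a small patch of C in local accumulators. This is a generic illustration, not the Loongson 3A code; the sizes and the 2x2 kernel shape are assumptions.

```c
/* Illustrative sketch of cache blocking plus register blocking for
 * GEMM (C += A*B, row-major). Block sizes and the 2x2 register kernel
 * are assumptions for exposition, not the Loongson 3A parameters. */
#define N  8   /* matrix order (multiple of the block sizes) */
#define NB 4   /* cache block size (assumption)              */

/* 2x2 register-blocked micro-kernel: four C accumulators live in
 * locals so the compiler can keep them in registers. */
static void kernel_2x2(const double *A, const double *B, double *C, int k)
{
    double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
    for (int p = 0; p < k; p++) {
        double a0 = A[0*N + p], a1 = A[1*N + p];
        double b0 = B[p*N + 0], b1 = B[p*N + 1];
        c00 += a0*b0; c01 += a0*b1;
        c10 += a1*b0; c11 += a1*b1;
    }
    C[0*N + 0] += c00; C[0*N + 1] += c01;
    C[1*N + 0] += c10; C[1*N + 1] += c11;
}

/* Cache-level blocking: iterate over NB x NB tiles so each tile of A
 * and B stays resident in cache while the register kernel sweeps it.
 * The k dimension is processed in NB chunks, accumulating into C. */
static void gemm_blocked(const double *A, const double *B, double *C)
{
    for (int jb = 0; jb < N; jb += NB)
        for (int pb = 0; pb < N; pb += NB)
            for (int ib = 0; ib < N; ib += NB)
                for (int i = ib; i < ib + NB; i += 2)
                    for (int j = jb; j < jb + NB; j += 2)
                        kernel_2x2(&A[i*N + pb], &B[pb*N + j],
                                   &C[i*N + j], NB);
}
```

In a production kernel the accumulators would map onto SIMD registers and the tiles would be packed into contiguous buffers, but the loop structure is the same.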


GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark
This work shows that it is possible to develop a portable, high-performance level-3 BLAS library relying mainly on a highly optimized GEMM, the routine for the general matrix multiply-and-add operation.
Toward Scalable Matrix Multiply on Multithreaded Architectures
We show empirically that some of the issues that affected the design of linear algebra libraries for distributed-memory architectures will also likely affect such libraries for shared-memory architectures.
A set of level 3 basic linear algebra subprograms
This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix-matrix operations and should provide for efficient and portable implementations on high-performance computers.
FLAME: Formal Linear Algebra Methods Environment
This paper illustrates the observations by looking at the Formal Linear Algebra Methods Environment (FLAME), a framework that facilitates the derivation and implementation of linear algebra algorithms on sequential architectures, and demonstrates that lessons learned in the distributed-memory world can guide us toward better approaches even in the sequential world.
Automatically Tuned Linear Algebra Software
An approach is described for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units, using the widely used linear algebra kernels known as the Basic Linear Algebra Subroutines (BLAS).
Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software
Some of the recent advances made by applying the paradigm of recursion to dense matrix computations on today's memory-tiered computer systems are reviewed in detail.
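The recursive paradigm that survey reviews can be sketched in a few lines: split C += A*B into quadrant subproblems until a small base case fits in cache, so blocking emerges at every level of the memory hierarchy automatically. The sizes and cutoff below are illustrative assumptions, not taken from the paper.

```c
/* Minimal sketch of a recursive blocked matrix multiply: C += A*B is
 * split into quadrant subproblems until a small base case remains.
 * N and CUTOFF are assumptions chosen for exposition. */
#define N      8   /* power-of-two order, for a clean recursion */
#define CUTOFF 2   /* base-case size                            */

/* n x n multiply-accumulate on submatrices with leading dimension N. */
static void rec_gemm(int n, const double *A, const double *B, double *C)
{
    if (n <= CUTOFF) {                 /* base case: plain triple loop */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int p = 0; p < n; p++)
                    C[i*N + j] += A[i*N + p] * B[p*N + j];
        return;
    }
    int h = n / 2;
    const double *A11 = A,       *A12 = A + h,
                 *A21 = A + h*N, *A22 = A + h*N + h;
    const double *B11 = B,       *B12 = B + h,
                 *B21 = B + h*N, *B22 = B + h*N + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*N, *C22 = C + h*N + h;
    /* each quadrant of C accumulates two half-size products */
    rec_gemm(h, A11, B11, C11); rec_gemm(h, A12, B21, C11);
    rec_gemm(h, A11, B12, C12); rec_gemm(h, A12, B22, C12);
    rec_gemm(h, A21, B11, C21); rec_gemm(h, A22, B21, C21);
    rec_gemm(h, A21, B12, C22); rec_gemm(h, A22, B22, C22);
}
```

The paper pairs this recursion with hybrid data structures (blocked, recursively ordered storage) so that each subproblem also touches contiguous memory; the sketch keeps conventional row-major storage for brevity.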
LAPACK Users' Guide
The third edition of LAPACK provides a guide to troubleshooting and installation of the routines, as well as examples of how to convert from LINPACK or EISPACK to LAPACK.
ACM Transactions on Mathematical Software, 2008. Received May 2006.