BLIS: A Framework for Rapidly Instantiating BLAS Functionality

@article{Zee2015BLISAF,
  title={BLIS: A Framework for Rapidly Instantiating BLAS Functionality},
  author={Field G. Van Zee and Robert A. van de Geijn},
  journal={ACM Trans. Math. Softw.},
  year={2015},
  volume={41},
  pages={14:1--14:33}
}
The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Users of BLAS-dependent applications are given a choice of using the traditional Fortran-77 BLAS interface, a generalized C interface, or any other higher-level interface that builds upon this latter API. Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries…
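As a rough sketch of that interface choice (not code from the paper), the fragment below calls double-precision GEMM once through the traditional Fortran-77 BLAS interface and once through BLIS's typed C API. The bli_dgemm signature with transpose flags plus separate row/column strides, and the availability of the dgemm_ prototype via blis.h when the BLAS compatibility layer is built, are assumptions to verify against the installed headers.

```c
/* Minimal sketch (not from the paper): the same double-precision GEMM
 * through two of the interfaces BLIS exposes.  Assumptions: BLIS built
 * with its BLAS compatibility layer and 32-bit BLAS integers (so blis.h
 * declares dgemm_), and a bli_dgemm typed-API signature taking transpose
 * flags plus separate row/column strides; check both against blis.h.   */
#include "blis.h"

int main(void)
{
    enum { N = 4 };
    static double A[N*N], B[N*N], C[N*N];
    for (int i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double one = 1.0, zero = 0.0;
    int    n   = N;

    /* 1) Traditional Fortran-77 BLAS interface: column-major storage,
     *    arguments passed by reference, leading dimensions only.       */
    dgemm_("N", "N", &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);

    /* 2) BLIS typed C API (assumed signature): row and column strides
     *    cover row-major, column-major, and general storage uniformly. */
    bli_dgemm(BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, N, N, N,
              &one,  A, 1, N,      /* rs_a = 1, cs_a = N: column-major */
                     B, 1, N,
              &zero, C, 1, N);
    return 0;
}
```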
The BLIS Framework
TLDR
It is shown how, with very little effort, the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS, and commercial vendor implementations such as AMD's ACML, IBM's ESSL, and Intel's MKL libraries.
The BLAS API of BLASFEO
TLDR
This article investigates the addition of a standard BLAS API to the BLASFEO framework, proposes an implementation that switches between two or more algorithms optimized for different matrix sizes, and investigates the benefits in scientific programming environments such as Octave, SciPy, and Julia.
Automatic generation of fast BLAS3-GEMM: A portable compiler approach
TLDR
The key insight is to leverage a wide range of architecture-specific abstractions already available in LLVM, by first generating a vectorized micro-kernel in the architecture-independent LLVM IR and then improving its performance by applying a series of domain-specific yet architecture-independent optimizations.
Integration and exploitation of intra-routine malleability in BLIS
TLDR
This paper leverages low-level (yet simple) APIs to integrate on-demand malleability across all Level-3 BLAS routines, and demonstrates the performance benefits of this approach by means of a higher-level dense matrix operation: the LU factorization with partial pivoting and look-ahead.
Automating the Last-Mile for High Performance Dense Linear Algebra
TLDR
This paper distills the implementation of the GEMM kernel into an even smaller kernel, an outer product, and analytically determines how available SIMD instructions can be used to compute the outer product efficiently, generating kernels whose performance is competitive with kernels implemented manually or via empirical search.
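To make the outer-product view concrete, here is a hypothetical scalar micro-kernel (MR, NR, and the packed operand layout are illustrative choices, not taken from the paper): an MR x NR block of C is updated by k successive rank-1 updates, and the inner two loops are what a vectorizing compiler or explicit SIMD code would map onto vector FMA instructions.

```c
/* Illustrative micro-kernel: C[MR x NR] += A[MR x k] * B[k x NR],
 * expressed as k successive rank-1 (outer-product) updates.
 * MR/NR and the packed storage layout are hypothetical choices.     */
enum { MR = 4, NR = 4 };

static void microkernel(int k,
                        const double *A,  /* packed: MR values per step */
                        const double *B,  /* packed: NR values per step */
                        double *C, int ldc)
{
    double c[MR][NR] = {{0.0}};   /* accumulator kept in registers */

    for (int p = 0; p < k; p++) {
        /* One outer product: column p of A times row p of B. */
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                c[i][j] += A[p*MR + i] * B[p*NR + j];
    }

    /* Write the accumulated block back into column-major C. */
    for (int j = 0; j < NR; j++)
        for (int i = 0; i < MR; i++)
            C[i + j*ldc] += c[i][j];
}
```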
FT-BLAS: a high performance BLAS implementation with online fault tolerance
TLDR
FT-BLAS is presented, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides comparable performance to modern state-of-the-art BLAS libraries on widely-used processors such as Intel Skylake and Cascade Lake.
Modern Generative Programming for Optimizing Small Matrix-Vector Multiplication
TLDR
This paper shows how a modern C++ approach based on generative programming techniques such as vectorization and loop unrolling, applied within a meta-programming framework, can deliver efficient, automatically generated code for such routines that is competitive with existing hand-tuned library kernels, at a much lower programming effort than writing assembly code.
Towards ABFT for BLIS GEMM (FLAME Working Note #76)
TLDR
It is demonstrated that ABFT can be incorporated into the BLAS-like Library Instantiation Software (BLIS) framework's implementation of this operation, degrading performance by only 10-15% on current multicore architectures like the Intel Xeon E5-2580 processor with 16 cores and cutting-edge many-core architectures like the Intel Xeon Phi processor with 60 cores.
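The checksum idea behind such ABFT schemes can be sketched as follows (this is the textbook construction, not the paper's BLIS integration): append the column sums of A as an extra row and the row sums of B as an extra column, multiply the extended operands, and compare the resulting checksum row/column against freshly recomputed sums of C.

```c
/* Illustrative ABFT check for C = A*B (row-major, tiny sizes):
 * A gets an extra checksum row (column sums), B an extra checksum
 * column (row sums); the extended product then carries checksums
 * that can be re-verified after the multiply.  Textbook scheme only. */
#include <math.h>
#include <stdio.h>

enum { M = 3, K = 3, N = 3 };

int main(void)
{
    double A[M+1][K], B[K][N+1], C[M+1][N+1] = {{0.0}};

    /* Fill A and B with arbitrary data, then append checksums. */
    for (int i = 0; i < M; i++) for (int p = 0; p < K; p++) A[i][p] = i + p + 1.0;
    for (int p = 0; p < K; p++) for (int j = 0; j < N; j++) B[p][j] = p * j + 1.0;
    for (int p = 0; p < K; p++) {
        A[M][p] = 0.0;
        for (int i = 0; i < M; i++) A[M][p] += A[i][p];   /* column sums of A */
        B[p][N] = 0.0;
        for (int j = 0; j < N; j++) B[p][N] += B[p][j];   /* row sums of B    */
    }

    /* Extended multiply: the (M+1) x (N+1) result carries checksums. */
    for (int i = 0; i <= M; i++)
        for (int j = 0; j <= N; j++)
            for (int p = 0; p < K; p++)
                C[i][j] += A[i][p] * B[p][j];

    /* Verify: the last row of C must equal the column sums of its leading block. */
    for (int j = 0; j < N; j++) {
        double s = 0.0;
        for (int i = 0; i < M; i++) s += C[i][j];
        if (fabs(s - C[M][j]) > 1e-9) printf("fault detected in column %d\n", j);
    }
    return 0;
}
```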
Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods
TLDR
This article sets out to develop a method of complex matrix multiplication that avoids the need for complex kernels altogether and develops two alternative approaches—one based on the 3m method and one that reflects the classic 4m formulation—each with multiple variants that rely only on real matrix multiplication kernels.
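For reference, the classical 3m identity underlying the first approach computes a complex product with three real multiplications: Cr = P1 - P2 and Ci = P3 - P1 - P2, where P1 = Ar*Br, P2 = Ai*Bi, and P3 = (Ar+Ai)(Br+Bi). The sketch below applies it with a naive real GEMM stand-in on split real/imaginary storage; the paper's packing and kernel-level details are not reproduced.

```c
/* Sketch of the classical 3m idea on split real/imaginary storage.
 * rgemm() stands in for any real GEMM (naive square version here);
 * function names and the n x n restriction are illustrative only.   */
#include <stdlib.h>

static void rgemm(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int p = 0; p < n; p++) s += A[i*n + p] * B[p*n + j];
            C[i*n + j] = s;
        }
}

/* Complex C = A*B via three real multiplications (3m). */
void zgemm_3m(int n, const double *Ar, const double *Ai,
              const double *Br, const double *Bi, double *Cr, double *Ci)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *P1 = malloc(bytes), *P2 = malloc(bytes), *P3 = malloc(bytes);
    double *As = malloc(bytes), *Bs = malloc(bytes);

    for (int i = 0; i < n*n; i++) { As[i] = Ar[i] + Ai[i]; Bs[i] = Br[i] + Bi[i]; }
    rgemm(n, Ar, Br, P1);          /* P1 = Ar*Br            */
    rgemm(n, Ai, Bi, P2);          /* P2 = Ai*Bi            */
    rgemm(n, As, Bs, P3);          /* P3 = (Ar+Ai)*(Br+Bi)  */
    for (int i = 0; i < n*n; i++) {
        Cr[i] = P1[i] - P2[i];
        Ci[i] = P3[i] - P1[i] - P2[i];
    }
    free(P1); free(P2); free(P3); free(As); free(Bs);
}
```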
Evaluation of Open-Source Linear Algebra Libraries targeting ARM and RISC-V Architectures
TLDR
Results show that especially for matrix operations and larger problem sizes, optimized BLAS implementations allow for significant performance gains when compared to pure C implementations.
...
...

References

Showing 1-10 of 78 references
BLIS: A Framework for Rapidly Instantiating BLAS Functionality
TLDR
Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
Implementing Level-3 BLAS with BLIS: Early Experience (FLAME Working Note #69)
TLDR
This paper demonstrates how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures, and provides compelling results that suggest the framework’s leverage extends to the multithreaded domain.
Anatomy of High-Performance Many-Threaded Matrix Multiplication
TLDR
This work describes how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM), and shows that with the advent of many-core architectures such as the IBM PowerPC A2 processor and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability.
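A schematic of that layered structure, with three cache-blocking loops wrapped around two register-blocking loops and a micro-kernel, is sketched below; the block sizes are placeholders, packing and edge-case handling are elided, and the code only illustrates where the loops that BLIS can parallelize sit.

```c
/* Schematic of the layered (GotoBLAS-style) loop structure around a
 * micro-kernel: three cache-blocking loops (NC, KC, MC) and two
 * register-blocking loops (NR, MR).  Packing of A/B panels and edge
 * cases are elided; block sizes are placeholders, not tuned values.  */
enum { NC = 256, KC = 128, MC = 64, NR = 4, MR = 4 };

/* Innermost kernel: C[MR x NR] += A[MR x kc] * B[kc x NR] (column-major). */
static void ukernel(int kc, const double *A, int lda,
                    const double *B, int ldb, double *C, int ldc)
{
    for (int p = 0; p < kc; p++)
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                C[i + j*ldc] += A[i + p*lda] * B[p + j*ldb];
}

/* Column-major C(m x n) += A(m x k) * B(k x n); for simplicity m, n, k
 * are assumed to be multiples of the corresponding block sizes.        */
void gemm_blocked(int m, int n, int k, const double *A, const double *B, double *C)
{
    for (int jc = 0; jc < n; jc += NC)                    /* loop 5: NC panels of B/C   */
        for (int pc = 0; pc < k; pc += KC)                /* loop 4: KC panels (pack B) */
            for (int ic = 0; ic < m; ic += MC)            /* loop 3: MC panels (pack A) */
                for (int jr = 0; jr < NC; jr += NR)       /* loop 2: NR micro-panels    */
                    for (int ir = 0; ir < MC; ir += MR)   /* loop 1: MR micro-panels    */
                        ukernel(KC,
                                &A[(ic+ir) + pc*m],      m,
                                &B[pc + (jc+jr)*k],      k,
                                &C[(ic+ir) + (jc+jr)*m], m);
}
```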
Build to order linear algebra kernels
TLDR
Preliminary work is presented on a domain-specific compiler that generates implementations for arbitrary sequences of basic linear algebra operations and tunes them for memory efficiency.
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark
TLDR
This work states that it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation.
Automating the generation of composed linear algebra kernels
TLDR
A novel algorithm for compiling linear algebra kernels and searching for the best combination of optimization choices is described and a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization is presented.
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
TLDR
A BLAS GEMM compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the SPARCstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k.
Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures
TLDR
This paper establishes a baseline by studying GEneral Matrix-matrix Multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures and argues that these customizations can be generalized to perform other representative linear algebra operations.
FLAME: Formal Linear Algebra Methods Environment
TLDR
This paper illustrates the observations by looking at the Formal Linear Algebra Methods Environment (FLAME), a framework that facilitates the derivation and implementation of linear algebra algorithms on sequential architectures, and demonstrates that lessons learned in the distributed-memory world can guide us toward better approaches even in the sequential world.
Programming matrix algorithms-by-blocks for thread-level parallelism
TLDR
It is argued that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions, and a philosophy of abstraction and separation of concerns is advocated as a promising solution in this problem domain.
...
...