# BLIS: A Framework for Rapidly Instantiating BLAS Functionality

@article{Zee2015BLISAF,
  title   = {BLIS: A Framework for Rapidly Instantiating BLAS Functionality},
  author  = {Field G. Van Zee and Robert A. van de Geijn},
  journal = {ACM Trans. Math. Softw.},
  year    = {2015},
  volume  = {41},
  pages   = {14:1-14:33}
}

The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. […] Users of BLAS-dependent applications are given a choice of using the traditional Fortran-77 BLAS interface, a generalized C interface, or any other higher-level interface that builds upon this latter API. Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
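The generalized C interface mentioned above revolves around matrices that carry explicit row and column strides, so that column-major Fortran-77 storage becomes just one special case. A minimal pure-Python sketch of this idea (hypothetical function name, not the actual BLIS API):

```python
# Hypothetical sketch of a stride-generalized gemm: every matrix is a flat
# buffer plus a row stride and a column stride, so one routine covers
# column-major (Fortran-77 BLAS), row-major, and general strided storage.

def gemm_strided(m, n, k, alpha, a, rs_a, cs_a, b, rs_b, cs_b, beta, c, rs_c, cs_c):
    """C := beta*C + alpha*A*B on flat buffers with explicit strides."""
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * rs_a + p * cs_a] * b[p * rs_b + j * cs_b]
            c[i * rs_c + j * cs_c] = beta * c[i * rs_c + j * cs_c] + alpha * acc

# 2x2 example: A and B stored column-major (rs=1, cs=m), as the
# Fortran-77 BLAS interface assumes.
a = [1.0, 3.0, 2.0, 4.0]   # column-major [[1,2],[3,4]]
b = [5.0, 7.0, 6.0, 8.0]   # column-major [[5,6],[7,8]]
c = [0.0, 0.0, 0.0, 0.0]
gemm_strided(2, 2, 2, 1.0, a, 1, 2, b, 1, 2, 0.0, c, 1, 2)
# c now holds [[19,22],[43,50]] in column-major order
```

Passing `rs=1, cs=m` recovers the column-major BLAS convention, while `rs=n, cs=1` gives row-major storage with no separate code path.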


## 219 Citations

The BLIS Framework

- Computer Science
- ACM Trans. Math. Softw.
- 2016

It is shown how, with very little effort, the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS, and commercial vendor implementations such as AMD's ACML, IBM's ESSL, and Intel's MKL libraries.

The BLAS API of BLASFEO

- Computer Science
- ACM Trans. Math. Softw.
- 2020

This article investigates the addition of a standard BLAS API to the BLASFEO framework, proposes an implementation that switches between two or more algorithms optimized for different matrix sizes, and examines the benefits in scientific programming languages such as Octave, SciPy, and Julia.

Automatic generation of fast BLAS3-GEMM: A portable compiler approach

- Computer Science
- 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
- 2017

The key insight is to leverage a wide range of architecture-specific abstractions already available in LLVM, by first generating a vectorized micro-kernel in the architecture-independent LLVM IR and then improving its performance by applying a series of domain-specific yet architecture-independent optimizations.

Integration and exploitation of intra-routine malleability in BLIS

- Computer Science
- The Journal of Supercomputing
- 2019

This paper leverages low-level (yet simple) APIs to integrate on-demand malleability across all Level-3 BLAS routines, and demonstrates the performance benefits of this approach by means of a higher-level dense matrix operation: the LU factorization with partial pivoting and look-ahead.

Automating the Last-Mile for High Performance Dense Linear Algebra

- Computer Science
- ArXiv
- 2016

This paper distills the implementation of the GEMM kernel into an even smaller kernel, an outer product, and analytically determines how available SIMD instructions can be used to compute the outer product efficiently, generating kernels whose performance is competitive with kernels implemented manually or found via empirical search.
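The outer-product formulation mentioned above can be sketched concretely (hypothetical names; pure Python standing in for SIMD intrinsics): each of the k iterations performs one rank-1 update of the small register tile of C, which corresponds to the broadcast-plus-FMA pattern that maps well onto SIMD instructions.

```python
def outer_product_kernel(a_cols, b_rows, C):
    """C += sum_p outer(a_cols[p], b_rows[p]): the rank-1 view of gemm.

    a_cols[p] is the p-th column of an MR x k panel of A;
    b_rows[p] is the p-th row of a k x NR panel of B.
    """
    for ap, bp in zip(a_cols, b_rows):   # one rank-1 (outer-product) update per k-step
        for i, ai in enumerate(ap):      # broadcast one element of the A column...
            for j, bj in enumerate(bp):  # ...across the B row: SIMD FMA territory
                C[i][j] += ai * bj
    return C
```

Summing these k rank-1 updates reproduces the full MR x NR block of the product A*B, so the micro-kernel reduces to a loop of outer products.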

FT-BLAS: a high performance BLAS implementation with online fault tolerance

- Computer Science
- ICS
- 2021

FT-BLAS is presented, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides comparable performance to modern state-of-the-art BLAS libraries on widely-used processors such as Intel Skylake and Cascade Lake.

Modern Generative Programming for Optimizing Small Matrix-Vector Multiplication

- Computer Science
- 2018 International Conference on High Performance Computing & Simulation (HPCS)
- 2018

This paper shows how a modern C++ approach based on generative programming techniques, such as vectorization and loop unrolling within a meta-programming framework, can deliver efficient automatically generated code for such routines that is competitive with existing hand-tuned library kernels, at a very low programming effort compared to writing assembly code.

Towards ABFT for BLIS GEMM (FLAME Working Note #76)

- Computer Science
- 2015

It is demonstrated that ABFT can be incorporated into the BLAS-like Library Instantiation Software (BLIS) framework's implementation of this operation, degrading performance by only 10-15% on current multicore architectures like the Intel Xeon E5-2580 processor with 16 cores and cutting-edge many-core architectures like the Intel Xeon Phi processor with 60 cores.

Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods

- Computer Science
- ACM Trans. Math. Softw.
- 2017

This article sets out to develop a method of complex matrix multiplication that avoids the need for complex kernels altogether and develops two alternative approaches—one based on the 3m method and one that reflects the classic 4m formulation—each with multiple variants that rely only on real matrix multiplication kernels.
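The 3m idea referenced above can be made concrete: writing A = Ar + i·Ai and B = Br + i·Bi, only three real multiplications are needed, since Cr = ArBr − AiBi and Ci = (Ar+Ai)(Br+Bi) − ArBr − AiBi. A pure-Python illustration (hypothetical names, with a naive real gemm standing in for an optimized real kernel):

```python
def rgemm(A, B):
    """Naive real matrix multiply (stand-in for an optimized real gemm kernel)."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def gemm_3m(Ar, Ai, Br, Bi):
    """Real/imaginary parts of C = (Ar + i*Ai)(Br + i*Bi) via 3 real gemms."""
    T1 = rgemm(Ar, Br)                       # real gemm 1: Ar*Br
    T2 = rgemm(Ai, Bi)                       # real gemm 2: Ai*Bi
    S = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(Ar, Ai)]
    T = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(Br, Bi)]
    T3 = rgemm(S, T)                         # real gemm 3: (Ar+Ai)*(Br+Bi)
    Cr = [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(T1, T2)]
    Ci = [[z - x - y for z, x, y in zip(r3, r1, r2)]
          for r3, r1, r2 in zip(T3, T1, T2)]
    return Cr, Ci
```

The classic 4m formulation instead computes the four products ArBr, AiBi, ArBi, and AiBr directly; 3m trades one multiplication for extra additions.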

Evaluation of Open-Source Linear Algebra Libraries targeting ARM and RISC-V Architectures

- Computer Science
- 2020 15th Conference on Computer Science and Information Systems (FedCSIS)
- 2020

Results show that especially for matrix operations and larger problem sizes, optimized BLAS implementations allow for significant performance gains when compared to pure C implementations.

## References

SHOWING 1-10 OF 78 REFERENCES

BLIS: A Framework for Rapidly Instantiating BLAS Functionality

- Computer Science
- 2015

Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).

Implementing Level-3 BLAS with BLIS: Early Experience (FLAME Working Note #69)

- Computer Science
- 2013

This paper demonstrates how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures, and provides compelling results that suggest the framework’s leverage extends to the multithreaded domain.

Anatomy of High-Performance Many-Threaded Matrix Multiplication

- Computer Science
- 2014 IEEE 28th International Parallel and Distributed Processing Symposium
- 2014

This work describes how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM), and shows that with the advent of many-core architectures such as the IBM PowerPC A2 processor and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability.
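The GotoBLAS-style loop nest that BLIS generalizes can be sketched in pure Python. This is an illustrative sketch only: the block sizes NC/KC/MC/NR/MR below are tiny placeholder values rather than tuned cache/register parameters, and real implementations also pack panels of A and B into contiguous buffers, which is omitted here.

```python
# Illustrative block sizes; real values are chosen per cache level and register file.
NC, KC, MC, NR, MR = 4, 4, 4, 2, 2

def gemm_blocked(A, B, C):
    """C += A*B using the Goto/BLIS five-loop blocking structure."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, NC):                             # 5th loop: columns of C/B, blocked by NC
        for pc in range(0, k, KC):                         # 4th loop: k dimension, blocked by KC
            for ic in range(0, m, MC):                     # 3rd loop: rows of C/A, blocked by MC
                for jr in range(jc, min(jc + NC, n), NR):          # 2nd loop: NR-wide slivers of B
                    for ir in range(ic, min(ic + MC, m), MR):      # 1st loop: MR-tall slivers of A
                        # micro-kernel: update an MR x NR tile of C
                        # by rank-1 steps over the KC-length panel
                        for p in range(pc, min(pc + KC, k)):
                            for i in range(ir, min(ir + MR, m)):
                                for j in range(jr, min(jr + NR, n)):
                                    C[i][j] += A[i][p] * B[p][j]
```

In BLIS only the innermost micro-kernel is written per architecture; the surrounding loops (and, in the many-threaded setting the paper studies, the parallelization both within and around that kernel) are portable framework code.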

Build to order linear algebra kernels

- Computer Science
- 2008 IEEE International Symposium on Parallel and Distributed Processing
- 2008

Preliminary work is presented on a domain-specific compiler that generates implementations for arbitrary sequences of basic linear algebra operations and tunes them for memory efficiency.

GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

- Computer Science
- TOMS
- 1998

This work shows that it is possible to develop a portable, high-performance level-3 BLAS library relying mainly on a highly optimized GEMM, the routine for the general matrix multiply-and-add operation.
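A minimal sketch of that GEMM-based idea (hypothetical helper names, pure Python): a toy lower-triangular SYRK, C := C + A*Aᵀ, arranged so that all off-diagonal work is cast as general matrix multiplies, leaving only the small diagonal blocks as non-GEMM code.

```python
def gemm_acc(A, B, C):
    """C += A*B: the single optimized building block everything leans on."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            C[i][j] += sum(A[i][p] * B[p][j] for p in range(len(B)))

def syrk_lower(A, C, nb=2):
    """Lower triangle of C += A*A^T, with off-diagonal blocks done as GEMMs."""
    n, k = len(A), len(A[0])
    for i in range(0, n, nb):
        ib = min(nb, n - i)
        # diagonal block: the only part that needs symmetric (non-GEMM) code
        for ii in range(i, i + ib):
            for jj in range(i, ii + 1):
                C[ii][jj] += sum(A[ii][p] * A[jj][p] for p in range(k))
        # blocks strictly below the diagonal: plain GEMM calls
        for j in range(0, i, nb):
            jb = min(nb, n - j)
            Ai = [A[r] for r in range(i, i + ib)]                     # ib x k panel
            BjT = [[A[r][p] for r in range(j, j + jb)] for p in range(k)]  # k x jb (A^T panel)
            Ci = [C[r][j:j + jb] for r in range(i, i + ib)]           # copy of the C block
            gemm_acc(Ai, BjT, Ci)
            for r in range(ib):                                       # write the block back
                C[i + r][j:j + jb] = Ci[r]
```

Since most of the flops land in `gemm_acc`, optimizing that one routine lifts the whole level-3 layer, which is the portability argument the paper makes.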

Automating the generation of composed linear algebra kernels

- Computer Science
- Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
- 2009

A novel algorithm for compiling linear algebra kernels and searching for the best combination of optimization choices is described and a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization is presented.

Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

- Computer Science
- ICS '97
- 1997

A BLAS GEMM-compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k.

Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures

- Computer Science
- IEEE Transactions on Computers
- 2012

This paper establishes a baseline by studying GEneral Matrix-matrix Multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures and argues that these customizations can be generalized to perform other representative linear algebra operations.

FLAME: Formal Linear Algebra Methods Environment

- Computer Science
- TOMS
- 2001

This paper illustrates its observations through the Formal Linear Algebra Methods Environment (FLAME), a framework that facilitates the derivation and implementation of linear algebra algorithms on sequential architectures, and demonstrates that lessons learned in the distributed-memory world can guide us toward better approaches even in the sequential world.

Programming matrix algorithms-by-blocks for thread-level parallelism

- Computer Science
- TOMS
- 2009

It is argued that evolving legacy libraries for dense and banded linear algebra is not a viable solution, due to constraints imposed by early design decisions, and a philosophy of abstraction and separation of concerns is presented that offers a promising solution in this problem domain.