# MOB forms: a class of multilevel block algorithms for dense linear algebra operations

```bibtex
@inproceedings{Navarro1994MOBFA,
  title     = {MOB forms: a class of multilevel block algorithms for dense linear algebra operations},
  author    = {Juan J. Navarro and Toni Juan and Tom{\'a}s Lang},
  booktitle = {ICS '94},
  year      = {1994}
}
```

Multilevel block algorithms exploit the data locality of linear algebra operations when executed on machines with several levels in the memory hierarchy. It is shown that the family we call Multilevel Orthogonal Block (MOB) algorithms is optimal and easy to design, and that the multilevel approach produces significant performance improvements. The effects of cache interference, TLB misses, and page faults are also considered. The multilevel block algorithms are evaluated…


## 45 Citations

Deliverable HwA 5b: Multilevel Blocking and Prefetching for Linear Algebra

- Computer Science
- 2008

This research aims to use the multilevel orthogonal blocking approach in conjunction with other software techniques to further improve the performance of linear algebra computations.

Block Algorithms to Speed Up the Sparse Matrix by Dense Matrix Multiplication on High Performance Workstations

- Computer Science
- 1995

This research aims to use the multilevel orthogonal blocking approach in conjunction with other software techniques to further improve the performance of linear algebra computations.

Block algorithms for sparse matrix computations on high performance workstations

- Computer Science, ICS '96
- 1996

This paper analyzes the use of Blocking, Data Precopying and Software Pipelining to improve the performance of sparse matrix computations on superscalar workstations and shows that there is a clear difference between the dense case and the sparse case in terms of the compromises to be adopted to optimize the algorithms.

Data Prefetching for Linear Algebra Operations on High Performance Workstations

- Computer Science
- 1995

The performance of the dense matrix by matrix multiplication executed on a super-scalar high performance workstation is improved using binding and nonbinding prefetching to hide the memory latency together with the well known technique of blocking.

Data prefetching and multilevel blocking for linear algebra operations

- Computer Science, ICS '96
- 1996

This paper analyzes the behavior of matrix multiplication algorithms for large matrices on a superscalar and superpipelined processor with a multilevel memory hierarchy when these techniques are applied together, and compares two different approaches to data prefetching, binding versus non-binding, and finds the latter remarkably more effective than the former due mainly to its flexibility.

A framework for efficient execution of matrix computations

- Computer Science
- 2006

This work presents an alternative way to produce efficient kernels automatically, based on a set of simple codes written in a high level language, which can be parameterized at compilation time, and shows that techniques used in linear algebra codes can be useful in other fields.

Multilevel Blocking in Complex Iteration Spaces

- Computer Science
- 1996

A technique is proposed to perform loop interchange in non-convex iteration spaces that computes the loop bounds exactly, together with an order for index set splitting that guarantees each loop in the nest is processed only once and also avoids code explosion.

Block Algorithms for Sparse Matrix by Dense Matrix Multiplication

- Computer Science
- 1994

The performance of forms without blocking is determined, and the improvement that can be obtained by using two levels of blocking (at the register and cache levels) is shown.

Exploitation of Multilevel Parallelism on Structured Linear Systems

- Computer Science
- 1996

It is argued that software designers and programmers need a profound knowledge of the architecture and programming tools of present computers in order to exploit their resources well.

A framework for high‐performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low‐level kernels

- Computer Science, Concurr. Comput. Pract. Exp.
- 2002

Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm, and the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high‐performance code is exposed.

## References

SHOWING 1-10 OF 20 REFERENCES

Impact of Hierarchical Memory Systems On Linear Algebra Algorithm Design

- Computer Science
- 1988

A methodology is proposed that facilitates analysis of the behavior of the matrix-matrix primitives and the resulting block algorithms as a function of certain system parameters to identify the limits of performance improvement possible via blocking and any contradictory trends that require trade-off consideration.

Hierarchical blocking and data flow analysis for numerical linear algebra

- Computer Science, Proceedings SUPERCOMPUTING '90
- 1990

It is shown that data flow direction and leading dimensions are crucial factors in optimizing linear algebra programs and a novel blocking strategy called hierarchical blocking and data-flow analysis is proposed.

The cache performance and optimizations of blocked algorithms

- Computer Science, ASPLOS IV
- 1991

It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.

Parallel Algorithms for Dense Linear Algebra Computations

- Computer Science, SIAM Rev.
- 1990

The purpose is to review the current status and to provide an overall perspective of parallel algorithms for solving dense, banded, or block-structured problems arising in the major areas of direct solution of linear systems, least squares computations, eigenvalue and singular value computation, and rapid elliptic solvers.

Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine

- Computer Science
- 1984

This paper examines common implementations of linear algebra algorithms, such as matrix-vector multiplication, matrix-matrix multiplication and the solution of linear equations for efficiency on a computer architecture which uses vector processing and has pipelined instruction execution.

Organizing matrices and matrix operations for paged memory systems

- Computer Science, CACM
- 1969

It is shown that carefully designed matrix algorithms can lead to enormous savings in the number of page faults occurring when only a small part of the total matrix can be in main memory at one time.

Compiler blockability of numerical algorithms

- Computer Science, Proceedings Supercomputing '92
- 1992

An attempt was made to determine whether a compiler can automatically restructure computations well enough to avoid the need for hand blocking, and it was shown that knowledge about which operations commute can enable a compiler to succeed in blocking codes that could not be blocked by any compiler based strictly on dependence analysis.

LAPACK Working Note No. 28: The IBM RISC System/6000 and Linear Algebra Operations

- Computer Science
- 1990

The performance of blocked algorithms commonly used in solving problems in numerical linear algebra on the IBM RISC System/6000 workstation is described, and the techniques used in achieving high performance on such an architecture are discussed.

To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

- Computer Science, Supercomputing '93
- 1993

Preliminary experimental results demonstrate that, because of the sensitivity of cache conflicts to small changes in problem size and base addresses, selective copying can lead to better overall performance than either no copying, complete copying, or copying based on manually applied heuristics.

The Design of the DEC 3000 AXP Systems, Two High-performance Workstations

- Computer Science, Digit. Tech. J.
- 1992

A family of high-performance 64-bit RISC workstations and servers based on the new Digital Alpha AXP architecture is described. The hardware implementation uses the powerful new DECchip 21064 CPU and…