MOB forms: a class of multilevel block algorithms for dense linear algebra operations

@inproceedings{Navarro1994MOBFA,
  title={MOB forms: a class of multilevel block algorithms for dense linear algebra operations},
  author={Juan J. Navarro and Toni Juan and Tom{\'a}s Lang},
  booktitle={ICS '94},
  year={1994}
}
Multilevel block algorithms exploit the data locality in linear algebra operations when executed on machines with several levels in the memory hierarchy. It is shown that the family we call Multilevel Orthogonal Block (MOB) algorithms is optimal and easy to design, and that using the multilevel approach produces significant performance improvements. The effects of cache interference, TLB misses, and page faults are also considered. The multilevel block algorithms are evaluated… 
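To make the blocking idea concrete, here is a minimal C sketch of a matrix multiplication with two levels of blocking. It is not the paper's MOB forms themselves, only an illustration of the multilevel approach; the block sizes B2 and B1 are placeholder values that would in practice be chosen from the capacities of two cache levels.

```c
/* Illustrative two-level blocked matrix multiplication, C = C + A*B,
 * for row-major n x n matrices.  Outer blocks (B2) target a larger,
 * outer cache level; inner blocks (B1) target a smaller, inner level.
 * Block sizes are assumed placeholder values, not taken from the paper. */
#include <stddef.h>

#define B2 128   /* assumed outer-level block size */
#define B1 32    /* assumed inner-level block size */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void matmul_blocked2(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii2 = 0; ii2 < n; ii2 += B2)
    for (size_t kk2 = 0; kk2 < n; kk2 += B2)
    for (size_t jj2 = 0; jj2 < n; jj2 += B2)
        /* second level of blocking inside each outer block */
        for (size_t ii = ii2; ii < min_sz(ii2 + B2, n); ii += B1)
        for (size_t kk = kk2; kk < min_sz(kk2 + B2, n); kk += B1)
        for (size_t jj = jj2; jj < min_sz(jj2 + B2, n); jj += B1)
            for (size_t i = ii; i < min_sz(ii + B1, n); i++)
            for (size_t k = kk; k < min_sz(kk + B1, n); k++) {
                double a = A[i*n + k];              /* reused across j */
                for (size_t j = jj; j < min_sz(jj + B1, n); j++)
                    C[i*n + j] += a * B[k*n + j];
            }
}
```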
Deliverable HwA 5b: Multilevel Blocking and Prefetching for Linear Algebra
TLDR
This research aims to use the multilevel orthogonal blocking approach in conjunction with other software techniques to further improve the performance of linear algebra computations.
Block Algorithms to speed up the Sparse Matrix by Dense Matrix Multiplication on High Performance Workstations
TLDR
This research aims to use the multilevel orthogonal blocking approach in conjunction with other software techniques to further improve the performance of linear algebra computations.
Block algorithms for sparse matrix computations on high performance workstations
TLDR
This paper analyzes the use of Blocking, Data Precopying and Software Pipelining to improve the performance of sparse matrix computations on superscalar workstations and shows that there is a clear difference between the dense case and the sparse case in terms of the compromises to be adopted to optimize the algorithms.
Data Prefetching for Linear Algebra Operations on High Performance Workstations
TLDR
The performance of the dense matrix by matrix multiplication executed on a super-scalar high performance workstation is improved using binding and nonbinding prefetching to hide the memory latency together with the well known technique of blocking.
Data prefetching and multilevel blocking for linear algebra operations
TLDR
This paper analyzes the behavior of matrix multiplication algorithms for large matrices on a superscalar and superpipelined processor with a multilevel memory hierarchy when these techniques are applied together, and compares two different approaches to data prefetching, binding versus non-binding, and finds the latter remarkably more effective than the former due mainly to its flexibility.
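As a rough illustration of combining blocking with non-binding prefetching, the sketch below uses the GCC/Clang `__builtin_prefetch` hint inside the daxpy-style inner update of a blocked multiplication. The prefetch distance is an assumed tuning parameter, not a value reported in the paper; a non-binding prefetch only moves data toward the cache and binds nothing to a register, which is the flexibility the paper refers to.

```c
/* Sketch: inner update of a blocked matrix multiplication with non-binding
 * prefetching via __builtin_prefetch (GCC/Clang).  PF_DIST is an assumed
 * prefetch distance in elements, to be tuned to the memory latency. */
#include <stddef.h>

#define PF_DIST 64   /* assumed prefetch distance */

void daxpy_row(size_t n, double a, const double *restrict b, double *restrict c)
{
    for (size_t j = 0; j < n; j++) {
        /* request future elements of both streams ahead of their use */
        size_t jp = (j + PF_DIST < n) ? j + PF_DIST : n - 1;
        __builtin_prefetch(&b[jp], 0 /* read  */, 1);
        __builtin_prefetch(&c[jp], 1 /* write */, 1);
        c[j] += a * b[j];
    }
}
```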
A framework for efficient execution of matrix computations
TLDR
This work presents an alternative way to produce efficient kernels automatically, based on a set of simple codes written in a high level language, which can be parameterized at compilation time, and shows that techniques used in linear algebra codes can be useful in other fields.
Multilevel Blocking in Complex Iteration Spaces
TLDR
A technique for performing loop interchange in non-convex iteration spaces that computes the loop bounds exactly is proposed, along with an order for index set splitting that guarantees that each loop in the nest is processed only once and also avoids code explosion.
Block Algorithms for Sparse Matrix by Dense Matrix Multiplication
TLDR
The performance of forms without blocking is determined and the improvement that can be obtained by using two levels of blocking (at the register and cache levels) is shown.
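The register-level part of such two-level blocking can be sketched as follows for the simpler dense case: a 2x2 tile of C is held in scalar accumulators, which the compiler can keep in registers, while the k loop streams through A and B. Dimensions divisible by 2 are assumed to keep the example short; a cache-level blocking loop would wrap around this kernel.

```c
/* Sketch of register-level blocking (dense case, n assumed even):
 * a 2x2 tile of C stays in scalar accumulators across the whole k loop,
 * so each element of C is loaded and stored only once per tile. */
#include <stddef.h>

void matmul_reg2x2(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i += 2)
        for (size_t j = 0; j < n; j += 2) {
            double c00 = C[i*n + j],     c01 = C[i*n + j + 1];
            double c10 = C[(i+1)*n + j], c11 = C[(i+1)*n + j + 1];
            for (size_t k = 0; k < n; k++) {
                double a0 = A[i*n + k],     a1 = A[(i+1)*n + k];
                double b0 = B[k*n + j],     b1 = B[k*n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i*n + j]       = c00;  C[i*n + j + 1]     = c01;
            C[(i+1)*n + j]   = c10;  C[(i+1)*n + j + 1] = c11;
        }
}
```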
Exploitation of Multilevel Parallelism on Structured Linear Systems
TLDR
It is proved that it is necessary for software designers and programmers to have a profound knowledge of the architecture and programming tools of present computers in order to get a good exploitation of their resources.
A framework for high‐performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low‐level kernels
TLDR
Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm, and the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high‐performance code is exposed.

References

Showing 1-10 of 20 references
Impact of Hierarchical Memory Systems On Linear Algebra Algorithm Design
TLDR
A methodology is proposed that facilitates analysis of the behavior of the matrix-matrix primitives and the resulting block algorithms as a function of certain system parameters to identify the limits of performance improvement possible via blocking and any contradictory trends that require trade-off consideration.
Hierarchical blocking and data flow analysis for numerical linear algebra
TLDR
It is shown that data flow direction and leading dimensions are crucial factors in optimizing linear algebra programs, and a novel blocking strategy called hierarchical blocking, combined with data flow analysis, is proposed.
The cache performance and optimizations of blocked algorithms
TLDR
It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.
Parallel Algorithms for Dense Linear Algebra Computations
TLDR
The purpose is to review the current status and to provide an overall perspective of parallel algorithms for solving dense, banded, or block-structured problems arising in the major areas of direct solution of linear systems, least squares computations, eigenvalue and singular value computation, and rapid elliptic solvers.
Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine
TLDR
This paper examines common implementations of linear algebra algorithms, such as matrix-vector multiplication, matrix-matrix multiplication and the solution of linear equations for efficiency on a computer architecture which uses vector processing and has pipelined instruction execution.
Organizing matrices and matrix operations for paged memory systems
TLDR
It is shown that carefully designed matrix algorithms can lead to enormous savings in the number of page faults occurring when only a small part of the total matrix can be in main memory at one time.
Compiler blockability of numerical algorithms
TLDR
An attempt was made to determine whether a compiler can automatically restructure computations well enough to avoid the need for hand blocking, and it was shown that knowledge about which operations commute can enable a compiler to succeed in blocking codes that could not be blocked by any compiler based strictly on dependence analysis.
LAPACK Working Note No. 28: The IBM RISC System/6000 and Linear Algebra Operations
TLDR
The performance of blocked algorithms commonly used in solving problems in numerical linear algebra on the IBM RISC System/6000 workstation is described, and the techniques used in achieving high performance on such an architecture are discussed.
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts
TLDR
Preliminary experimental results demonstrate that, because of the sensitivity of cache conflicts to small changes in problem size and base addresses, selective copying can lead to better overall performance than either no copying, complete copying, or copying based on manually applied heuristics.
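A minimal sketch of the data-copying idea, under assumed parameters: a strided block of a large matrix is copied into a small contiguous buffer before it is reused many times, so its cache footprint no longer depends on the leading dimension and self-interference misses are avoided. Whether copying pays off is exactly the selective decision the reference addresses; here only the copy step is shown.

```c
/* Sketch of data precopying: gather a strided block of A into a contiguous
 * buffer before repeated reuse.  BS is an assumed cache-level block size;
 * rows and cols must not exceed BS.  lda is the leading dimension of A. */
#include <stddef.h>

#define BS 64   /* assumed block size */

void copy_block(size_t lda, const double *A, size_t i0, size_t j0,
                size_t rows, size_t cols, double buf[BS][BS])
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            buf[i][j] = A[(i0 + i) * lda + (j0 + j)];  /* strided -> contiguous */
}
```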
The Design of the DEC 3000 AXP Systems, Two High-performance Workstations
A family of high-performance 64-bit RISC workstations and servers based on the new Digital Alpha AXP architecture is described. The hardware implementation uses the powerful new DECchip 21064 CPU and