A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

Authors: Alfredo Buttari, Julien Langou, Jakub Kurzak, Jack J. Dongarra
Journal: Parallel Comput.

Towards a Parallel Tile LDL Factorization for Multicore Architectures
An algorithm is presented that computes the LDLt factorization of symmetric indefinite matrices without taking pivoting into consideration; it allows out-of-order execution of tasks, which removes the intrinsically sequential nature of the factorization.
Tiled Algorithms for Matrix Computations on Multicore Architectures
This thesis studies tiled algorithms in a multi/many-core setting and provides new algorithms that exploit the current architecture to improve performance relative to current state-of-the-art libraries while maintaining the stability and robustness of these libraries.
The Parallel Tiled WZ Factorization Algorithm for Multicore Architectures
The computational performance and speedup of the parallel tiled WZ factorization algorithm on shared-memory multicore architectures for dense, square, diagonally dominant matrices are reported and compared with those of the corresponding LU factorization from a vendor-implemented LAPACK library.
Strategies to optimize the LU factorization algorithm on multicore computers
This study presents complex strategies that merge two levels of parallelism with asynchronous scheduling; their results match the state of the art in the field and in some cases exceed it.
Multifrontal QR Factorization for Multicore Architectures over Runtime Systems
This paper evaluates the usability of runtime systems for complex applications, namely, sparse matrix multifrontal factorizations which constitute extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption.
Parallel Computation of Echelon Forms
This work proposes efficient parallel algorithms and implementations on shared memory architectures of LU factorization over a finite field and compares several block decompositions: tile iterative with left-looking, right-looking and Crout variants, slab and tile recursive.
Scheduling two-sided transformations using tile algorithms on multicore architectures
Three different scheduler implementations for the two-sided linear algebra transformations are described, in the context of multicore architectures, in particular the Hessenberg and Bidiagonal reductions which are the first steps for the standard eigenvalue problems and the singular value decompositions respectively.
Scheduling Linear Algebra Operations on Multicore Processors
Two emerging approaches to implementing coarse-grain dataflow are examined: the model of nested parallelism, represented by the Cilk framework, and the model of parallelism expressed through an arbitrary Directed Acyclic Graph, represented by the SMP Superscalar framework.
Design of a Multicore Sparse Cholesky Factorization Using DAGs
This work considers the solution of sparse symmetric positive-definite linear systems by Cholesky factorization and designs a new efficient and portable solver, HSL_MA87, which performs well, particularly in the case of very large problems.
Some issues in dense linear algebra for multicore and special purpose architectures
We address some key issues in designing dense linear algebra (DLA) algorithms that are common for both multi/many-cores and special purpose architectures (in particular GPUs). We present them in the


Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures
This paper examines the scalable parallel implementation of the QR factorization of a general matrix, targeting SMP and multi-core architectures, and shows that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks.
Implementing Linear Algebra Routines on Multi-core Processors with Pipelining and a Look Ahead
A pipelined model of parallel execution is presented, and the idea of look ahead is utilized in order to suppress the negative effects of sequential formulation of the algorithms.
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
It is argued that traditional implementations of dense linear algebra matrix operations on SMP architectures cannot be easily modified to render high performance as well as scalability on these architectures, and the solution is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data.
Vector and parallel algorithms for Cholesky factorization on IBM 3090
  • R. Agarwal, F. Gustavson
  • Computer Science
    Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89)
  • 1989
Various blocking schemes are described for vector and parallel implementation on the 3090 VF, and some of these algorithms have been included in the Engineering and Scientific Subroutine Library (ESSL).
Parallel Algorithms for Dense Linear Algebra Computations
The purpose is to review the current status and to provide an overall perspective of parallel algorithms for solving dense, banded, or block-structured problems arising in the major areas of direct solution of linear systems, least squares computations, eigenvalue and singular value computation, and rapid elliptic solvers.
New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems
A hybrid recursive algorithm is presented that outperforms the LAPACK algorithm DGEQRF by 78% to 21% as m=n increases from 100 to 1000, together with an automatic variable blocking that allows a level 2 part in a standard block algorithm to be replaced by level 3 operations.
Parallel out-of-core computation and updating of the QR factorization
This article discusses the high-performance parallel implementation of the computation and updating of QR factorizations of dense matrices, including problems large enough to require out-of-core
Applying recursion to serial and parallel QR factorization leads to better performance
A hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by about 20% for large square matrices and up to almost a factor of 3 for tall thin matrices is introduced.
Minimal Data Copy for Dense Linear Algebra Factorization
A new result is described showing that representing a matrix A as a collection of square blocks reduces the amount of data reformatting required by dense linear algebra factorization algorithms from O(n^3) to O(n^2).
QR Factorization for the CELL Processor
It is demonstrated how the potential of the CELL processor can be utilized to the fullest by employing the new algorithmic approach and successfully exploiting the capabilities of the CELL processor in terms of Instruction-Level Parallelism and Thread-Level Parallelism.