Since the advent of high-performance distributed-memory parallel computing, the need for intelligible code has become ever greater. The development and maintenance of libraries for these architectures is simply too complex to be amenable to conventional approaches to implementation. Attempts to employ traditional methodology have led, in our opinion, to the …
We present two new algorithms, ADT and MDT, for solving order-$n$ Toeplitz systems of linear equations $Tz = b$ in time $O(n \log^2 n)$ and space $O(n)$. The fastest algorithms previously known, such as Trench's algorithm, require time $\Omega(n^2)$ and require that all principal submatrices of $T$ be nonsingular. Our algorithm ADT requires only that $T$ be nonsingular. Both …
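For contrast with ADT and MDT, here is a minimal sketch of the classical $O(n^2)$ Levinson recursion, the kind of baseline the abstract compares against. It is not ADT or MDT, and it assumes a symmetric positive definite Toeplitz matrix given by its first column `c`, a narrower case than the general systems in the abstract. Like Trench's algorithm, it relies on every leading principal submatrix being nonsingular, which holds for SPD matrices. The name `levinson_solve` is hypothetical.

```python
from typing import List


def levinson_solve(c: List[float], b: List[float]) -> List[float]:
    """Solve T x = b, where T is symmetric positive definite Toeplitz with first column c.

    Classical Levinson recursion: O(n^2) time, O(n) extra space. Every leading
    principal submatrix of T must be nonsingular (true when T is SPD).
    """
    n = len(c)
    if n == 0 or len(b) != n:
        raise ValueError("c and b must be non-empty and of equal length")
    c0 = c[0]
    r = [ci / c0 for ci in c[1:]]   # normalized off-diagonals r_1 .. r_{n-1}
    rhs = [bi / c0 for bi in b]     # T x = b  <=>  (T / c0) x = b / c0
    x = [rhs[0]]                    # solution of the order-1 system
    if n == 1:
        return x
    y = [-r[0]]                     # solution of the order-1 Yule-Walker system
    alpha, beta = -r[0], 1.0
    for k in range(1, n):
        beta *= 1.0 - alpha * alpha
        # Extend the order-k solution x to order k + 1.
        mu = (rhs[k] - sum(r[i] * x[k - 1 - i] for i in range(k))) / beta
        x = [x[i] + mu * y[k - 1 - i] for i in range(k)] + [mu]
        if k < n - 1:
            # Extend the Yule-Walker solution y for the next step.
            alpha = (-r[k] - sum(r[i] * y[k - 1 - i] for i in range(k))) / beta
            y = [y[i] + alpha * y[k - 1 - i] for i in range(k)] + [alpha]
    return x
```

For example, `levinson_solve([4.0, 1.0, 0.5], [1.0, 2.0, 3.0])` solves the 3-by-3 system whose matrix has 4 on the diagonal, 1 on the first off-diagonals, and 0.5 in the corners. Each of the $n$ steps costs $O(k)$, which is where the quadratic total comes from.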
Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize and adapt to new architecture features. This article …
SOLAR is a portable high-performance library for out-of-core dense matrix computations. It combines portability with high performance by using existing high-performance in-core subroutine libraries and by using an optimized matrix input-output library. SOLAR works on parallel computers, workstations, and personal computers. It supports in-core computations …
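The idea of building an out-of-core routine on top of an in-core kernel can be sketched as a tiled matrix multiply. The matrices live in files, and only one tile of each operand plus one accumulator tile is explicitly copied into memory at a time. This is not SOLAR's interface. The function name `out_of_core_matmul`, the raw row-major float64 file layout, and the use of NumPy's `memmap` as a stand-in for an optimized I/O library are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np


def out_of_core_matmul(a_path: str, b_path: str, c_path: str, n: int, tile: int) -> None:
    """Compute C = A @ B for n-by-n float64 matrices stored row-major in raw files.

    Tiles of A and B are read from disk, multiplied with NumPy's in-core matmul,
    and the finished tile of C is written back.
    """
    A = np.memmap(a_path, dtype=np.float64, mode="r", shape=(n, n))
    B = np.memmap(b_path, dtype=np.float64, mode="r", shape=(n, n))
    C = np.memmap(c_path, dtype=np.float64, mode="w+", shape=(n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            acc = np.zeros((min(tile, n - i), min(tile, n - j)))
            for k in range(0, n, tile):
                a_tile = np.array(A[i:i + tile, k:k + tile])  # copy tile into memory
                b_tile = np.array(B[k:k + tile, j:j + tile])
                acc += a_tile @ b_tile                         # in-core kernel
            C[i:i + tile, j:j + tile] = acc                    # write result tile
    C.flush()


if __name__ == "__main__":
    # Small demo: write two random matrices to a temporary directory and multiply.
    with tempfile.TemporaryDirectory() as d:
        n = 10
        rng = np.random.default_rng(0)
        a, b = rng.random((n, n)), rng.random((n, n))
        paths = [os.path.join(d, name) for name in ("a.bin", "b.bin", "c.bin")]
        a.tofile(paths[0])
        b.tofile(paths[1])
        out_of_core_matmul(*paths, n=n, tile=4)
        c = np.fromfile(paths[2]).reshape(n, n)
        print("max error:", np.abs(c - a @ b).max())
```

The tile size plays the role of the in-core memory budget. Larger tiles mean fewer disk reads per flop but more memory.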
Techniques and algorithms for efficient in-place conversion to and from standard and blocked matrix storage formats are described. Such functionality is required by numerical libraries that use different data layouts internally. Parallel algorithms and a software package for in-place matrix storage format conversion based on in-place matrix transposition …
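In-place transposition, the building block named above, can be illustrated with a simple serial cycle-following sketch. It is not the paper's parallel algorithm. In a row-major $m \times n$ array, the element at flat index $k = in + j$ belongs at index $jm + i$ of the $n \times m$ transpose, and $jm + i \equiv km \pmod{mn - 1}$ for every index except the last, which is fixed. The sketch follows each cycle of that permutation once, starting from its smallest index, so it needs only $O(1)$ extra storage at the cost of extra index arithmetic. The function name `transpose_in_place` is hypothetical.

```python
from typing import List


def transpose_in_place(a: List[float], m: int, n: int) -> List[float]:
    """Transpose an m-by-n row-major matrix stored in the flat list `a`, in place.

    Afterwards `a` holds the n-by-m transpose in row-major order. Returns `a`.
    """
    mod = m * n - 1  # index k moves to (k * m) % mod; indices 0 and mod stay fixed
    for start in range(1, mod):
        # Walk the cycle containing `start`; handle it only if `start` is its minimum.
        nxt = (start * m) % mod
        while nxt > start:
            nxt = (nxt * m) % mod
        if nxt < start:
            continue  # this cycle was already rotated from a smaller index
        # Rotate the values one step along the cycle.
        val, pos = a[start], start
        while True:
            dest = (pos * m) % mod
            a[dest], val = val, a[dest]
            pos = dest
            if pos == start:
                break
    return a
```

For example, the 2-by-3 matrix `[0, 1, 2, 3, 4, 5]` becomes `[0, 3, 1, 4, 2, 5]`, its 3-by-2 transpose. Production converters typically trade some of this index arithmetic for a small amount of workspace.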
Applying recursion to serial and parallel QR factorization leads to better performance. We present new recursive serial and parallel algorithms for QR factorization of an m by n matrix that improve performance. The recursion leads to automatic variable blocking, and it also replaces a Level 2 part in a standard block algorithm with Level 3 operations. …
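The "automatic variable blocking" can be made concrete by recording which trailing updates a column-halving recursion performs. This sketch only lists the schedule and does no arithmetic. Each entry `(col, n1, n2)` stands for applying the transformation from `n1` just-factored columns to the next `n2` columns, a matrix-matrix (Level 3) update whose block width comes from the recursion rather than from a fixed tuning parameter. The split `n1 = n // 2` and the name `recursive_schedule` are assumptions, not taken from the paper.

```python
from typing import List, Tuple


def recursive_schedule(col: int, n: int, out: List[Tuple[int, int, int]]) -> None:
    """Append the trailing updates made by recursively halving n columns starting at col.

    (col, n1, n2): after factoring columns col .. col+n1-1, update the next n2 columns.
    """
    if n <= 1:
        return  # a single column is factored directly; no trailing update
    n1 = n // 2
    n2 = n - n1
    recursive_schedule(col, n1, out)       # factor the left half
    out.append((col, n1, n2))              # Level 3 update of the right half
    recursive_schedule(col + n1, n2, out)  # factor the updated right half
```

For 8 columns the update widths come out as 1, 2, 1, 4, 1, 2, 1. The largest update, of width 4, dominates the flop count, which is why most of the work lands in Level 3 operations.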
Let A and B be two sparse matrices of orders p by q and q by r. Their product $C = AB$ requires $N$ nontrivial multiplications, where $0 \le N \le pqr$. The operation count of our algorithm is usually proportional to $N$; however, its worst case is $O(p, r, N_A, N)$, where $N_A$ is the number of elements in A. This algorithm can be used to assemble the sparse matrix …
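A row-by-row sparse product with a dense accumulator shows why the cost tracks $N$. Every inner-loop iteration performs exactly one nontrivial multiplication. The only other work is the $O(r)$ setup of the accumulator and marker arrays and the $O(p)$ loop over rows, which matches the $p$ and $r$ terms in the worst-case bound. This is a sketch under assumptions: compressed sparse row (CSR) storage as `(indptr, indices, data)` and the name `csr_matmul` are illustrative choices, not the paper's exact formulation.

```python
from typing import List, Tuple

# CSR matrix: (indptr, indices, data); row i occupies positions indptr[i] .. indptr[i+1]-1.
CSR = Tuple[List[int], List[int], List[float]]


def csr_matmul(a: CSR, b: CSR, r: int) -> CSR:
    """Return C = A @ B in CSR form, where B (and hence C) has r columns.

    Column indices within each row of C appear in the order they are first produced.
    """
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    p = len(a_ptr) - 1
    acc = [0.0] * r      # dense accumulator for one row of C
    marker = [-1] * r    # marker[j] == i  <=>  column j already touched in row i
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(p):
        cols = []
        for ka in range(a_ptr[i], a_ptr[i + 1]):
            k, aik = a_idx[ka], a_val[ka]
            for kb in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[kb]
                if marker[j] != i:   # first contribution to C[i, j]
                    marker[j] = i
                    acc[j] = 0.0
                    cols.append(j)
                acc[j] += aik * b_val[kb]  # one nontrivial multiplication
        for j in cols:
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

The marker array avoids clearing the accumulator for every row, so no per-row $O(r)$ cost arises. If sorted column indices are needed, they can be restored afterwards, for instance by transposing the result twice.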
We present a new recursive algorithm for the QR factorization of an m by n matrix A. The recursion leads to an automatic variable blocking that allows us to replace a level 2 part in a standard block algorithm by level 3 operations. However, there are some additional costs for performing the updates, which prohibit the efficient use of the recursion for large …
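A recursive Householder QR in compact WY form, $Q = I - YTY^T$, can be sketched in NumPy as follows. It is written in the spirit of the abstract and is not the authors' code. The left half of the columns is factored recursively, its transformation is applied to the right half with matrix-matrix products, and the updated lower-right block is factored recursively. In this sketch, one visible source of extra work is the off-diagonal block `T12` of $T$, which must be formed at every level of the recursion. The names `recursive_qr` and `_householder` are hypothetical, and $m \ge n$ is assumed.

```python
import numpy as np


def _householder(x):
    """Return (v, tau, beta) with (I - tau v v^T) x = beta e_1 and v[0] = 1."""
    alpha = x[0]
    xnorm = np.linalg.norm(x[1:])
    v = np.zeros_like(x)
    v[0] = 1.0
    if xnorm == 0.0:
        return v, 0.0, alpha  # nothing to annihilate: H = I
    beta = -np.copysign(np.hypot(alpha, xnorm), alpha)
    v[1:] = x[1:] / (alpha - beta)
    return v, (beta - alpha) / beta, beta


def recursive_qr(A):
    """Recursive Householder QR of an m-by-n matrix A with m >= n.

    Returns (Y, R, T) such that Q = I - Y T Y^T is orthogonal, Q^T A = [R; 0],
    and both R and T are upper triangular.
    """
    A = np.asarray(A, dtype=np.float64)
    m, n = A.shape
    if m < n:
        raise ValueError("this sketch assumes m >= n")
    if n == 1:
        v, tau, beta = _householder(A[:, 0])
        return v.reshape(m, 1), np.array([[beta]]), np.array([[tau]])
    n1 = n // 2
    n2 = n - n1
    Y1, R11, T1 = recursive_qr(A[:, :n1])              # factor left columns
    # Apply Q1^T = I - Y1 T1^T Y1^T to the right columns (Level 3 products).
    B = A[:, n1:] - Y1 @ (T1.T @ (Y1.T @ A[:, n1:]))
    Y2, R22, T2 = recursive_qr(B[n1:, :])              # factor updated block
    Y2 = np.vstack([np.zeros((n1, n2)), Y2])           # pad to m rows
    T12 = -T1 @ (Y1.T @ Y2) @ T2                       # extra work of the recursion
    Y = np.hstack([Y1, Y2])
    T = np.block([[T1, T12], [np.zeros((n2, n1)), T2]])
    R = np.block([[R11, B[:n1, :]], [np.zeros((n2, n1)), R22]])
    return Y, R, T
```

To recover the factorization, form `Q = np.eye(m) - Y @ T @ Y.T`; then `Q[:, :n] @ R` reproduces `A`. A hybrid variant would stop the recursion at a chosen panel width and switch to a non-recursive kernel, which limits how large `T` grows.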
A three-dimensional (3D) matrix multiplication algorithm for massively parallel processing systems is presented. The $P$ processors are configured as a "virtual" processing cube with dimensions $p_1$, $p_2$, and $p_3$ proportional to the matrices' dimensions $M$, $N$, and $K$. Each processor performs a single local matrix multiplication of size $M/p_1 \times N/p_2 \times K/p_3$. Before …
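The work assignment on the virtual cube can be simulated serially. Virtual processor $(i, j, l)$ multiplies an $(M/p_1) \times (K/p_3)$ block of $A$ by a $(K/p_3) \times (N/p_2)$ block of $B$, and summing the partial results over $l$ models a reduction along the third cube dimension. This minimal sketch assumes that each $p_i$ divides the corresponding matrix dimension. It models only the arithmetic, not the data distribution or communication phases. The name `cube_matmul` is hypothetical.

```python
import numpy as np


def cube_matmul(A, B, p1, p2, p3):
    """Simulate C = A @ B (A is M-by-K, B is K-by-N) on a virtual p1 x p2 x p3 cube."""
    M, K = A.shape
    K_b, N = B.shape
    if K != K_b:
        raise ValueError("inner dimensions must agree")
    if M % p1 or N % p2 or K % p3:
        raise ValueError("this sketch assumes p1 | M, p2 | N and p3 | K")
    mb, nb, kb = M // p1, N // p2, K // p3
    C = np.zeros((M, N))
    for i in range(p1):
        for j in range(p2):
            # One local (mb x kb) @ (kb x nb) multiply per virtual processor (i, j, l).
            partial = [
                A[i * mb:(i + 1) * mb, l * kb:(l + 1) * kb]
                @ B[l * kb:(l + 1) * kb, j * nb:(j + 1) * nb]
                for l in range(p3)
            ]
            # Reduction along the third cube dimension gives block (i, j) of C.
            C[i * mb:(i + 1) * mb, j * nb:(j + 1) * nb] = sum(partial)
    return C
```

With $p_1 p_2 p_3 = P$, every virtual processor does the same amount of local work, $(M/p_1)(N/p_2)(K/p_3)$ multiply-adds.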
Recursive blocked data formats and recursive blocked BLAS are introduced and applied to dense linear algebra algorithms typified by LAPACK. The new data formats allow for maintaining data locality at every level of the memory hierarchy and hence for providing high performance on today's memory-tiered processors. This new data format is hybrid. It …
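A hybrid recursive layout can be sketched by ordering $nb \times nb$ blocks recursively by quadrant and storing each block contiguously in a conventional format. With this ordering, any aligned submatrix produced by the recursion occupies one contiguous range of memory. The quadrant order (top-left, top-right, bottom-left, bottom-right), the row-major storage inside each block, and the requirement that $n / nb$ be a power of two are assumptions for illustration. The formats in the paper may order blocks differently. The name `to_recursive_blocked` is hypothetical.

```python
from typing import List


def to_recursive_blocked(a: List[float], n: int, nb: int) -> List[float]:
    """Copy an n-by-n row-major matrix into a recursive block layout.

    Blocks of size nb x nb are visited in recursive quadrant order; each block is
    stored contiguously in row-major order. Requires n / nb to be a power of two.
    """
    q = n // nb
    if n % nb or q & (q - 1):
        raise ValueError("n must equal nb times a power of two")
    out: List[float] = []

    def visit(r0: int, c0: int, size: int) -> None:
        if size == nb:
            # Leaf: emit one block in standard (row-major) format.
            for i in range(r0, r0 + nb):
                out.extend(a[i * n + c0: i * n + c0 + nb])
            return
        h = size // 2
        visit(r0, c0, h)          # top-left quadrant
        visit(r0, c0 + h, h)      # top-right quadrant
        visit(r0 + h, c0, h)      # bottom-left quadrant
        visit(r0 + h, c0 + h, h)  # bottom-right quadrant

    visit(0, 0, n)
    return out
```

For a 4-by-4 matrix holding `0 .. 15` in row-major order with `nb = 2`, the four 2-by-2 blocks come out as `[0, 1, 4, 5]`, `[2, 3, 6, 7]`, `[8, 9, 12, 13]`, and `[10, 11, 14, 15]`, each ready to be passed to a conventional in-core kernel.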