
Since the advent of high-performance distributed-memory parallel computing, the need for intelligible code has become ever greater. The development and maintenance of libraries for these architectures is simply too complex to be amenable to conventional approaches to implementation. Attempts to employ traditional methodology have led, in our opinion, to the…

We present two new algorithms, ADT and MDT, for solving order-$n$ Toeplitz systems of linear equations $Tz = b$ in time $O(n \log^2 n)$ and space $O(n)$. The fastest algorithms previously known, such as Trench's algorithm, require time $\Omega(n^2)$ and require that all principal submatrices of $T$ be nonsingular. Our algorithm ADT requires only that $T$ be nonsingular. Both…

Techniques and algorithms for efficient in-place conversion to and from standard and blocked matrix storage formats are described. Such functionality is required by numerical libraries that use different data layouts internally. Parallel algorithms and a software package for in-place matrix storage format conversion based on in-place matrix transposition…
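To make the notion of in-place transposition concrete, here is a minimal sketch of the classical cycle-following approach for a rectangular matrix held in a flat row-major list. It is not the algorithm of the paper above: the function name `transpose_in_place` is hypothetical, and the `visited` list is a simplification that costs extra memory, which a production in-place routine would avoid.

```python
def transpose_in_place(a, m, n):
    """Transpose a flat row-major m x n matrix `a` into row-major n x m order.

    Illustrative cycle-following sketch. The `visited` bookkeeping is a
    simplifying assumption; it uses O(m*n) extra flags.
    """
    size = m * n
    if size < 2:
        return
    last = size - 1
    visited = [False] * size
    # Indices 0 and `last` never move; every other index k maps to (k*m) % last.
    for start in range(1, last):
        if visited[start]:
            continue
        k = start
        val = a[start]
        while True:
            dest = (k * m) % last
            # Drop the carried value at its destination, pick up what was there.
            a[dest], val = val, a[dest]
            visited[dest] = True
            k = dest
            if k == start:
                break


# Usage: a 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored row by row.
flat = [1, 2, 3, 4, 5, 6]
transpose_in_place(flat, 2, 3)
print(flat)  # [1, 4, 2, 5, 3, 6], i.e. the rows [1, 4], [2, 5], [3, 6]
```

The index map follows from the row-major layout: the element at position `i*n + j` must end up at position `j*m + i`, which equals `(k*m) % (m*n - 1)` for `k = i*n + j`.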

Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize and adapt to new architecture features. This article…

SOLAR is a portable high-performance library for out-of-core dense matrix computations. It combines portability with high performance by using existing high-performance in-core subroutine libraries and by using an optimized matrix input-output library. SOLAR works on parallel computers, workstations, and personal computers. It supports in-core computations…

Applying recursion to serial and parallel QR factorization leads to better performance. We present new recursive serial and parallel algorithms for QR factorization of an m by n matrix. They improve performance. The recursion leads to an automatic variable blocking, and it also replaces a Level 2 part in a standard block algorithm with Level 3 operations.…

We present a new recursive algorithm for the QR factorization of an m by n matrix A. The recursion leads to an automatic variable blocking that allows us to replace a level 2 part in a standard block algorithm by level 3 operations. However, there are some additional costs for performing the updates, which prohibit the efficient use of the recursion for large…
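The recursive column splitting described in the two QR abstracts above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the authors' implementation: it assumes m ≥ n, stores Q in the compact WY form Q = I − Y T Yᵀ, and all function names (`matmul`, `transpose`, `house`, `recursive_qr`, `form_q`) are hypothetical. The matrix products in the recursive step stand in for the level 3 operations the abstracts mention.

```python
import math


def matmul(A, B):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def house(x):
    """Return (v, tau, beta) with v[0] == 1 and (I - tau v v^T) x = beta e_1."""
    alpha = x[0]
    sigma = math.sqrt(sum(t * t for t in x[1:]))
    if sigma == 0.0:
        return [1.0] + [0.0] * (len(x) - 1), 0.0, alpha
    beta = -math.copysign(math.hypot(alpha, sigma), alpha)
    v = [1.0] + [t / (alpha - beta) for t in x[1:]]
    return v, (beta - alpha) / beta, beta


def recursive_qr(A):
    """Return (Y, T, R) such that A = (I - Y T Y^T) [R; 0], assuming m >= n."""
    n = len(A[0])
    if n == 1:
        # Base case: a single column is reduced by one Householder reflector.
        v, tau, beta = house([row[0] for row in A])
        return [[t] for t in v], [[tau]], [[beta]]
    n1, n2 = n // 2, n - n // 2
    # Factor the left half of the columns.
    Y1, T1, R11 = recursive_qr([row[:n1] for row in A])
    # Update the right half: B = Q1^T A2 = A2 - Y1 T1^T (Y1^T A2).
    A2 = [row[n1:] for row in A]
    YW = matmul(Y1, matmul(transpose(T1), matmul(transpose(Y1), A2)))
    B = [[a - c for a, c in zip(ra, rc)] for ra, rc in zip(A2, YW)]
    # Factor the trailing block below the first n1 rows.
    Y2_sub, T2, R22 = recursive_qr(B[n1:])
    Y2 = [[0.0] * n2 for _ in range(n1)] + Y2_sub
    # Merge the two compact WY factors: T12 = -T1 (Y1^T Y2) T2.
    T12 = matmul(matmul(T1, matmul(transpose(Y1), Y2)), T2)
    Y = [r1 + r2 for r1, r2 in zip(Y1, Y2)]
    T = [r1 + [-t for t in r12] for r1, r12 in zip(T1, T12)]
    T += [[0.0] * n1 + r for r in T2]
    R = [r11 + r12 for r11, r12 in zip(R11, B[:n1])]
    R += [[0.0] * n1 + r for r in R22]
    return Y, T, R


def form_q(Y, T):
    """Build the explicit m x m matrix Q = I - Y T Y^T (for checking only)."""
    m = len(Y)
    YTYt = matmul(Y, matmul(T, transpose(Y)))
    return [[(1.0 if i == j else 0.0) - YTYt[i][j] for j in range(m)] for i in range(m)]
```

A quick check is to call `Y, T, R = recursive_qr(A)` on a small matrix, form `Q = form_q(Y, T)`, and confirm that `Q` times `R` (padded with m − n zero rows) reproduces `A` to rounding error.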

A three-dimensional (3D) matrix multiplication algorithm for massively parallel processing systems is presented. The $P$ processors are configured as a "virtual" processing cube with dimensions $p_1$, $p_2$, and $p_3$ proportional to the matrices' dimensions $M$, $N$, and $K$. Each processor performs a single local matrix multiplication of size $M/p_1 \times N/p_2 \times K/p_3$. Before…
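The decomposition in this abstract can be illustrated with a sequential simulation of the processor cube. The sketch below assumes A is M × K and B is K × N (the abstract does not state which matrix carries which dimensions), and the names `block_ranges` and `matmul_3d` are hypothetical. It models only the local products and the final summation along the third cube axis; the communication steps of the actual parallel algorithm are not modeled.

```python
def block_ranges(size, parts):
    """Split range(size) into `parts` contiguous, nearly equal (start, stop) slices."""
    base, extra = divmod(size, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def matmul_3d(A, B, p1, p2, p3):
    """Compute C = A @ B by looping over a virtual p1 x p2 x p3 processor cube."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0, i1 in block_ranges(M, p1):
        for j0, j1 in block_ranges(N, p2):
            for k0, k1 in block_ranges(K, p3):
                # Work of one virtual processor: a local product of size
                # roughly (M/p1) x (N/p2) x (K/p3).
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        partial = 0.0
                        for k in range(k0, k1):
                            partial += A[i][k] * B[k][j]
                        # Summing partial results over the third cube axis
                        # stands in for the parallel reduction.
                        C[i][j] += partial
    return C


# Usage: a 4 x 6 by 6 x 2 product on a 2 x 1 x 3 virtual cube.
A = [[i + j for j in range(6)] for i in range(4)]
B = [[i * j + 1 for j in range(2)] for i in range(6)]
C = matmul_3d(A, B, 2, 1, 3)
```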

Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm: each division step creates sub-problems of smaller size, and when the working set of a sub-problem…
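As an illustration of the divide-and-conquer structure described above, here is a minimal sketch of a recursive matrix multiplication that halves the largest dimension at each step. The choice of matrix multiplication as the example, the function names, and the `cutoff` parameter are assumptions made for illustration; they are not taken from the paper. The cutoff is a small fixed constant chosen without reference to any cache size, which is what keeps the recursion oblivious to the memory hierarchy.

```python
def matmul_recursive(A, B, C, rows, cols, inner, cutoff=2):
    """Accumulate A[rows, inner] @ B[inner, cols] into C[rows, cols]."""
    (i0, i1), (j0, j1), (k0, k1) = rows, cols, inner
    di, dj, dk = i1 - i0, j1 - j0, k1 - k0
    if max(di, dj, dk) <= cutoff:
        # Base case: the sub-problem is tiny, so multiply directly.
        for i in range(i0, i1):
            for j in range(j0, j1):
                for k in range(k0, k1):
                    C[i][j] += A[i][k] * B[k][j]
        return
    # Division step: halve whichever dimension is currently largest.
    if di >= dj and di >= dk:
        mid = i0 + di // 2
        matmul_recursive(A, B, C, (i0, mid), cols, inner, cutoff)
        matmul_recursive(A, B, C, (mid, i1), cols, inner, cutoff)
    elif dj >= dk:
        mid = j0 + dj // 2
        matmul_recursive(A, B, C, rows, (j0, mid), inner, cutoff)
        matmul_recursive(A, B, C, rows, (mid, j1), inner, cutoff)
    else:
        mid = k0 + dk // 2
        matmul_recursive(A, B, C, rows, cols, (k0, mid), cutoff)
        matmul_recursive(A, B, C, rows, cols, (mid, k1), cutoff)


def matmul_divide_and_conquer(A, B):
    """Return A @ B for dense matrices stored as lists of rows."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    matmul_recursive(A, B, C, (0, m), (0, n), (0, k))
    return C
```

Because every split roughly halves the data a sub-problem touches, the recursion passes through sub-problem sizes suited to each cache level without the code ever naming a block size.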