
This paper discusses both the theoretical and statistical errors obtained by various well-known dot product algorithms, from the canonical to the pairwise algorithm, and introduces a new and more general framework that we have named superblock, which subsumes them and permits a practitioner to make trade-offs between computational performance, memory usage, and error…
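
As a rough illustration of the two baseline algorithms the superblock framework generalizes, here is a minimal C sketch (not the paper's code) of the canonical left-to-right dot product next to a pairwise, recursive-halving version; the pairwise ordering is what typically shrinks the worst-case rounding error from growing with n to growing with log2(n). The superblock scheme itself, which blends the two to trade accuracy against memory and speed, is not reproduced here.

```c
/* Hypothetical sketch: canonical vs. pairwise dot products.
 * The canonical loop accumulates left to right; the pairwise version splits
 * the range recursively, which typically reduces the worst-case rounding
 * error constant from O(n*u) to O(log2(n)*u). */
#include <stdio.h>
#include <stdlib.h>

/* Canonical (left-to-right) accumulation. */
static double dot_canonical(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Pairwise (recursive halving) accumulation. */
static double dot_pairwise(const double *x, const double *y, size_t n)
{
    if (n == 0) return 0.0;
    if (n == 1) return x[0] * y[0];
    size_t h = n / 2;
    return dot_pairwise(x, y, h) + dot_pairwise(x + h, y + h, n - h);
}

int main(void)
{
    size_t n = 1u << 20;
    double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
    for (size_t i = 0; i < n; i++) { x[i] = 1.0 / (i + 1.0); y[i] = 1.0; }
    printf("canonical: %.17g\n", dot_canonical(x, y, n));
    printf("pairwise : %.17g\n", dot_pairwise(x, y, n));
    free(x); free(y);
    return 0;
}
```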

In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. The Level 3 BLAS have excellent scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than the number of processors…
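
The panel-plus-update structure described above can be sketched with a small, self-contained example. A hypothetical unpivoted blocked Cholesky factorization stands in here for the LAPACK routine, with the unblocked panel, the triangular panel solve, and the Level 3-style trailing update written as plain loops; in real LAPACK these would be DPOTF2, DTRSM, and DSYRK/DGEMM respectively.

```c
/* Hypothetical sketch of LAPACK's blocked "panel + Level 3 update" pattern,
 * illustrated with an unpivoted Cholesky factorization A = L*L^T (lower). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define A(i,j) a[(size_t)(i)*n + (j)]   /* row-major indexing helper */

/* Unblocked Cholesky of the nb-by-nb diagonal block at (k,k) (cf. DPOTF2). */
static void panel_factor(double *a, int n, int k, int nb)
{
    for (int j = k; j < k + nb; j++) {
        for (int p = k; p < j; p++)
            A(j,j) -= A(j,p) * A(j,p);
        A(j,j) = sqrt(A(j,j));
        for (int i = j + 1; i < k + nb; i++) {
            for (int p = k; p < j; p++)
                A(i,j) -= A(i,p) * A(j,p);
            A(i,j) /= A(j,j);
        }
    }
}

/* Triangular solve of the panel below the diagonal block (cf. DTRSM). */
static void panel_solve(double *a, int n, int k, int nb)
{
    for (int i = k + nb; i < n; i++)
        for (int j = k; j < k + nb; j++) {
            for (int p = k; p < j; p++)
                A(i,j) -= A(i,p) * A(j,p);
            A(i,j) /= A(j,j);
        }
}

/* Level-3-style trailing matrix update (cf. DSYRK/DGEMM). */
static void trailing_update(double *a, int n, int k, int nb)
{
    for (int i = k + nb; i < n; i++)
        for (int j = k + nb; j <= i; j++)
            for (int p = k; p < k + nb; p++)
                A(i,j) -= A(i,p) * A(j,p);
}

int main(void)
{
    int n = 512, nb = 64;
    double *a = malloc((size_t)n * n * sizeof *a);
    /* Symmetric, diagonally dominant (hence positive definite) test matrix. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A(i,j) = (i == j) ? n : 1.0;
    /* Blocked loop: bus-bound unblocked panel, then high-flop trailing update. */
    for (int k = 0; k < n; k += nb) {
        int b = (k + nb <= n) ? nb : n - k;
        panel_factor(a, n, k, b);
        panel_solve(a, n, k, b);
        trailing_update(a, n, k, b);
    }
    printf("L(0,0) = %g, L(n-1,n-1) = %g\n", A(0,0), A(n-1,n-1));
    free(a);
    return 0;
}
```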

Key computational kernels must run near their peak efficiency for most high performance computing (HPC) applications. Getting this level of efficiency has always required extensive tuning of the kernel on a particular platform of interest. The success or failure of an optimization is usually measured by invoking a timer. Understanding how to build reliable…
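
A hedged sketch of the kind of timer-based measurement the abstract refers to: time a toy kernel repeatedly, keep the best wall-clock time, and convert to MFLOPS. The kernel, the repetition count, and the use of CLOCK_MONOTONIC are illustrative choices, not the authors' methodology; cache warm-up and flushing effects, which the paper is concerned with, are deliberately ignored.

```c
/* Simple kernel-timing harness: repeat the kernel, keep the minimum time. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Toy kernel: y += alpha * x (DAXPY-like), 2*n flops per call. */
static void axpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    size_t n = 1u << 22;
    double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
    for (size_t i = 0; i < n; i++) { x[i] = 1.0; y[i] = 0.0; }

    double best = 1e30;
    for (int rep = 0; rep < 10; rep++) {     /* repeat and keep the minimum */
        double t0 = wall_seconds();
        axpy(n, 0.5, x, y);
        double t1 = wall_seconds();
        if (t1 - t0 < best) best = t1 - t0;
    }
    printf("y[0] = %g, best time %.6f s, %.1f MFLOPS\n",
           y[0], best, 2.0 * n / best / 1e6);
    free(x); free(y);
    return 0;
}
```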

- R Clint Whaley, Anthony Chronopoulos, Carola Wenk, Hugh Maynard, Lucio Tavernini, Anthony M Castaldo +3 others
- 2010

Acknowledgements: I want to thank my advisor, Dr. R. Clint Whaley, for an enormous amount of work on my behalf, and for his excellent academic advice. I want to thank my wife for her support and enthusiasm for my academic pursuits. The following two paragraphs are required text by UTSA: "This Masters Thesis/Recital Document or Doctoral Dissertation was…

- R Clint Whaley, Anthony Chronopoulos, Carola Wenk, Lucio Tavernini, Anthony M Castaldo, Anthony Michael
- 2007

This thesis discusses both the theoretical and statistical errors obtained by various dot product algorithms. A host of linear algebra methods derive their error behavior directly from the dot product. In particular, many high performance dense systems derive their performance and error behavior overwhelmingly from matrix multiply, and matrix multiply's…
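
For context, the standard worst-case forward-error bounds (in the style of Higham) that this kind of analysis builds on are shown below, where u is the unit roundoff and \(\gamma_k = ku/(1-ku)\). These are textbook bounds, not results quoted from the thesis itself.

```latex
% Canonical (left-to-right) dot product: error grows roughly linearly in n.
\[
  \bigl|\,\mathrm{fl}(x^{T} y) - x^{T} y\,\bigr|
  \;\le\; \gamma_{n} \sum_{i=1}^{n} |x_i|\,|y_i|
  \qquad \text{(canonical)}
\]
% Pairwise (recursive halving): error grows roughly logarithmically in n.
\[
  \bigl|\,\mathrm{fl}(x^{T} y) - x^{T} y\,\bigr|
  \;\le\; \gamma_{\lceil \log_2 n \rceil + 1} \sum_{i=1}^{n} |x_i|\,|y_i|
  \qquad \text{(pairwise)}
\]
```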

Using the well-known ATLAS and LAPACK dense linear algebra libraries, we demonstrate that the parallel management overhead (PMO) can grow with problem size on even statically scheduled parallel programs with minimal task interaction. Therefore, the widely held view that these thread management issues can be ignored in such computationally intensive…

Much of dense linear algebra has been successfully blocked to concentrate the majority of its time in the Level 3 BLAS, which are not only efficient for serial computation, but also scale well for parallelism. For the Hessenberg factorization, which is a critical step in computing the eigenvalues and eigenvectors, however, performance of the best known…
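
For reference, the standard LAPACK routine that performs this reduction is DGEHRD; a minimal call through the LAPACKE C interface (assuming a LAPACKE installation is available) is sketched below. The matrix contents and sizes are arbitrary illustrations, not taken from the paper.

```c
/* Minimal sketch: reduce a general matrix to upper Hessenberg form with
 * DGEHRD, the routine whose panel-bound behavior the paper analyzes.
 * Requires LAPACKE; row-major storage is used for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void)
{
    lapack_int n = 500;
    double *a   = malloc((size_t)n * n * sizeof *a);
    double *tau = malloc((size_t)(n - 1) * sizeof *tau);

    /* Arbitrary dense test matrix. */
    for (lapack_int i = 0; i < n; i++)
        for (lapack_int j = 0; j < n; j++)
            a[i * n + j] = 1.0 / (i + j + 1.0);

    /* ilo = 1, ihi = n: reduce the whole matrix. */
    lapack_int info = LAPACKE_dgehrd(LAPACK_ROW_MAJOR, n, 1, n, a, n, tau);
    printf("dgehrd info = %d, subdiagonal H(1,0) = %g\n", (int)info, a[1 * n + 0]);

    free(a); free(tau);
    return 0;
}
```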

- Anthony M Castaldo, R Clint Whaley
- 2009

In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. The Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than the number of processors…
