
This paper describes the ATLAS (Automatically Tuned Linear Algebra Software) project, as well as the fundamental principles that underlie it. ATLAS is an instantiation of a new paradigm in high performance library production and maintenance, which we term AEOS (Automated Empirical Optimization of Software); this style of library management has been created…

- R. Clint Whaley
- IMCSIT
- 2008

LAPACK (Linear Algebra PACKage) is a statically cache-blocked library, where the blocking factor (NB) is determined by the service routine ILAENV. Users are encouraged to tune NB to maximize performance on their platform/BLAS (the BLAS are LAPACK's computational engine), but in practice very few users do so (both because it is hard, and because its…
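The empirical approach this abstract argues for can be sketched as a simple sweep: time the routine at several candidate blocking factors and keep the fastest, rather than trusting the static value ILAENV returns. This is a minimal illustrative sketch; the `run_with_nb` hook and the candidate list are assumptions for illustration, not part of LAPACK's interface.

```python
import time

def tune_nb(run_with_nb, candidates=(16, 24, 32, 48, 64)):
    # Empirically sweep candidate blocking factors and keep the one
    # that runs fastest on this platform/BLAS combination.
    best_nb, best_t = None, float("inf")
    for nb in candidates:
        start = time.perf_counter()
        run_with_nb(nb)            # run the blocked routine at this NB
        elapsed = time.perf_counter() - start
        if elapsed < best_t:
            best_nb, best_t = nb, elapsed
    return best_nb
```

In practice each candidate would be timed several times on realistic problem sizes, but the structure of the search is the same.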

- R. Clint Whaley
- 1994

The BLACS (Basic Linear Algebra Communication Subprograms) project is an ongoing investigation whose purpose is to create a linear algebra oriented message passing interface that is implemented efficiently and uniformly across a large range of distributed memory platforms. The length of time required to implement efficient distributed memory algorithms makes it…

- Anthony M. Castaldo, R. Clint Whaley, Anthony T. Chronopoulos
- SIAM J. Scientific Computing
- 2008

This paper discusses both the theoretical and statistical errors obtained by various well-known dot products, from the canonical to pairwise algorithms, and introduces a new and more general framework that we have named superblock which subsumes them and permits a practitioner to make trade-offs between computational performance, memory usage, and error…
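The two endpoints the abstract compares can be sketched briefly: the canonical left-to-right dot product has worst-case rounding error that grows with n, while the pairwise (recursive halving) variant grows only with log n; the superblock framework described above generalizes between such extremes. The function names here are illustrative, not from the paper.

```python
def canonical_dot(x, y):
    # Left-to-right accumulation: worst-case rounding error grows with n.
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

def pairwise_dot(x, y):
    # Recursive halving: rounding error grows only with log2(n).
    n = len(x)
    if n <= 2:
        return canonical_dot(x, y)
    m = n // 2
    return pairwise_dot(x[:m], y[:m]) + pairwise_dot(x[m:], y[m:])
```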

- R. Clint Whaley
- Encyclopedia of Parallel Computing
- 2011

- R. Clint Whaley
- 1997

- Anthony M. Castaldo, R. Clint Whaley
- PPOPP
- 2010

In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. The Level 3 BLAS have excellent scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than the number of processors…
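The blocked structure this abstract describes can be sketched generically; `panel_factor` and `trailing_update` below are hypothetical hooks standing in for the unblocked panel routine and the Level 3 BLAS update, not actual LAPACK entry points.

```python
def blocked_loop(n, nb, panel_factor, trailing_update):
    # LAPACK-style blocked iteration: factor an nb-wide panel with an
    # unblocked (typically bus-bound) routine, then apply a high
    # performance Level 3 BLAS-like update to the trailing submatrix.
    for j in range(0, n, nb):
        jb = min(nb, n - j)            # final panel may be narrower
        panel_factor(j, jb)            # memory-bound, scales with bus speed
        if j + jb < n:
            trailing_update(j, jb)     # compute-bound, scales with cores
```

The performance asymmetry between the two hooks is exactly why panel processing becomes the bottleneck as core counts grow.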

There are a few application areas which remain almost untouched by the historical and continuing advancement of compilation research. For the extremes of optimization required for high performance computing on one end, and embedded systems at the opposite end of the spectrum, many critical routines are still hand-tuned, often directly in assembly. At the…

- R. Clint Whaley, Anthony M. Castaldo
- Softw., Pract. Exper.
- 2008

Key computational kernels must run near their peak efficiency for most high performance computing (HPC) applications. Getting this level of efficiency has always required extensive tuning of the kernel on a particular platform of interest. The success or failure of an optimization is usually measured by invoking a timer. Understanding how to build reliable…
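One common defensive pattern consistent with the concerns above is to repeat the kernel invocation and keep the best wall-clock time, damping interference from other processes. This is a generic sketch under that assumption, not the specific timing methodology the paper develops.

```python
import time

def time_kernel(kernel, repeats=5):
    # Repeat the invocation and report the minimum elapsed time; a
    # single timer reading can be skewed by OS noise or cold caches.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        best = min(best, time.perf_counter() - start)
    return best
```

An optimization is then judged by comparing the best-of-N times of the original and transformed kernels, rather than a single noisy measurement.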

- Qing Yi, R. Clint Whaley
- LCSD '07
- 2007

The performance of many scientific applications depends on a small number of key computational kernels which require a level of efficiency rarely satisfied by existing native compilers. We present a new approach to high performance kernel optimization, where a general-purpose transformation engine automates the production of highly efficient library…