Learn More
Parallelizing dense matrix computations for distributed-memory architectures is a well-studied subject, generally considered to be among the best-understood domains of parallel computing. Two packages developed in the mid 1990s still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may very well take the shape …
The detritivorous fish gizzard shad (Dorosoma cepedianum) provides nutrients to phytoplankton in reservoirs by ingesting organic detritus associated with sediments and excreting substantial quantities of nutrients such as N and P in soluble forms that are highly available to phytoplankton. We estimated nutrient excretion by gizzard shad in a eutrophic …
The industry-standard Message Passing Interface (MPI) provides one-sided communication functionality and is available on virtually every parallel computing system. However, it is believed that MPI's one-sided model is not rich enough to support higher-level global address space parallel programming models. We present the first successful application of MPI …
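To make the one-sided model concrete, here is a minimal sketch of MPI one-sided communication in which rank 0 writes into a buffer exposed by rank 1 without rank 1 posting a receive. The buffer layout and the value written are illustrative, not taken from the paper.

```c
/* Minimal MPI one-sided sketch: rank 0 puts a value into rank 1's window.
 * Run with at least two ranks, e.g. mpiexec -n 2 ./a.out. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int buf = rank;                     /* memory exposed for remote access */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);              /* open an access epoch */
    if (rank == 0 && size > 1) {
        int value = 42;
        /* write 'value' into rank 1's window; rank 1 takes no explicit part */
        MPI_Put(&value, 1, MPI_INT, 1 /* target */, 0 /* displacement */,
                1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);              /* close the epoch; puts are complete */

    if (rank == 1)
        printf("rank 1 buffer now holds %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```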
BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM). While GEMM was previously implemented as three loops around an inner kernel, BLIS exposes two additional loops within that inner kernel, casting the computation in terms of the BLIS micro-kernel …
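As a structural illustration of that loop nesting, the sketch below writes GEMM (C += A·B) as the three GotoBLAS blocking loops plus the two additional loops BLIS exposes around a micro-kernel. The block sizes, row-major layout, plain-C micro-kernel, and the assumption that the problem sizes divide the block sizes evenly are simplifications; real BLIS also packs A and B into contiguous buffers, which is omitted here.

```c
/* Structural sketch of the BLIS five-loop GEMM; not a tuned implementation. */
#include <stdio.h>
#include <stdlib.h>

#define NC 64   /* illustrative block sizes; real values are tuned per arch */
#define KC 64
#define MC 32
#define MR 4
#define NR 4

/* Row-major: C is m x n, A is m x k, B is k x n; m, n, k must divide evenly. */
void gemm_blis_sketch(int m, int n, int k,
                      const double *A, const double *B, double *C) {
    for (int jc = 0; jc < n; jc += NC)            /* loop 5: NC columns of C */
      for (int pc = 0; pc < k; pc += KC)          /* loop 4: KC panel of k   */
        for (int ic = 0; ic < m; ic += MC)        /* loop 3: MC rows of C    */
          /* loops 2 and 1: the two loops BLIS exposes inside what GotoBLAS
             treated as a monolithic inner kernel */
          for (int jr = jc; jr < jc + NC; jr += NR)
            for (int ir = ic; ir < ic + MC; ir += MR)
              /* micro-kernel: update an MR x NR block of C with KC rank-1s */
              for (int p = pc; p < pc + KC; ++p)
                for (int i = ir; i < ir + MR; ++i)
                  for (int j = jr; j < jr + NR; ++j)
                    C[i*n + j] += A[i*k + p] * B[p*n + j];
}

int main(void) {
    enum { M = 64, N = 64, K = 64 };    /* multiples of the block sizes */
    double *A = malloc(M*K * sizeof *A), *B = malloc(K*N * sizeof *B);
    double *C = calloc(M*N, sizeof *C), *R = calloc(M*N, sizeof *R);
    for (int i = 0; i < M*K; ++i) A[i] = (double)rand() / RAND_MAX;
    for (int i = 0; i < K*N; ++i) B[i] = (double)rand() / RAND_MAX;

    gemm_blis_sketch(M, N, K, A, B, C);

    for (int i = 0; i < M; ++i)         /* naive reference for checking */
      for (int p = 0; p < K; ++p)
        for (int j = 0; j < N; ++j)
          R[i*N + j] += A[i*K + p] * B[p*N + j];

    double maxdiff = 0.0;
    for (int i = 0; i < M*N; ++i) {
        double d = C[i] > R[i] ? C[i] - R[i] : R[i] - C[i];
        if (d > maxdiff) maxdiff = d;
    }
    printf("max |C - Ref| = %g\n", maxdiff);
    free(A); free(B); free(C); free(R);
    return 0;
}
```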
MPI communicators abstract communication operations across application modules, facilitating seamless composition of different libraries. In addition, communicators provide the ability to form groups of processes and establish multiple levels of parallelism. Traditionally, communicators have been collectively created in the context of the parent …
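As a reminder of what that traditional, collective creation looks like, the following sketch splits MPI_COMM_WORLD into two-rank subcommunicators with MPI_Comm_split; every rank of the parent communicator must participate in the call. The grouping rule (rank / 2) is illustrative.

```c
/* Collective communicator creation over the parent: all ranks must call. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int color = world_rank / 2;         /* subgroups {0,1}, {2,3}, ... */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub, &sub_rank);
    MPI_Comm_size(sub, &sub_size);
    printf("world rank %d/%d -> subgroup %d, rank %d/%d\n",
           world_rank, world_size, color, sub_rank, sub_size);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}
```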
Developing effective yet scalable load-balancing methods for irregular computations is critical to the successful application of simulations in a variety of disciplines at petascale and beyond. This paper explores a set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code for …
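One common dynamic-scheduling strategy for such irregular workloads, similar in spirit to the NXTVAL shared counter traditionally used in NWChem, is a global task counter that each process atomically increments to claim the next block of work. The sketch below implements such a counter with MPI one-sided MPI_Fetch_and_op; the task count and the work body are placeholders.

```c
/* Dynamic load balancing via a global shared counter (MPI one-sided). */
#include <mpi.h>
#include <stdio.h>

#define NUM_TASKS 100   /* illustrative number of unevenly sized tasks */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long counter = 0;                   /* counter lives on rank 0 only */
    MPI_Win win;
    MPI_Win_create(rank == 0 ? &counter : NULL,
                   rank == 0 ? (MPI_Aint)sizeof(long) : 0,
                   sizeof(long), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_lock_all(0, win);

    long one = 1, task;
    for (;;) {
        /* atomically claim the next task id from rank 0's counter */
        MPI_Fetch_and_op(&one, &task, MPI_LONG, 0, 0, MPI_SUM, win);
        MPI_Win_flush(0, win);
        if (task >= NUM_TASKS) break;
        /* ... perform task 'task' here, e.g. one block contraction ... */
        printf("rank %d claimed task %ld\n", rank, task);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Because each task id is handed out exactly once, faster processes simply claim more tasks, which is what makes the scheme attractive for block-sparse contractions with highly variable block costs.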
The use of global address space languages and one-sided communication for complex applications is gaining attention in the parallel computing community. However, the lack of good evaluative methods for observing multiple levels of performance makes it difficult to isolate the cause of performance deficiencies and to understand the fundamental limitations …