Jack Poulson

Parallelizing dense matrix computations on distributed-memory architectures is a well-studied subject and generally considered to be among the best-understood domains of parallel computing. Two packages, developed in the mid-1990s, still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may very well take the shape …
To implement dense linear algebra algorithms for distributed-memory computers, an expert applies knowledge of the domain, the target architecture, and how to parallelize common operations. This is often a rote process that becomes tedious for a large collection of algorithms. We have developed a way to encode this expert knowledge such that it can be …
In spite of an extensive literature on fast algorithms for synthetic aperture radar (SAR) imaging, it is not currently known whether it is possible to accurately form an image from N data points in provable near-linear time complexity. This paper seeks to close this gap by proposing an algorithm which runs with complexity O(N log N log(1/ε)) without making the …
We describe an extension of the Scalable Universal Matrix Multiplication Algorithm (SUMMA) from 2D to 3D process grids; the underlying idea is to lower the communication volume by storing redundant copies of one or more matrices. While SUMMA was originally introduced for block-wise matrix distributions, so that most of its communication was within …
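As a rough illustration of the structure SUMMA builds on, the sketch below (plain serial NumPy, not code from the paper) accumulates C = A·B as a sum of rank-`panel` updates; in the distributed 2D algorithm, each such update corresponds to broadcasting a column panel of A within process rows and a row panel of B within process columns, and the 3D extension described above additionally spreads work and storage across a third grid dimension. The function name and panel size here are purely illustrative.

```python
import numpy as np

def summa_like_multiply(A, B, panel=64):
    """Accumulate C = A @ B as a sum of block outer products.

    In distributed SUMMA, each iteration broadcasts a column panel of A
    within process rows and a row panel of B within process columns; the
    same loop structure is shown here serially (illustrative sketch only).
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for p in range(0, k, panel):
        # One "step" of SUMMA: a rank-`panel` update from the current panels.
        C += A[:, p:p + panel] @ B[p:p + panel, :]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 200))
B = rng.standard_normal((200, 250))
assert np.allclose(summa_like_multiply(A, B), A @ B)
```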
A parallelization of a sweeping preconditioner for 3D Helmholtz equations without large cavities is introduced and benchmarked for several challenging velocity models. The setup and application costs of the sequential preconditioner are shown to be O(γN) and O(γN log N), where γ(ω) denotes the modestly frequency-dependent number of grid points per Perfectly Matched Layer …
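The mechanics of applying such a factorization as a preconditioner inside a Krylov method can be sketched with SciPy; everything below (the 1D Helmholtz-like model problem, its grid size and wavenumber, and the use of an exact sparse LU as a stand-in for the approximate sweeping factorization) is an invented toy and not the paper's solver.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Tiny 1D Helmholtz-like model problem, -u'' - k^2 u = f (illustrative only;
# the paper treats 3D heterogeneous velocity models).
n, k = 500, 60.0
h = 1.0 / (n + 1)
A = (sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2
     - k**2 * sp.eye(n)).tocsc()
b = np.zeros(n)
b[n // 2] = 1.0 / h  # point source

# Wrap a factorization as a preconditioner M ~ A^{-1}.  Here an exact sparse LU
# stands in for the approximate sweeping factorization described above.
lu = spla.splu(A)
M = spla.LinearOperator(A.shape, matvec=lu.solve)

x, info = spla.gmres(A, b, M=M)
print("converged" if info == 0 else f"gmres returned info={info}")
```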
The butterfly algorithm is a fast algorithm which approximately evaluates a discrete analogue of the integral transform ∫_{ℝ^d} K(x, y) g(y) dy at large numbers of target points when the kernel, K(x, y), is approximately low-rank when restricted to subdomains satisfying a certain simple geometric condition. In d dimensions with O(N) source and target points, …
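The property the butterfly algorithm exploits, that K(x, y) restricted to an admissible pair of subdomains is numerically low-rank so the restricted sum Σ_y K(x, y) g(y) can be applied cheaply, can be seen in a small NumPy sketch of that single building block (not the full butterfly hierarchy); the 1D kernel, point sets, and tolerance below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Well-separated 1D source and target boxes (illustrative choice).
targets = rng.uniform(0.0, 1.0, size=400)      # x in [0, 1]
sources = rng.uniform(10.0, 11.0, size=400)    # y in [10, 11]
g = rng.standard_normal(sources.size)

# A smooth kernel on separated boxes; the restricted matrix is numerically low-rank.
K = 1.0 / np.abs(targets[:, None] - sources[None, :])

# Truncated SVD gives a rank-r approximation K ~ U_r diag(s_r) V_r^T.
U, s, Vt = np.linalg.svd(K, full_matrices=False)
r = int(np.searchsorted(-s, -1e-10 * s[0]))    # rank needed for ~1e-10 relative accuracy

f_exact = K @ g
f_lowrank = U[:, :r] @ (s[:r] * (Vt[:r] @ g))  # apply in O((M + N) r) instead of O(M N)

print(f"rank used: {r}")
print(f"relative error: {np.linalg.norm(f_lowrank - f_exact) / np.linalg.norm(f_exact):.2e}")
```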
We present a parallel preconditioning method for the iterative solution of the time-harmonic elastic wave equation which makes use of higher-order spectral elements to reduce pollution error. In particular, the method leverages perfectly matched layer boundary conditions to efficiently approximate the Schur complement matrices of a block LDL factorization.
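To fix the notation only, here is a tiny dense NumPy sketch of the Schur complement S = A22 − A21·A11⁻¹·A12 arising from a 2×2 block LDL-style elimination, together with the corresponding block solve; the random symmetric test matrix and block sizes are invented and unrelated to the spectral-element discretization in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random symmetric indefinite test matrix, partitioned into 2x2 blocks.
n1, n2 = 60, 40
M = rng.standard_normal((n1 + n2, n1 + n2))
M = M + M.T
A11, A12 = M[:n1, :n1], M[:n1, n1:]
A21, A22 = M[n1:, :n1], M[n1:, n1:]

# Schur complement of A11:  S = A22 - A21 A11^{-1} A12.
S = A22 - A21 @ np.linalg.solve(A11, A12)

# Block solve of M z = b: forward solve with A11, Schur-complement solve for
# the second block, then back-substitution for the first block.
b = rng.standard_normal(n1 + n2)
b1, b2 = b[:n1], b[n1:]
y1 = np.linalg.solve(A11, b1)
z2 = np.linalg.solve(S, b2 - A21 @ y1)
z1 = y1 - np.linalg.solve(A11, A12 @ z2)

assert np.allclose(np.concatenate([z1, z2]), np.linalg.solve(M, b))
```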
A message-passing, distributed-memory parallel computer on a chip is one possible design for future many-core architectures. We discuss initial experiences with the Intel Single-chip Cloud Computer research processor, a prototype architecture that incorporates 48 cores on a single die which can communicate via a small, shared, on-die buffer. The …