We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al., ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, …
This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multi-core systems. We consider single- and double-precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP …
We describe our recently released software package Block Locally Optimal Preconditioned Eigenvalue Xolvers (BLOPEX). BLOPEX is available as a stand-alone serial library, as an external package to PETSc (“Portable, Extensible Toolkit for Scientific Computation”, a general purpose suite of tools for the scalable solution of partial differential …
We present a fast, petaflop-scalable algorithm for Stokesian particulate flows. Our goal is the direct simulation of blood, which we model as a mixture of a Stokesian fluid (plasma) and red blood cells (RBCs). Directly simulating blood is a challenging multiscale, multiphysics problem. We report simulations with up to 200 million deformable RBCs. The …
Rahul S. Sampath, Santi S. Adavani, Hari Sundar, Ilya Lashuk, and George Biros (University of Pennsylvania). Abstract: In this article, we present Dendro, a suite of parallel algorithms for the discretization and solution of partial differential equations that require discretization of second-order elliptic operators. Dendro uses trilinear finite element …
We analyze the conjugate gradient (CG) method with variable preconditioning for solving a linear system with a real symmetric positive definite (SPD) matrix of coefficients A. We assume that the preconditioner is SPD on each step, and that the condition number of the preconditioned system matrix is bounded above by a constant independent of the step number. …
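To make the setting of the abstract above concrete, the following is a minimal sketch of preconditioned CG in which the preconditioner is allowed to change from step to step. It is an illustration under stated assumptions, not the authors' analysis or code: the Polak-Ribière form of the update coefficient `beta` is assumed here because it is a common choice when the preconditioner varies, and the names `flexible_pcg`, `apply_prec`, and `scaled_jacobi` are hypothetical. Plain Python lists are used so that the example has no dependencies.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(A: Matrix, x: Vector) -> Vector:
    """Dense matrix-vector product A @ x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x: Vector, y: Vector) -> float:
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(x, y))


def flexible_pcg(
    A: Matrix,
    b: Vector,
    apply_prec: Callable[[Vector, int], Vector],
    tol: float = 1e-10,
    max_iter: int = 200,
) -> Tuple[Vector, int]:
    """Solve A x = b for SPD A; apply_prec(r, k) may differ at every step k.

    Assumption: each apply_prec(., k) acts as an SPD operator.
    Returns the approximate solution and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the zero initial guess
    z = apply_prec(r, 0)             # preconditioned residual
    p = list(z)                      # first search direction
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)      # exact line search along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r_new, r_new) ** 0.5 <= tol * b_norm:
            return x, k
        z_new = apply_prec(r_new, k)
        # Polak-Ribiere beta: keeps the new direction A-orthogonal to the
        # previous one even though the preconditioner has changed.
        diff = [rn - ro for rn, ro in zip(r_new, r)]
        beta = dot(z_new, diff) / rz
        rz = dot(r_new, z_new)
        p = [zi + beta * pi for zi, pi in zip(z_new, p)]
        r = r_new
    return x, max_iter


def scaled_jacobi(A: Matrix) -> Callable[[Vector, int], Vector]:
    """Hypothetical variable preconditioner: Jacobi scaling whose weight
    alternates with the step number, so it is SPD but not fixed."""
    diag = [A[i][i] for i in range(len(A))]

    def apply(r: Vector, k: int) -> Vector:
        weight = 1.0 + 0.5 * (k % 2)
        return [weight * ri / di for ri, di in zip(r, diag)]

    return apply
```

A call such as `flexible_pcg(A, b, scaled_jacobi(A))` on a small SPD matrix returns the approximate solution together with the iteration count. With a fixed preconditioner the Polak-Ribière coefficient reduces to the usual CG coefficient in exact arithmetic, which is why it is a natural default for this sketch.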
We present preliminary results of an ongoing project to develop codes of the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method for symmetric eigenvalue problems for the hypre and PETSc software packages. hypre and PETSc provide high-quality domain decomposition and multigrid preconditioning for parallel computers. Our LOBPCG implementation …