As an entry for the 2009 Gordon Bell price/performance prize, we present the results of two different hierarchical N-body simulations on a cluster of 256 graphics processing units (GPUs). Unlike many previous N-body simulations on GPUs that scale as O(N^2), the present method calculates the O(N log N) …
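For context, the sketch below shows the brute-force O(N^2) evaluation that such hierarchical methods replace. It is illustrative Python, not the authors' GPU code; the function name direct_potential and the softening parameter eps are our own choices for the example.

import numpy as np

def direct_potential(pos, mass, eps=1e-4):
    # All-pairs gravitational potential at every particle; cost grows as N^2.
    n = len(mass)
    phi = np.zeros(n)
    for i in range(n):
        d = pos - pos[i]                              # displacement to every source
        r = np.sqrt((d * d).sum(axis=1) + eps * eps)  # softened distance
        r[i] = np.inf                                 # exclude self-interaction
        phi[i] = -(mass / r).sum()
    return phi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.random((1000, 3))
    mass = np.full(1000, 1.0 / 1000)
    print(direct_potential(pos, mass)[:3])

Hierarchical methods avoid this quadratic loop by grouping distant sources and approximating their combined effect, which is what reduces the cost to O(N log N) or O(N).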
Fast multipole methods have O(N) complexity, are compute bound, and require very little synchronization, which makes them favorable algorithms for next-generation supercomputers. Their most common application is to accelerate N-body problems, but they can also be used to solve boundary integral equations. When the particle distribution is irregular and the …
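As a hedged illustration of the core idea behind multipole acceleration (not the paper's implementation), the sketch below compares an exact pairwise sum against the lowest-order (monopole) approximation for a well-separated source cluster; a real FMM carries higher-order expansion terms and translates them through a tree.

import numpy as np

def exact_potential(target, src_pos, src_q):
    # Direct pairwise sum over every source point.
    r = np.linalg.norm(src_pos - target, axis=1)
    return (src_q / r).sum()

def monopole_potential(target, src_pos, src_q):
    # Lowest-order multipole term: total charge placed at the weighted center.
    center = np.average(src_pos, axis=0, weights=src_q)
    return src_q.sum() / np.linalg.norm(target - center)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    src = rng.random((500, 3)) * 0.1       # tight cluster near the origin
    q = rng.random(500)                    # positive source strengths
    target = np.array([5.0, 0.0, 0.0])     # well-separated evaluation point
    print(exact_potential(target, src, q), monopole_potential(target, src, q))

The single monopole evaluation replaces 500 pairwise interactions, which is why the work per target stops growing with the number of distant sources.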
We have developed a parallel algorithm for radial basis function (RBF) interpolation that exhibits O(N) complexity, requires O(N) storage, and scales excellently up to a thousand processes. The algorithm uses a GMRES iterative solver with a restricted additive Schwarz method (RASM) as a preconditioner and a fast matrix-vector algorithm. Previous fast RBF …
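The sketch below shows the shape of the linear system involved, assuming a multiquadric basis and SciPy's GMRES; it forms the kernel matrix densely and omits the RASM preconditioner and fast matrix-vector product that the paper's algorithm relies on for scalability.

import numpy as np
from scipy.sparse.linalg import gmres
from scipy.spatial.distance import cdist

def rbf_interpolate(centers, values, query, c=0.5):
    # Multiquadric RBF interpolation: solve A w = f with GMRES, then evaluate.
    phi = lambda r: np.sqrt(r * r + c * c)   # multiquadric basis function
    A = phi(cdist(centers, centers))         # dense N x N interpolation matrix
    w, info = gmres(A, values, atol=1e-10, restart=len(values))
    if info != 0:
        print("warning: GMRES reported", info)
    return phi(cdist(query, centers)) @ w

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    x = rng.random((200, 2))
    f = np.sin(4 * x[:, 0]) * np.cos(4 * x[:, 1])
    q = rng.random((5, 2))
    print(rbf_interpolate(x, f, q))
    print(np.sin(4 * q[:, 0]) * np.cos(4 * q[:, 1]))  # true values, for comparison

In the scalable version, the dense matrix is never formed: the matrix-vector product inside GMRES is evaluated with a fast summation method, and RASM supplies the preconditioner.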
Extracting maximum performance from multi-core architectures is a difficult task, primarily due to bandwidth limitations of the memory subsystem and its complex hierarchy. In this work, we study the implications of fork-join and data-driven execution models on this type of architecture at the level of task parallelism. For this purpose, we use a highly …
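A minimal illustration of the fork-join model, using Python threads as a stand-in for a task runtime (the function block_work is a placeholder introduced here): a batch of independent tasks is spawned and the caller waits at a barrier, whereas a data-driven runtime would release each task as soon as its inputs are ready instead of waiting at a global join.

from concurrent.futures import ThreadPoolExecutor

def block_work(block_id):
    # Placeholder for the work done on one tile/block of a larger computation.
    return sum(i * i for i in range(10_000)) + block_id

def fork_join(num_blocks=8):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(block_work, b) for b in range(num_blocks)]  # fork phase
        return [f.result() for f in futures]                               # join barrier

if __name__ == "__main__":
    print(fork_join())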
We present teraflop-scale calculations of biomolecular electrostatics enabled by the combination of algorithmic and hardware acceleration. The algorithmic acceleration is achieved with the fast multipole method (FMM) in conjunction with a boundary element method (BEM) formulation of the continuum electrostatic model, as well as the BIBEE approximation to …
Algorithms designed to efficiently solve the classical N-body problem of mechanics fit well on GPU hardware and exhibit excellent scalability on many GPUs. Their computational intensity makes them a promising approach for other applications amenable to an N-body formulation. Adding features such as autotuning makes multipole-type algorithms ideal for …
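A hedged sketch of the autotuning idea, unrelated to any specific code from the paper: time a kernel over a few candidate values of a tunable parameter (here, the chunk size of a blocked all-pairs evaluation) and keep the fastest. Production autotuners for multipole-type codes tune quantities such as particles per leaf, expansion order, or GPU block size.

import time
import numpy as np

def chunked_potential(pos, mass, chunk):
    # All-pairs (softened) potential computed in row blocks of size `chunk`.
    phi = np.zeros(len(mass))
    for s in range(0, len(mass), chunk):
        d = pos[s:s + chunk, None, :] - pos[None, :, :]
        r = np.sqrt((d * d).sum(axis=2) + 1e-8)
        phi[s:s + chunk] = -(mass / r).sum(axis=1)
    return phi

def autotune(pos, mass, candidates=(64, 256, 1024)):
    # Time each candidate chunk size once and return the fastest.
    timings = {}
    for chunk in candidates:
        t0 = time.perf_counter()
        chunked_potential(pos, mass, chunk)
        timings[chunk] = time.perf_counter() - t0
    return min(timings, key=timings.get)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    pos, mass = rng.random((4000, 3)), np.ones(4000)
    print("best chunk size:", autotune(pos, mass))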
Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our recent work showed scaling of an FMM on GPU clusters, with problem sizes on the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of …
We present a 0.5 Petaflop/s calculation of homogeneous isotropic turbulence in a cube of 2048^3 particles, using a highly parallel fast multipole method (FMM) on 2048 GPUs of the TSUBAME 2.0 system. We compare this particle-based code with a spectral DNS code under the same calculation conditions on the same machine. The results of our particle-based …