Hierarchical N-body Simulations with Autotuning for Heterogeneous Systems

  • Rio Yokota, Lorena A. Barba
  • Published 29 August 2011
  • Computer Science
  • Computing in Science & Engineering
Algorithms designed to efficiently solve the classical N-body problem of mechanics fit well on GPU hardware and exhibit excellent scalability on many GPUs. Their computational intensity makes them a promising approach for other applications amenable to an N-body formulation. Adding features such as autotuning makes multipole-type algorithms ideal for heterogeneous computing environments. 

High performance CPU/GPU multiresolution Poisson solver
The algorithmic improvements together with software optimization techniques result in 80% and 97% of the upper bound performance for the CPU and GPU parts, respectively, on a single Cray XK7 compute node.
Scaling fast multipole methods up to 4000 GPUs
A 1 PFlop/s calculation of isotropic turbulence with 64 billion vortex particles using 4096 GPUs on the TSUBAME 2.0 system is presented.
Astrophysical Particle Simulations on Heterogeneous CPU-GPU Systems
This paper proposes an optimal task split between CPU and GPU in which the GPU is used only to calculate the particle forces, and describes optimization techniques such as control of the force accuracy, vectorized tree walk, and work partitioning among multiple GPUs.
GPU Accelerated Fast Multipole Methods for Dynamic N-body Simulation
This paper provides efficient data structures implemented on Graphics Processing Units (GPUs) and a novel parallel formulation of the FMM on GPUs to address the so-called N-body problem.
Petascale turbulence simulation using a highly parallel fast multipole method on GPUs
Data‐driven execution of fast multipole methods
The authors discuss another approach, based on data‐driven execution, to efficiently tackle this challenging load balancing problem; it consists of breaking the most time‐consuming stages of the FMMs into smaller tasks.
An FMM Based on Dual Tree Traversal for Many-Core Architectures
The present work attempts to integrate the independent efforts in the fast N-body community to create the fastest N-body library for many-core and heterogeneous architectures. Focus is placed on low
An (almost) direct deployment of the Fast Multipole Method on the Cell processor
This paper presents the first deployment of the Fast Multipole Method on the Cell processor (PowerXCell 8i) in single and double precisions, which scales linearly on several Cell blades and which is able to handle both uniform and non-uniform distributions of particles.


A special-purpose computer for gravitational many-body problems
A processor has been constructed using a 'pipeline' architecture to simulate many-body systems with long-range forces and can be adapted to study molecular dynamics, plasma dynamics and astrophysical hydrodynamics with only minor modifications.
42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence
The present method calculates the O(N log N) treecode and O(N) fast multipole method (FMM) on the GPUs with unprecedented efficiency and demonstrates the performance of the method by choosing one standard application (a gravitational N-body simulation) and one non-standard application (simulation of turbulence using vortex particles).
Astrophysical N-body simulations using hierarchical tree data structures
The authors report on recent large astrophysical N-body simulations executed on the Intel Touchstone Delta system. They review the astrophysical motivation and the numerical techniques and discuss
A Hierarchical O(N) Force Calculation Algorithm
A novel code for the approximate computation of long-range forces between N mutually interacting bodies is presented. The code is based on a hierarchical tree of cubic cells and features mutual
Bottom-Up Construction and 2:1 Balance Refinement of Linear Octrees in Parallel
New parallel algorithms for the construction and 2:1 balance refinement of large linear octrees on distributed memory machines, used in many problems in computational science and engineering, are proposed.
The Chamomile Scheme: An Optimized Algorithm for N-body simulations on Programmable Graphics Processing Units
An algorithm named "Chamomile Scheme" is presented, fully optimized for calculating gravitational interactions on the latest programmable Graphics Processing Unit (GPU), the NVIDIA GeForce 8800 GTX, which has small but fast shared memories and floating-point arithmetic hardware that supports only single precision.