Scalable fast multipole methods on distributed heterogeneous architectures

Authors:
Qi Hu; Nail A. Gumerov; Ramani Duraiswami
Affiliations:
University of Maryland, College Park; University of Maryland, College Park; University of Maryland, College Park
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

We fundamentally reconsider the implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture, consisting of multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Observing that the local summation and the analysis-based translation parts of the FMM are independent, we map them to the GPUs and CPUs, respectively. We analyze the FMM carefully to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single-node version in which the CPU part is parallelized using OpenMP and the GPU part using CUDA. New parallel algorithms for creating FMM data structures are presented, together with load-balancing strategies for the single-node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.
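
The independence of the two parts mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 1/r kernel, the uniform grid over the unit cube, and the names `box_index`, `are_neighbors`, `partial_sum`, `split_potential`, and `direct_potential` are all assumptions made for illustration. The far-field part is summed directly here, whereas an actual FMM would replace it with multipole and local expansion translations. Two threads stand in for the GPU (local summation) and the CPU (far-field part), and the combined result should match the plain direct sum up to rounding.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def box_index(point, boxes_per_side):
    """Map a point in the unit cube to integer box coordinates (hypothetical helper)."""
    return tuple(min(int(c * boxes_per_side), boxes_per_side - 1) for c in point)


def are_neighbors(box_a, box_b):
    """Two boxes are neighbors if they are identical or share a face, edge, or corner."""
    return all(abs(i - j) <= 1 for i, j in zip(box_a, box_b))


def partial_sum(points, charges, boxes, want_near):
    """Sum q_j / |x_i - x_j| over either the near-field or the far-field sources."""
    phi = [0.0] * len(points)
    for i, (p_i, b_i) in enumerate(zip(points, boxes)):
        for j, (p_j, b_j) in enumerate(zip(points, boxes)):
            if i == j or are_neighbors(b_i, b_j) != want_near:
                continue
            phi[i] += charges[j] / math.dist(p_i, p_j)
    return phi


def split_potential(points, charges, boxes_per_side=4):
    """Compute near and far contributions as two independent tasks, then add them."""
    boxes = [box_index(p, boxes_per_side) for p in points]
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Assumption: one task stands in for the GPU, the other for the CPU.
        near = pool.submit(partial_sum, points, charges, boxes, True)
        far = pool.submit(partial_sum, points, charges, boxes, False)
        return [a + b for a, b in zip(near.result(), far.result())]


def direct_potential(points, charges):
    """Reference O(N^2) sum over all pairs, with no spatial decomposition."""
    phi = [0.0] * len(points)
    for i, p_i in enumerate(points):
        for j, p_j in enumerate(points):
            if i != j:
                phi[i] += charges[j] / math.dist(p_i, p_j)
    return phi
```

Because neither task reads the other's output, the two can run concurrently and only need to be combined at the end, which is the property the abstract relies on when assigning the parts to different processors. The sketch assumes no two particles coincide, since the 1/r kernel is singular at zero distance.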